HPR4104: Introduction to jq - part 1

Released Thursday, 25th April 2024


Introduction

This is the start of a short series about the JSON data format, and how the command-line tool jq can be used to process such data. The plan is to make an open series to which others may contribute their own experiences using this tool.

The jq command is described on the GitHub page as follows:

jq is a lightweight and flexible command-line JSON processor

…and as:

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

The jq tool is controlled by a programming language (also referred to as jq), which is very powerful. This series will mainly deal with this.

JSON (JavaScript Object Notation)

To begin we will look at JSON itself. It is defined on the Wikipedia page thus:

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.

The syntax of JSON is defined by RFC 8259 and by ECMA-404. It is fairly simple in principle but has some complexity.

JSON’s basic data types are (edited from the Wikipedia page):

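Number: a signed decimal number that may contain a fractional part and may use exponential E notation, but cannot include non-numbers. (NOTE: Unlike what I said in the audio, there are two values representing non-numbers: 'nan' and infinity: 'infinity'. These are not part of the JSON grammar; RFC 8259 does not permit NaN or Infinity as numbers, though some tools accept them as extensions.)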
String: a sequence of zero or more Unicode characters. Strings are delimited with double quotation marks and support a backslash escaping syntax.
Boolean: either of the values true or false
Array: an ordered list of zero or more elements, each of which may be of any type. Arrays use square bracket notation with comma-separated elements.
Object: a collection of name–value pairs where the names (also called keys) are strings. Objects are delimited with curly brackets and use commas to separate each pair, while within each pair the colon ':' character separates the key or name from its value.
null: an empty value, using the word null
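
The delimiters described above are strict, and jq itself can be used to check whether a piece of text follows them. The following is only a sketch: the function name is_valid_json is made up for this example, and it relies on jq's empty filter (which produces no output) together with jq returning a non-zero exit status when it cannot parse its input.

```shell
# Hypothetical helper (not part of jq): succeeds if its argument parses as JSON
is_valid_json() {
    printf '%s' "$1" | jq empty >/dev/null 2>&1
}

# Strings delimited with double quotation marks are valid JSON
is_valid_json '{ "firstname": "John" }' && echo "valid"

# Single quotation marks are not accepted as string delimiters
is_valid_json "{ 'firstname': 'John' }" || echo "invalid"
```

Hiding jq's error message keeps the example quiet; removing the redirection shows where the parser gave up.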

Examples

These are the basic data types listed above (same order):

42
"HPR"
true
["Hacker","Public","Radio"]
{ "firstname": "John", "lastname": "Doe" }
null
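
As a quick check of the examples above, jq can report the type of each value it reads. This sketch assumes jq is already installed (see below) and uses its built-in type filter; the variable name types is just for this illustration.

```shell
# Feed the six example values to jq, one per line, and ask for the type of each
types=$(printf '%s\n' \
    '42' \
    '"HPR"' \
    'true' \
    '["Hacker","Public","Radio"]' \
    '{ "firstname": "John", "lastname": "Doe" }' \
    'null' | jq -r 'type')

# Prints one type name per line, in the same order as the input
echo "$types"
```

The -r option makes jq print the type names as plain text rather than as quoted JSON strings.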

jq

From the Wikipedia page:

jq was created by Stephen Dolan, and released in October 2012. It was described as being “like sed for JSON data”. Support for regular expressions was added in jq version 1.5.

Obtaining jq

This tool is available in most of the Linux repositories. For example, on Debian and Debian-based releases you can install it with:

sudo apt install jq
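
Once it is installed, a simple way to confirm that jq is on your path is to ask it for its version. The exact text printed depends on the version your distribution provides.

```shell
# Print the installed version of jq
jq --version
```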

See the download page for the definitive information about available versions.

Manual for jq

There is a detailed manual describing the use of the jq programming language that is used to filter JSON data. It can be found at https://jqlang.github.io/jq/manual/.

The HPR statistics page

This is a collection of statistics about HPR, in the form of JSON data. We will use this as a moderately detailed example in this episode.

A link to this page may be found on the HPR Calendar page close to the foot of the page under the heading Workflow. The link to the JSON statistics is https://hub.hackerpublicradio.org/stats.json.

If you click on this you should see the JSON data formatted for you by your browser. Different browsers represent this in different ways.

You can also collect and display this data from the command line, using jq of course:

$ curl -s https://hub.hackerpublicradio.org/stats.json | jq '.' | nl -w3 -s' '
  1 {
  2   "stats_generated": 1712785509,
  3   "age": {
  4     "start": "2005-09-19T00:00:00Z",
  5     "rename": "2007-12-31T00:00:00Z",
  6     "since_start": {
  7       "total_seconds": 585697507,
  8       "years": 18,
  9       "months": 6,
 10       "days": 28
 11     },
 12     "since_rename": {
 13       "total_seconds": 513726307,
 14       "years": 16,
 15       "months": 3,
 16       "days": 15
 17     }
 18   },
 19   "shows": {
 20     "total": 4626,
 21     "twat": 300,
 22     "hpr": 4326,
 23     "duration": 7462050,
 24     "human_duration": "0 Years, 2 months, 27 days, 8 hours, 47 minutes and 30 seconds"
 25   },
 26   "hosts": 356,
 27   "slot": {
 28     "next_free": 8,
 29     "no_media": 0
 30   },
 31   "workflow": {
 32     "UPLOADED_TO_IA": "2",
 33     "RESERVE_SHOW_SUBMITTED": "27"
 34   },
 35   "queue": {
 36     "number_future_hosts": 7,
 37     "number_future_shows": 28,
 38     "unprocessed_comments": 0,
 39     "submitted_shows": 0,
 40     "shows_in_workflow": 15,
 41     "reserve": 27
 42   }
 43 }

The curl utility is useful for collecting information from links like this. I have used the -s option to ensure it does not show information about the download process, since it does this by default. The output is piped to jq which displays the data in a “pretty printed” form by default, as you see. In this case I have given jq a minimal filter which causes what it receives to be printed. The filter is simply '.'. I have piped the formatted JSON through the nl command to get line numbers for reference.
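
To try the same pipeline without downloading anything, the sketch below runs the '.' filter on a small piece of the statistics held in a shell variable. The variable names (sample, pretty, compact) are just for this illustration. It also shows jq's -c option, which asks for compact single-line output in place of the pretty-printed form.

```shell
# A small sample taken from the statistics above, written on one line
sample='{"hosts":356,"slot":{"next_free":8,"no_media":0}}'

# The filter '.' passes the input through unchanged; jq pretty-prints it
pretty=$(printf '%s' "$sample" | jq '.')
printf '%s\n' "$pretty" | nl -w3 -s' '

# The -c option produces compact output on a single line instead
compact=$(printf '%s' "$sample" | jq -c '.')
printf '%s\n' "$compact"
```

The data is the same in both forms; only the layout differs.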

The JSON shown here consists of nested JSON objects. The opening brace on line 1 and the closing brace on line 43 define the whole thing as a single object.

Briefly, the object contains the following:

a number called stats_generated (line 2)
an object called age on lines 3-18; this object contains two strings and two objects
an object called shows on lines 19-25
a number called hosts on line 26
an object called slot on lines 27-30
an object called workflow on lines 31-34
an object called queue on lines 35-42
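
The list above can be checked with jq itself. This is only a sketch, using a hand-shortened copy of the statistics (the nested objects are cut down to one entry each) held in a variable named stats for this example. It uses jq's to_entries and type filters, with string interpolation, to print each top-level key and the type of its value.

```shell
# A hand-shortened copy of the statistics; nested objects are trimmed
stats='{"stats_generated":1712785509,"age":{"start":"2005-09-19T00:00:00Z"},"shows":{"total":4626},"hosts":356,"slot":{"next_free":8},"workflow":{"UPLOADED_TO_IA":"2"},"queue":{"reserve":27}}'

# For each top-level key, print its name and the type of its value
summary=$(printf '%s' "$stats" | jq -r 'to_entries[] | "\(.key): \(.value | type)"')
printf '%s\n' "$summary"
```

Replacing the printf with the curl command shown earlier should give the same summary for the live data, provided the page still has the same layout.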

We will look at ways to summarise and reformat such output in a later episode.