The shape of structured data

Extracting the structure of a data file separately from the data it contains is a powerful tool for data analysis, data manipulation, and data migration. The process of extracting the structure of structured data is also known as “scaffolding” or “skeletonizing” the data.

In this technical blog post, we will discuss how to extract the structure of structured data from JSON and YAML data files using the command line tools jq and yq.

First, let’s start with JSON data files. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON is a text format that is completely language-independent but uses conventions that are familiar to programmers of the C family of languages, including Go, C, C++, C#, Java, JavaScript, Perl, Python, and many others.

To extract the structure of a JSON data file, we can use the command line tool jq. jq is a command-line tool for working with JSON data. It can be used to extract specific values, filter and transform documents, and query data using its own concise filter language.

Sometimes as developers it can be useful to get an understanding of the shape of the data within a structured data file like JSON or YAML. To extract only the structure of a JSON data file, we can use a jq command like the following:

jq 'walk(
  if type == "number" then 0
  elif type == "string" then ""
  elif type == "boolean" then false
  elif type == "null" then null
  elif type == "array" then
    reduce .[] as $item (null; . + $item)
  else . end
)' input.json

This command uses the `walk` function to visit every element in the JSON document, and an `if`/`elif` chain to check the type of each element: numbers are replaced with 0, strings with the empty string, booleans with false, and null stays null. When the element is an array, `reduce` merges all of its (already processed) elements into a single representative element.
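For readers who prefer to trace the logic in a general-purpose language, here is a rough Python sketch of the same transformation. The `skeletonize` and `merge` helpers are names invented for this illustration, and `merge` only approximates jq's `+` operator for the placeholder values a skeleton can contain:

```python
import json
from functools import reduce

def merge(acc, item):
    """Approximate jq's `+` operator for the values a skeleton can contain."""
    if acc is None:
        return item
    if isinstance(acc, dict) and isinstance(item, dict):
        return {**acc, **item}  # object + object merges keys, as in jq
    return item  # for placeholder scalars, keeping the last element is equivalent

def skeletonize(value):
    """Replace every scalar with a type placeholder, bottom-up like jq's walk."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return False
    if isinstance(value, (int, float)):
        return 0
    if isinstance(value, str):
        return ""
    if isinstance(value, dict):
        return {key: skeletonize(item) for key, item in value.items()}
    if isinstance(value, list):
        # Collapse the (already skeletonized) elements into one, like jq's reduce.
        return reduce(merge, (skeletonize(item) for item in value), None)
    return value  # None stays None

data = json.loads('{"name": "Ada", "age": 36, "tags": ["a", "b"], "active": true}')
print(json.dumps(skeletonize(data)))
# prints {"name": "", "age": 0, "tags": "", "active": false}
```

Note how the two-element `tags` array is collapsed into a single empty string, just as the jq `reduce` would do.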

It’s important to note that the above command does not preserve the arrays of the original JSON file (each array is collapsed into a single merged element), but it provides a concise overview of the structure of the JSON file.

Now, let’s move on to YAML data files. YAML (YAML Ain’t Markup Language) is a human-readable data serialization format that is easy for humans to read and write and easy for machines to parse and generate. YAML is often used for configuration files, data exchange, and data serialization. To extract the structure of a YAML data file, we can use the command line tool yq in combination with jq. yq is a command-line tool for working with YAML data; similar to jq, it can be used to extract specific values, filter and transform documents, and query data using a jq-like expression language.

The following command can be used to extract the structure of a YAML data file:

yq eval -o=json input.yaml | jq 'walk(
  if type == "number" then 0
  elif type == "string" then ""
  elif type == "boolean" then false
  elif type == "null" then null
  elif type == "array" then
    reduce .[] as $item (null; . + $item)
  else . end
)'

This command first uses yq to convert the YAML data file to JSON format, and then pipes the output to jq. The jq command then iterates through all the elements in the JSON data, using the same logic as in the JSON example.

It’s important to note that `eval` is yq’s default subcommand rather than a flag: it applies a yq expression to the input file, and when no expression is given it simply loads the YAML document and re-emits it, here as JSON because of the `-o=json` output flag.
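As a concrete illustration, consider a small hypothetical input.yaml (the field names here are invented for the example):

```yaml
# hypothetical input.yaml
server:
  host: example.com
  port: 8080
  tls: true
replicas:
  - name: a
    weight: 1
  - name: b
    weight: 2
```

Running the pipeline above on this file should produce `{"server": {"host": "", "port": 0, "tls": false}, "replicas": {"name": "", "weight": 0}}`: the scalars become type placeholders, and the two-element `replicas` array is merged into a single object.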

In conclusion, extracting the structure of JSON and YAML data files is a powerful technique for data analysis, data manipulation, and data migration. The command line tools jq and yq, in combination with each other, can be used to extract the structure of JSON and YAML data files. Keep in mind that the commands above collapse arrays rather than preserving every element, but they provide a useful overview of the structure of the files.

In this section, we will walk through the process of downloading data from OpenWeatherMap’s Forecast API and processing it to produce its skeletonized data.

To download data from the Forecast API, we will again need an API key from OpenWeatherMap. Once we have the API key, we can use it to make a GET request to the API, passing in the latitude, longitude, and the API key as query parameters.

The following command is an example of how to use HTTPie to download the 5 day/3 hour forecast data for a location with latitude 43.6529613 and longitude -79.3837338:

http GET "https://api.openweathermap.org/data/2.5/forecast?lat=43.6529613&lon=-79.3837338&appid={YOUR_API_KEY}" 

This command makes a GET request to the OpenWeatherMap Forecast API, passing the query parameters lat=43.6529613, lon=-79.3837338 to specify the location and appid parameter with your API key to authenticate the request. The API will respond with a JSON object containing the 5 day/3 hour forecast data for the specified location.

Once we have the JSON data, we can use jq to extract its structure. The following command can be used to extract the structure of the JSON data:

http GET "https://api.openweathermap.org/data/2.5/forecast?lat=43.6529613&lon=-79.3837338&appid={YOUR_API_KEY}" | \
jq 'walk(
  if type == "number" then 0
  elif type == "string" then ""
  elif type == "boolean" then false
  elif type == "null" then null
  elif type == "array" then
    reduce .[] as $item (null; . + $item)
  else . end
)'

This command first makes the GET request to the OpenWeatherMap Forecast API and pipes the response to jq. The jq filter then visits every element in the JSON data using the same logic as before, replacing numbers with 0, strings with empty strings, and booleans with false, while null stays null. When the element is an array, `reduce` merges all of its elements into one. The result looks like this:

{
  "cod": "",
  "message": 0,
  "cnt": 0,
  "list": {
    "dt": 0,
    "main": {
      "temp": 0,
      "feels_like": 0,
      "temp_min": 0,
      "temp_max": 0,
      "pressure": 0,
      "sea_level": 0,
      "grnd_level": 0,
      "humidity": 0,
      "temp_kf": 0
    },
    "weather": {
      "id": 0,
      "main": "",
      "description": "",
      "icon": ""
    },
    "clouds": {
      "all": 0
    },
    "wind": {
      "speed": 0,
      "deg": 0,
      "gust": 0
    },
    "visibility": 0,
    "pop": 0,
    "sys": {
      "pod": ""
    },
    "dt_txt": "",
    "snow": {
      "3h": 0
    },
    "rain": {
      "3h": 0
    }
  },
  "city": {
    "id": 0,
    "name": "",
    "coord": {
      "lat": 0,
      "lon": 0
    },
    "country": "",
    "population": 0,
    "timezone": 0,
    "sunrise": 0,
    "sunset": 0
  }
}

It’s important to note that the above command collapses the arrays of the original JSON response (for example, the `list` and `weather` arrays each become a single merged object), but it provides a clear overview of the structure of the JSON document. With this skeletonized data, we can easily identify the keys and nested keys of the JSON data and use them to extract specific values, filter and transform the data, or build further jq queries.
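One quick way to put the skeleton to work is to enumerate its key paths. The following Python sketch (the `key_paths` helper is an invented name for this illustration) walks a skeletonized document and yields a dotted path for every nested key:

```python
def key_paths(value, prefix=""):
    """Yield a dotted path for every key in a nested dict, depth-first."""
    if isinstance(value, dict):
        for key, item in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(item, path)

# A small excerpt of the skeleton produced above.
skeleton = {"cod": "", "cnt": 0, "list": {"dt": 0, "main": {"temp": 0}}}
print(sorted(key_paths(skeleton)))
# prints ['cnt', 'cod', 'list', 'list.dt', 'list.main', 'list.main.temp']
```

Each of these paths corresponds directly to a jq filter, so `list.main.temp` in the skeleton tells us we can query the original response with `jq '.list[].main.temp'`.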

It’s worth noting that the OpenWeatherMap API limits the number of requests you can make per minute, so it’s important to be mindful of this limit, especially when fetching data repeatedly for further processing.

In conclusion, using command-line tools like HTTPie and jq, we can easily download and process data from OpenWeatherMap’s Forecast API to extract its structure. This can be useful for data analysis, data manipulation, and data migration. It’s important to be mindful of the API’s limitations and follow best practices when working with sensitive information such as API keys.
