Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22465

PyBites: Case study: How to parse nested JSON

$
0
0

I was asked to help parse a JSON file that is delivered by the iTunes Store Customer Reviews API JSON endpoint. It is not so important how this API works or if there are better APIs for this. Instead, let’s assume that we found our favorite API to work with and that our request makes perfect sense and now we have to deal with the API’s response, JSON in this case. This article will guide you through the necessary steps to parse this JSON response into a pandas DataFrame. I will focus heavily on the concepts and code development and less on explaining each line of code. Ideally, you should be already familiar with at least a little Python and its standard data types, most importantly dictionaries.

First, I want to understand what I am dealing with and because the display of the JSON response is not so nice for the original URL, I use a JSON pretify tool like http://jsonprettify.com/.

This will give me the following reformatted JSON response

{
    "feed": {
        "author": {
            "name": {
                "label": "iTunes Store"
            },
            "uri": {
                "label": "http://www.apple.com/uk/itunes/"
            }
        },
        "entry": [
            {
                "author": {
                    "uri": {
                        "label": "https://itunes.apple.com/gb/reviews/id1413855597"
                    },
                    "name": {
                        "label": "VedantJM"
                    },
                    "label": ""
                },
                "updated": {
                    "label": "2022-05-31T14:20:49-07:00"
                },
                "im:rating": {
                    "label": "5"
                },
                "im:version": {
                    "label": "2.38"
                },
                "id": {
                    "label": "8727815893"
                },
                "title": {
                    "label": "Brilliant"
                },
                "content": {
                    "label": "Adonissss",
                    "attributes": {
                        "type": "text"
                    }
                },
                "link": {
                    "attributes": {
                        "rel": "related",
                        "href": "https://itunes.apple.com/gb/review?id=1500780518&type=Purple%20Software"
                    }
                },
                "im:voteSum": {
                    "label": "0"
                },
                "im:contentType": {
                    "attributes": {
                        "term": "Application",
                        "label": "Application"
                    }
                },
                "im:voteCount": {
                    "label": "0"
                }
            },
            ...

I’ve only shown the first author object of the entry list. So the JSON response is structured in the following way:

  • we have a single root element “feed”
  • this root element has only two children, “author” and “entry”, from which I am only interested in “entry”
  • “entry” is a list of objects and each object has a set of properties like “author”, “link” and ,”im:rating”
  • Each property is again a JSON object
  • The most simple property is an object with just a “label” key and a value.
  • More complex properties like “author” are again nested

Before I dive deeper in how to parse this nested structure, let me try pandas read_json() method first.

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

The output of this is the following table:

feed
author{‘name’: {‘label’: ‘iTunes Store’}, ‘uri’: {‘l…
entry[{‘author’: {‘uri’: {‘label’: ‘https://itunes….
icon{‘label’: ‘http://itunes.apple.com/favicon.ico’}
id{‘label’: ‘https://mzstoreservices-int-st.itun…
link[{‘attributes’: {‘rel’: ‘alternate’, ‘type’: ‘…
rights{‘label’: ‘Copyright 2008 Apple Inc.’}
title{‘label’: ‘iTunes Store: Customer Reviews’}
updated{‘label’: ‘2022-06-02T11:44:53-07:00’}

This is clearly not what I had in mind. The first problem I should eliminate is that pandas cannot possibly know that I am only interested in the “entry” list, so I will first fetch the JSON response, parse it into a dictionary and access the “entry” value:

import requests

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"

r = requests.get(url)

data = r.json()
entries = data["feed"]["entry"]

Thus, entries looks like this:

[{'author': {'label': '',
             'name': {'label': 'hdydgdbs'},
             'uri': {'label': 'https://itunes.apple.com/gb/reviews/id1351156521'}},
  'content': {'attributes': {'type': 'text'},
              'label': 'This meditation app is above all, it works and is '
                       'free, i reccomend it to everyone who wants to '
                       'meditate'},
  'id': {'label': '8730361700'},
  'im:contentType': {'attributes': {'label': 'Application',
                                    'term': 'Application'}},
  'im:rating': {'label': '5'},
  'im:version': {'label': '2.38'},
  'im:voteCount': {'label': '0'},
  'im:voteSum': {'label': '0'},
  'link': {'attributes': {'href': 'https://itunes.apple.com/gb/review?id=1500780518&type=Purple%20Software',
                          'rel': 'related'}},
  'title': {'label': 'Amazing app'},
  'updated': {'label': '2022-06-01T08:25:00-07:00'}},
  ...

Now, I can try pandas again. Note, that I no longer have a JSON string but a normal Python list, containing dictionaries. Therefore, I can directly use pandas DataFrame class:

df = pd.DataFrame(entries)

The first rows of this data frame looks as follows (df.head(3)):

authorupdatedim:ratingim:versionidtitlecontentlinkim:voteSumim:contentTypeim:voteCount
0{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-06-01T08:25:00-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8730361700’}{‘label’: ‘Amazing app’}{‘label’: ‘This meditation app is above all, i…{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}
1{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-05-31T14:20:49-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8727815893’}{‘label’: ‘Brilliant’}{‘label’: ‘Adonissss’, ‘attributes’: {‘type’: …{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}
2{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-05-31T08:25:36-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8726950116’}{‘label’: ‘Perfect’}{‘label’: ‘This app is the one for meditations…{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}

Much better but still not there yet. We have the correct columns and each row is indeed one entry from the entries list. However, all values are strings and, worse, a string representation of the inner dictionaries (and sometimes multiple nested dictionaries). I cannot work with data like this so we have to manually parse the list of entries, which I will explain next.

Looking again at the structure of the entries (see Listing “JSON response”), the strategy is simple: go through each entry, and as long as the value is a dictionary, concatenate the keys to a single column name and the final value is the value for this column and row.

Now a very crude first attempt could be to hardcode all attribute names like this:

parsed_data = defaultdict(list)

for entry in entries:
    parsed_data["author_uri"].append(entry["author"]["uri"]["label"])
    parsed_data["author_name"].append(entry["author"]["name"]["label"])
    parsed_data["author_label"].append(entry["author"]["label"])
    parsed_data["content_label"].append(entry["content"]["label"])
    parsed_data["content_attributes_type"].append(entry["content"]["attributes"]["type"])
    ... 
                    

This implementation might be naive and does not generalize at all to any other use case, but it is still a highly effective method to begin with because it forces you to explicitly state the JSON structure down to the last element. That this method works can be tested again with the pandas DataFrame class that can create a data frame from a dictionary that has a list of values for each column:

pd.DataFrame(parsed_data)

The output will be a data frame like this:

author_uriauthor_nameauthor_labelcontent_labelcontent_attributes_type
0https://itunes.apple.com/gb/reviews/id1351156521hdydgdbsThis meditation app is above all, it works and…text
1https://itunes.apple.com/gb/reviews/id1413855597VedantJMAdonisssstext
2https://itunes.apple.com/gb/reviews/id1413779831dtnvcgiifghThis app is the one for meditations, great sel…text

However, aiming at a more general solution that can deal automatically with all attributes/properties without knowing the structure (but relying on the fact that there are only two levels of nested dictionaries, at least for now), I derived at the following solution:

parsed_data = defaultdict(list)

for entry in entries:
    for key, val in entry.items():
        for subkey, subval in val.items():
            if not isinstance(subval, dict):
                parsed_data[f"{key}_{subkey}"].append(subval)
            else:
                for att_key, att_val in subval.items():
                    parsed_data[f"{key}_{subkey}_{att_key}"].append(att_val)

The code is not the most beautiful one but I will come to this later. For now let’s focus on the intend: For each entry I look at the first key-value pair, knowing that value is always a dictionary (object in JSON). Now I have to deal with two different cases. In the first case, the value dictionary is flat and does not contain another dictionary, only key-value pairs. This is the simple case in which I combine the outer key with the inner key to a column name and take the value as column value for each pair. In the second case, the dictionary contains a key-value pair where the value is again a dictionary. I rely on the fact that there are at most two levels of nested dictionaries so I iterate over the key-value pairs of the inner dictionary and again combine the outer key and the most inner key to a column name and take the inner value as column value.

This procedure gives me a dictionary where the keys are the column names of the data frame and each key has a list as value with the row values for this column. This is the perfect format for the pandas DataFrame class to create a data frame from:

df = pd.DataFrame(parsed_data)
df.head()

And the first rows look like this:

author_uri_labelauthor_name_labelauthor_labelupdated_labelim:rating_labelim:version_labelid_labeltitle_labelcontent_labelcontent_attributes_typelink_attributes_rellink_attributes_hrefim:voteSum_labelim:contentType_attributes_termim:contentType_attributes_labelim:voteCount_label
0https://itunes.apple.com/gb/reviews/id1351156521hdydgdbs2022-06-01T08:25:00-07:0052.388730361700Amazing appThis meditation app is above all, it works and…textrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication0
1https://itunes.apple.com/gb/reviews/id1413855597VedantJM2022-05-31T14:20:49-07:0052.388727815893BrilliantAdonisssstextrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication0
2https://itunes.apple.com/gb/reviews/id1413779831dtnvcgiifgh2022-05-31T08:25:36-07:0052.388726950116PerfectThis app is the one for meditations, great sel…textrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication0

And there it is! I have a few more columns than originally expected because I decided to keep every bit of information by flattening the nested structure of dicts into a single dict where each combination of attributes is preserved by concatenating the different keys into a single column name, separated by an underscore “_”. This data frame has 50 rows and 16 columns, which is in accordance with the original JSON response. If you dislike the additional “label” part in the column names, it is easy to get rid of it:

df.columns = [col if not "label" in col else "_".join(col.split("_")[:-1]) for col in df.columns]

Right now, all columns have the data type object, which is not ideal memory-wise, but does not have a huge impact as long as the data set is as small as this. However, I can change the dtype with a simple one-liner:

df["im:rating"] = df["im:rating"].astype(int)

Pandas .info() method confirms the cast:

RangeIndex: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   author_uri                      50 non-null     object
 1   author_name                     50 non-null     object
 2   author                          50 non-null     object
 3   updated                         50 non-null     object
 4   im:rating                       50 non-null     int32 
 5   im:version                      50 non-null     object
 6   id                              50 non-null     object
 7   title                           50 non-null     object
 8   content                         50 non-null     object
 9   content_attributes_type         50 non-null     object
 10  link_attributes_rel             50 non-null     object
 11  link_attributes_href            50 non-null     object
 12  im:voteSum                      50 non-null     object
 13  im:contentType_attributes_term  50 non-null     object
 14  im:contentType_attributes       50 non-null     object
 15  im:voteCount                    50 non-null     object
dtypes: int32(1), object(15)
memory usage: 6.2+ KB

To conclude this article, I want to improve the reusability of my code. The first obvious thing to do would be to extract the parsing logic into one or several functions with proper type annotation and docstring. However, this is not the focus of this article so I will leave this part to the more practically inclined reader.

Instead, I want to stress that my solution (Listing “advanced implementation”) breaks for deeper nested JSON structures. That is because I had to explicitly iterate over the inner dictionaries with a for loop for each dictionary. A better solution to such a problem is a recursive approach where we apply a divide-and-conquer paradigm to handle the complexity. In other words what I really intend to do is to go into each dictionary as long as there are inner dictionaries and once I reach the end, add all values as separate columns:

def recursive_parser(entry: dict, data_dict: dict, col_name: str = "") -> dict:
    """Recursive parser for a list of nested JSON objects
    
    Args:
        entry (dict): A dictionary representing a single entry (row) of the final data frame.
        data_dict (dict): Accumulator holding the current parsed data.
        col_name (str): Accumulator holding the current column name. Defaults to empty string.
    """
    for key, val in entry.items():
        extended_col_name = f"{col_name}_{key}" if col_name else key
        if isinstance(val, dict):
            recursive_parser(entry[key], data_dict, extended_col_name)
        else:
            data_dict[extended_col_name].append(val)

parsed_data = defaultdict(list)

for entry in entries:
    recursive_parser(entry, parsed_data, "")

df = pd.DataFrame(parsed_data)

Isn’t that a beauty! Like often when a recursive approach is more natural to the task at hand the recursive implementation is more readable and often shorter than the iterative approach. You can verify yourself that the data frame obtained by this approach is identical to the data frame obtained from the previous iterative solution.

There are of course other approaches. A common strategy is to flatten the original JSON by doing something very similar like we did here: pull out all nested objects by concatenating all keys and keeping the final inner value. If you change the original JSON like this you obtain a JSON that can be directly fed into pandas. There is even a module you can use right out of the box: flatten_json. But where would be the fun in this…

I hope you enjoyed following me on this little journey and as always I am open for your comments, discussions and questions.


Keep Calm and Code in Python!


Viewing all articles
Browse latest Browse all 22465

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>