python, json, pandas, large-data, ijson

Bypass memory error to read large JSON file in Python


I'm converting several JSON files into CSVs using the code below, and everything works as intended. However, when I try to convert a larger JSON file (2-4+ GB), it gives a MemoryError.

Current Code

import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
json_list = json.load(open(file_path + '.json', 'r', encoding='utf-8', errors='ignore'))
# Extract data from the defined key names
key_list = ['created', 'emails', 'identities']
json_list = [{k:d[k] for k in key_list} for d in json_list]
# Flatten and convert to a data frame
json_list_flattened = (flatten(d, '.') for d in json_list)
df = pandas.DataFrame(json_list_flattened)
# Drop unwanted columns
df.drop(df.filter(regex='identities.0.favorites|identities.0.likes').columns, axis=1, inplace=True)
# Export to CSV in the same directory with the original file name
df.to_csv(file_path + '.csv', sep=',', encoding='utf-8', index=False, header=True)

After looking around online for similar issues, it looks like I can use ijson, which parses the JSON file incrementally instead of loading the whole thing into memory. You'll also see in my code that I only extract certain JSON keys to convert to CSV (created, emails, identities).

I'm not sure the best way to implement it, but I figure it would go something like this in the beginning of my code:

import ijson
...
json_list = ijson.parse(open(file_path + '.json', 'r', encoding='utf-8', errors='ignore'))
...

I can't share the full JSON file I'm working with because it has sensitive info. But you can use this sample data below to test. It's only one record, but I wanted to show you the layout that I'm working with.

JSON Sample

[
    {
        "callId": "abc123",
        "errorCode": 0,
        "apiVersion": 2,
        "statusCode": 200,
        "statusReason": "OK",
        "time": "2020-12-14T12:00:32.744Z",
        "registeredTimestamp": 1417731582000,
        "UID": "_guid_abc123==",
        "created": "2014-12-04T22:19:42.894Z",
        "createdTimestamp": 1417731582000,
        "data": {},
        "preferences": {},
        "emails": {
            "verified": [],
            "unverified": []
        },
        "identities": [
            {
                "provider": "facebook",
                "providerUID": "123",
                "allowsLogin": true,
                "isLoginIdentity": true,
                "isExpiredSession": true,
                "lastUpdated": "2014-12-04T22:26:37.002Z",
                "lastUpdatedTimestamp": 1417731997002,
                "oldestDataUpdated": "2014-12-04T22:26:37.002Z",
                "oldestDataUpdatedTimestamp": 1417731997002,
                "firstName": "John",
                "lastName": "Doe",
                "nickname": "John Doe",
                "profileURL": "https://www.facebook.com/John.Doe",
                "age": 50,
                "birthDay": 31,
                "birthMonth": 12,
                "birthYear": 1969,
                "city": "City, State",
                "education": [
                    {
                        "school": "High School Name",
                        "schoolType": "High School",
                        "degree": null,
                        "startYear": 0,
                        "fieldOfStudy": null,
                        "endYear": 0
                    }
                ],
                "educationLevel": "High School",
                "favorites": {
                    "music": [
                        {
                            "name": "Music 1",
                            "id": "123",
                            "category": "Musician/band"
                        },
                        {
                            "name": "Music 2",
                            "id": "123",
                            "category": "Musician/band"
                        }
                    ],
                    "movies": [
                        {
                            "name": "Movie 1",
                            "id": "123",
                            "category": "Movie"
                        },
                        {
                            "name": "Movie 2",
                            "id": "123",
                            "category": "Movie"
                        }
                    ],
                    "television": [
                        {
                            "name": "TV 1",
                            "id": "123",
                            "category": "Tv show"
                        }
                    ]
                },
                "followersCount": 0,
                "gender": "m",
                "hometown": "City, State",
                "languages": "English",
                "likes": [
                    {
                        "name": "Like 1",
                        "id": "123",
                        "time": "2014-10-31T23:52:53.0000000Z",
                        "category": "TV",
                        "timestamp": "1414799573"
                    },
                    {
                        "name": "Like 2",
                        "id": "123",
                        "time": "2014-09-16T08:11:35.0000000Z",
                        "category": "Music",
                        "timestamp": "1410855095"
                    }
                ],
                "locale": "en_US",
                "name": "John Doe",
                "photoURL": "https://graph.facebook.com/123/picture?type=large",
                "timezone": "-8",
                "thumbnailURL": "https://graph.facebook.com/123/picture?type=square",
                "username": "john.doe",
                "verified": "true",
                "work": [
                    {
                        "companyID": null,
                        "isCurrent": null,
                        "endDate": null,
                        "company": "Company Name",
                        "industry": null,
                        "title": "Company Title",
                        "companySize": null,
                        "startDate": "2010-12-31T00:00:00"
                    }
                ]
            }
        ],
        "isActive": true,
        "isLockedOut": false,
        "isRegistered": true,
        "isVerified": false,
        "lastLogin": "2014-12-04T22:26:33.002Z",
        "lastLoginTimestamp": 1417731993000,
        "lastUpdated": "2014-12-04T22:19:42.769Z",
        "lastUpdatedTimestamp": 1417731582769,
        "loginProvider": "facebook",
        "loginIDs": {
            "emails": [],
            "unverifiedEmails": []
        },
        "rbaPolicy": {
            "riskPolicyLocked": false
        },
        "oldestDataUpdated": "2014-12-04T22:19:42.894Z",
        "oldestDataUpdatedTimestamp": 1417731582894,
        "registered": "2014-12-04T22:19:42.956Z",
        "regSource": "",
        "socialProviders": "facebook"
    }
]

Expected Outcome

As a result, I expect to run my working code on larger JSON files without it crashing and giving me a MemoryError.


Solution

  • In general, if you want to use ijson to reduce memory overheads, you'll need to be careful that the rest of your code doesn't introduce overheads as well. The best-case scenario is that you translate a single item of your JSON document into a single line of the resulting CSV file, and that you do this iteratively. This means moving away from list comprehensions (which act on all the data at once) and not using a DataFrame (which again holds all your contents at once).

    Regarding ijson usage: a cheap solution is to use ijson.items to iterate over each object in your JSON document. In the best-case scenario described above, you would then remove the unnecessary fields and convert each object into a CSV line. Something like:

    import ijson

    with open(path, 'rb') as fin:
        for obj in ijson.items(fin, 'item'):
            filter_object_and_turn_it_into_a_csv_line(obj)
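
    To make this concrete, here is a minimal sketch of what that placeholder function could look like, assuming you write each record straight to the CSV with csv.DictWriter. The fieldnames list is hypothetical (with a streaming parser the CSV columns have to be chosen up front rather than inferred from all records the way pandas does), and the key filtering and favorites/likes dropping mirror the question's code:

    import csv, ijson
    from flatten_json import flatten

    key_list = ['created', 'emails', 'identities']
    # Hypothetical header: pick the flattened columns you actually want;
    # a streaming writer cannot infer them from the full data set.
    fieldnames = ['created', 'identities.0.provider',
                  'identities.0.firstName', 'identities.0.lastName']

    with open(file_path + '.json', 'rb') as fin, \
         open(file_path + '.csv', 'w', newline='', encoding='utf-8') as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for obj in ijson.items(fin, 'item'):
            slim = {k: obj[k] for k in key_list}   # keep only the wanted keys
            flat = flatten(slim, '.')              # same flattening as the original code
            # drop the unwanted favorites/likes columns before writing
            flat = {k: v for k, v in flat.items()
                    if 'favorites' not in k and 'likes' not in k}
            writer.writerow(flat)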
    

    If for some reason you really need to keep using a DataFrame, you can at least do the data cleaning with generator expressions before passing the data to the DataFrame, to avoid extra intermediate copies (but remember that you will still end up loading most of the data into memory anyway):

    import ijson, pandas
    from flatten_json import flatten

    with open(path, 'rb') as fin:
        json_list = ijson.items(fin, 'item')
        key_list = ['created', 'emails', 'identities']
        json_list = ({k: d[k] for k in key_list} for d in json_list)  # this was a list comprehension in the original code
        json_list_flattened = (flatten(d, '.') for d in json_list)
        df = pandas.DataFrame(json_list_flattened)
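
    If your records have a consistent enough shape, another middle-ground sketch (an assumption on my part, not something the original code does) is to batch the ijson stream with itertools.islice and append one DataFrame chunk at a time to the CSV, so only chunk_size records are in memory at once. The caveat is that each chunk infers its own columns, so ragged records can produce misaligned chunks:

    import itertools
    import ijson, pandas
    from flatten_json import flatten

    key_list = ['created', 'emails', 'identities']
    chunk_size = 10000  # hypothetical batch size; tune it to the memory you have

    with open(file_path + '.json', 'rb') as fin:
        records = ijson.items(fin, 'item')
        first_chunk = True
        while True:
            batch = list(itertools.islice(records, chunk_size))
            if not batch:
                break
            flattened = (flatten({k: d[k] for k in key_list}, '.') for d in batch)
            df = pandas.DataFrame(flattened)
            df.drop(df.filter(regex='identities.0.favorites|identities.0.likes').columns,
                    axis=1, inplace=True)
            # append each chunk to the same CSV, writing the header only once
            df.to_csv(file_path + '.csv', mode='a', header=first_chunk, index=False)
            first_chunk = False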