
Python - How to stream a large (11 GB) JSON file so it can be broken up


I have a very large JSON file (11 GB) that is too large to read into memory. I would like to break it up into smaller files to analyze the data. I am currently using Python and Pandas for the analysis, and I am wondering whether there is some way to access chunks of the file so that each chunk can be read into memory without crashing the program. Ideally, I would like to break the year's worth of data into smaller, manageable files that each span about a week. The amount of data per week isn't constant, but it doesn't matter much whether the files cover an exact interval.

Here is the data format:

{
"actor" : 
{
    "classification" : [ "suggested" ],
    "displayName" : "myself",
    "followersCount" : 0,
    "followingCount" : 0,
    "followingStocksCount" : 0,
    "id" : "person:stocktwits:183087",
    "image" : "http://avatars.stocktwits.com/production/183087/thumb-1350332393.png",
    "link" : "http://stocktwits.com/myselfbtc",
    "links" : 
    [

        {
            "href" : null,
            "rel" : "me"
        }
    ],
    "objectType" : "person",
    "preferredUsername" : "myselfbtc",
    "statusesCount" : 2,
    "summary" : null,
    "tradingStrategy" : 
    {
        "approach" : "Technical",
        "assetsFrequentlyTraded" : [ "Forex" ],
        "experience" : "Novice",
        "holdingPeriod" : "Day Trader"
    }
},
"body" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
"entities" : 
{
    "chart" : 
    {
        "fullImage" : 
        {
            "link" : "http://charts.stocktwits.com/production/original_10047145.png"
        },
        "image" : 
        {
            "link" : "http://charts.stocktwits.com/production/small_10047145.png"
        },
        "link" : "http://stks.co/iDEB",
        "objectType" : "image"
    },
    "sentiment" : 
    {
        "basic" : "Bearish"
    },
    "stocks" : 
    [

        {
            "displayName" : "Bitcoin",
            "exchange" : "PRIVATE",
            "industry" : null,
            "sector" : null,
            "stocktwits_id" : 9659,
            "symbol" : "BCOIN"
        }
    ],
    "video" : null
},
"gnip" : 
{
    "language" : 
    {
        "value" : "en"
    }
},
"id" : "tag:gnip.stocktwits.com:2012:note/10047145",
"inReplyTo" : 
{
    "id" : "tag:gnip.stocktwits.com:2012:note/10046953",
    "objectType" : "comment"
},
"link" : "http://stocktwits.com/myselfbtc/message/10047145",
"object" : 
{
    "id" : "note:stocktwits:10047145",
    "link" : "http://stocktwits.com/myselfbtc/message/10047145",
    "objectType" : "note",
    "postedTime" : "2012-10-17T19:13:50Z",
    "summary" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
    "updatedTime" : "2012-10-17T19:13:50Z"
},
"provider" : 
{
    "displayName" : "StockTwits",
    "link" : "http://stocktwits.com"
},
"verb" : "post"
}

Solution

  • jq 1.5 has a streaming parser (documented at http://stedolan.github.io/jq/manual/#Streaming). In one sense it is easy to use: if your 1G file is named 1G.json, the following command produces a stream of lines, including one line per "leaf" value:

    jq -c --stream . 1G.json

    (The output is shown below. Notice that each line is itself valid JSON.)

    Using the streamed output may not be so easy, though; that depends on what you want to do :-)

    The key to understanding the streamed output is that most lines have the form:

    [ PATH, VALUE ]

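    where "PATH" is an array representation of the path to a leaf value. (When using jq, this array can in fact be used as a path.) The remaining lines contain only a path, [ PATH ], and mark the point at which the enclosing array or object closes.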

    [["actor","classification",0],"suggested"]
    [["actor","classification",0]]
    [["actor","displayName"],"myself"]
    [["actor","followersCount"],0]
    [["actor","followingCount"],0]
    [["actor","followingStocksCount"],0]
    [["actor","id"],"person:stocktwits:183087"]
    [["actor","image"],"http://avatars.stocktwits.com/production/183087/thumb-1350332393.png"]
    [["actor","link"],"http://stocktwits.com/myselfbtc"]
    [["actor","links",0,"href"],null]
    [["actor","links",0,"rel"],"me"]
    [["actor","links",0,"rel"]]
    [["actor","links",0]]
    [["actor","objectType"],"person"]
    [["actor","preferredUsername"],"myselfbtc"]
    [["actor","statusesCount"],2]
    [["actor","summary"],null]
    [["actor","tradingStrategy","approach"],"Technical"]
    [["actor","tradingStrategy","assetsFrequentlyTraded",0],"Forex"]
    [["actor","tradingStrategy","assetsFrequentlyTraded",0]]
    [["actor","tradingStrategy","experience"],"Novice"]
    [["actor","tradingStrategy","holdingPeriod"],"Day Trader"]
    [["actor","tradingStrategy","holdingPeriod"]]
    [["actor","tradingStrategy"]]
    [["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
    [["entities","chart","fullImage","link"],"http://charts.stocktwits.com/production/original_10047145.png"]
    [["entities","chart","fullImage","link"]]
    [["entities","chart","image","link"],"http://charts.stocktwits.com/production/small_10047145.png"]
    [["entities","chart","image","link"]]
    [["entities","chart","link"],"http://stks.co/iDEB"]
    [["entities","chart","objectType"],"image"]
    [["entities","chart","objectType"]]
    [["entities","sentiment","basic"],"Bearish"]
    [["entities","sentiment","basic"]]
    [["entities","stocks",0,"displayName"],"Bitcoin"]
    [["entities","stocks",0,"exchange"],"PRIVATE"]
    [["entities","stocks",0,"industry"],null]
    [["entities","stocks",0,"sector"],null]
    [["entities","stocks",0,"stocktwits_id"],9659]
    [["entities","stocks",0,"symbol"],"BCOIN"]
    [["entities","stocks",0,"symbol"]]
    [["entities","stocks",0]]
    [["entities","video"],null]
    [["entities","video"]]
    [["gnip","language","value"],"en"]
    [["gnip","language","value"]]
    [["gnip","language"]]
    [["id"],"tag:gnip.stocktwits.com:2012:note/10047145"]
    [["inReplyTo","id"],"tag:gnip.stocktwits.com:2012:note/10046953"]
    [["inReplyTo","objectType"],"comment"]
    [["inReplyTo","objectType"]]
    [["link"],"http://stocktwits.com/myselfbtc/message/10047145"]
    [["object","id"],"note:stocktwits:10047145"]
    [["object","link"],"http://stocktwits.com/myselfbtc/message/10047145"]
    [["object","objectType"],"note"]
    [["object","postedTime"],"2012-10-17T19:13:50Z"]
    [["object","summary"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
    [["object","updatedTime"],"2012-10-17T19:13:50Z"]
    [["object","updatedTime"]]
    [["provider","displayName"],"StockTwits"]
    [["provider","link"],"http://stocktwits.com"]
    [["provider","link"]]
    [["verb"],"post"]
    [["verb"]]
    

    For further details, see the jq manual and, in particular, the section on the streaming parser in the jq FAQ.
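Since the question asks for weekly files that Python can load, here is a minimal sketch of how the streamed lines might be consumed on the Python side. It assumes the large file is a sequence of non-empty top-level JSON objects shaped like the sample in the question, and that each one carries an `object.postedTime` timestamp in the format shown there. The function names (`set_path`, `objects_from_stream`, `week_label`, `split_by_week`) and the `.jsonl` output naming are made up for illustration; they are not part of jq or of any library.

```python
import json
import os
from datetime import datetime


def set_path(root, path, value):
    """Store value in root at the location named by a jq-style path array."""
    node = root
    for key, next_key in zip(path, path[1:]):
        # The next path element decides which container to create:
        # an integer index means a list, a string key means a dict.
        fresh = [] if isinstance(next_key, int) else {}
        if isinstance(key, int):
            if key == len(node):  # first event for this array slot
                node.append(fresh)
            node = node[key]
        else:
            node = node.setdefault(key, fresh)
    last = path[-1]
    if isinstance(last, int) and last == len(node):
        node.append(value)
    else:
        node[last] = value


def objects_from_stream(lines):
    """Yield one dict per top-level object found in `jq -c --stream` output."""
    current = {}
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        path = event[0]
        if len(event) == 2:
            # [PATH, VALUE]: a leaf value.
            set_path(current, path, event[1])
        elif len(path) == 1:
            # [PATH] with a single element: the top-level object is complete.
            yield current
            current = {}


def week_label(record):
    """Return an ISO year-week label such as '2012-W42' from object.postedTime."""
    posted = datetime.strptime(
        record["object"]["postedTime"], "%Y-%m-%dT%H:%M:%SZ"
    )
    year, week, _ = posted.isocalendar()
    return f"{year}-W{week:02d}"


def split_by_week(lines, out_dir):
    """Append each reassembled object to out_dir/<week>.jsonl, one per line."""
    os.makedirs(out_dir, exist_ok=True)
    for record in objects_from_stream(lines):
        target = os.path.join(out_dir, week_label(record) + ".jsonl")
        # Reopening the file for every record is slow but keeps the sketch short.
        with open(target, "a", encoding="utf-8") as out:
            out.write(json.dumps(record) + "\n")
```

The `lines` argument can be any iterable of text lines, for example the `stdout` of `subprocess.Popen(["jq", "-c", "--stream", ".", "1G.json"], stdout=subprocess.PIPE, text=True)`, so only one object is held in memory at a time. Each resulting weekly file holds one JSON document per line and could then be loaded with `pandas.read_json(path, lines=True)`.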