javascript json node.js firebase jsonstream

Dealing with a JSON object too big to fit into memory


I have a dump of a Firebase database representing our Users table, stored as JSON. I want to run some data analysis on it, but it's too big to load into memory completely and manipulate with pure JavaScript (or with Underscore/Lodash-style utility libraries).

Up until now I've been using the JSONStream package to deal with my data in bite-sized chunks (it calls a callback once for each user in the JSON dump).

I've now hit a roadblock, though, because I want to filter user IDs based on the values stored under them. The questions I'm trying to answer are now of the form "Which users x?", whereas previously I was only asking "How many users x?" and didn't need to know who they were.

The data format is like this:

{
    "users": {
        "123": {
            "foo": 4
        },
        "567": {
            "foo": 8
        }
    }
}

What I want to do is essentially get the user ID (123 or 567 in the above) based on the value of foo. Now, if this were a small list it would be trivial to use something like _.each to iterate over the keys and values and extract the keys I want.
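For reference, this is roughly what the in-memory version would look like in plain JavaScript if the dump were small enough to load. The sample data, the findUserIds helper, and the foo > 5 condition are only illustrative:

```javascript
// Sample data in the same shape as the dump (values are illustrative).
var dump = {
    users: {
        '123': { foo: 4 },
        '567': { foo: 8 }
    }
};

// Return the IDs of all users whose record satisfies `predicate`.
function findUserIds(users, predicate) {
    return Object.keys(users).filter(function (uid) {
        return predicate(users[uid]);
    });
}

var matches = findUserIds(dump.users, function (user) {
    return user.foo > 5;
});

console.log(matches); // [ '567' ]
```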

Unfortunately, since it doesn't fit into memory, that doesn't work. With JSONStream I can iterate over it by creating a parser with var parser = JSONStream.parse('users.*'); piping the file stream into it, and handling each user in a callback like this:

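var fs = require('fs');
var JSONStream = require('JSONStream');

// Emit one 'data' event per value under the top-level "users" object.
var parser = JSONStream.parse('users.*');
var stream = fs.createReadStream('my.json');

stream.pipe(parser);

parser.on('data', function(user) {
    // user is equal to { foo: bar } here
    // so it is trivial to do my filter
    // but I don't know which user ID owns the data
});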

The problem is that I don't have access to the key matched by the * wildcard that I passed into JSONStream.parse. In other words, I don't know whether { foo: bar } belongs to user 123 or user 567.

The question is twofold:

  1. How can I get the current path from within my callback?
  2. Is there a better way to be dealing with this JSON data that is too big to fit into memory?

Solution

  • I went ahead and edited JSONStream to add this functionality.

    If anyone runs across this and wants to patch it similarly, you can replace line 83, which was previously

    stream.queue(this.value[this.key])
    

    with this:

    var ret = {};
    ret[this.key] = this.value[this.key];
    
    stream.queue(ret);
    

    In the code sample from the original question, user in the callback is no longer equal to { foo: bar }; it is now { uid: { foo: bar } }, where uid is the actual key (123 or 567 in the example data).

    Since this is a breaking change, I didn't submit a pull request back to the original project, but I did mention it in the project's issues in case they want to add a flag or option for this in the future.
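
With the patched output, the callback from the question can recover the user ID by reading the single key of each emitted object. This is a minimal sketch, assuming each data event carries exactly one key as described above. The handleUser function, the foo > 5 condition, and the sample objects are made up for illustration, and the handler is called directly here rather than through a real parser:

```javascript
var matchingIds = [];

// Handler for the patched parser's 'data' events.
// Each `wrapped` object looks like { uid: { foo: ... } }.
function handleUser(wrapped) {
    var uid = Object.keys(wrapped)[0]; // the only key is the user ID
    var user = wrapped[uid];

    if (user.foo > 5) {
        matchingIds.push(uid);
    }
}

// Simulate what the patched parser would emit for the sample data.
[{ '123': { foo: 4 } }, { '567': { foo: 8 } }].forEach(handleUser);

console.log(matchingIds); // [ '567' ]
```

In real use the handler would be attached with parser.on('data', handleUser), so only the matching IDs are kept in memory rather than the whole dump.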