javascript node.js mongodb bson

How to deserialize dumped BSON with arbitrarily many documents in JavaScript?


I have a BSON file that comes from a mongodump of a database. Let's assume the database is todo and the collection is items. Now I want to load the data offline into my React Native (RN) app. Since the collection may contain arbitrarily many documents (let's say 2 currently), I want a method that parses the file no matter how many documents it contains.

I have tried the following methods:

  1. Use the external bsondump executable.

We can convert the file to JSON using an external command:

bsondump --outFile items.json items.bson

But I am developing a mobile app, so invoking a third-party executable via a shell command is not ideal. Plus, the output consists of one JSON object per line, so it is technically not a valid JSON file, and parsing it afterwards is not graceful.

  2. Use deserialize in the js-bson library

According to the js-bson documentation, we can do

const bson = require('bson')
const fs = require('fs')
bson.deserialize(fs.readFileSync(PATH_HERE))

But this raises an error

Error: buffer length 173 must === bson size 94

and by adding this option,

bson.deserialize(fs.readFileSync(PATH_HERE), {
    allowObjectSmallerThanBufferSize: true
})

the error is resolved, but only the first document is returned. Since the documentation doesn't mention that this function can only parse a single-document collection, I wonder if there is some option that enables reading multiple documents.

  3. Use deserializeStream in js-bson

let docs = []
bson.deserializeStream(fs.readFileSync(PATH_HERE), 0, 2, docs, 0)

But this method requires the document count as a parameter (2 here), which I don't know in advance (see the sketch after this list).

  4. Use the bson-stream library

I am actually using react-native-fetch-blob instead of fs, and according to its documentation, the stream object does not have a pipe method, which is the only usage demonstrated in the bson-stream docs. So although this method does not require the number of documents, I am confused about how to use it.

// fs
const BSONStream = require('bson-stream');
fs.createReadStream(PATH_HERE).pipe(new BSONStream()).on('data', callback);

// RNFetchBlob
const RNFetchBlob = require('react-native-fetch-blob');
RNFetchBlob.fs.readStream(PATH_HERE, ENCODING)
.then(stream => {
    stream.open();
    stream.can_we_pipe_here(new BSONStream())
    stream.onData(callback)
});

Also, I'm not sure what ENCODING should be in the code above.
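
Regarding attempt 3: if deserializeStream returns the buffer offset it stopped at (I have not verified this against the js-bson source), then a loop like the hypothetical sketch below might avoid hard-coding the document count, but I don't know whether this is supported usage:

const bson = require('bson')
const fs = require('fs')

const buffer = fs.readFileSync(PATH_HERE)
let docs = []
let index = 0
while (index < buffer.length) {
    // parse one document at a time, resuming from where the previous call stopped
    index = bson.deserializeStream(buffer, index, 1, docs, docs.length)
}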


Solution

  • I have read the source code of js-bson and have figured out a way to solve the problem. I think it's worth keeping a detailed record here:

    Approach 1

    Split documents by ourselves, and feed the documents to parser one-by-one.

    BSON internal format

    Let's say the .json dump of our todo/items.bson is

    {_id: "someid#1", content: "Launch a manned rocket to the sun"}
    {_id: "someid#2", content: "Wash my underwear"}
    

    This clearly violates JSON syntax because there is no outer object or array wrapping the documents together.

    EDIT: I later encountered the term "JSON stream", which seems to describe such a format. This term is used by jq (the command line tool for JSON manipulation), although the term may indicate that those JSON documents are intended to come one-by-one in a stream.

    The internal BSON layout has a similar shape, and it turns out BSON allows this kind of back-to-back concatenation of documents in one file.

    For each document, the four leading bytes encode the length of that document as a little-endian 32-bit integer; this length includes the prefix itself and the suffix. The suffix is simply a 0 byte.

    The final BSON file resembles

    LLLLDDDDDDD0LLLLDDD0LLLLDDDDDDDDDDDDDDDDDDDDDD0...
    

    where L is length, D is binary data, 0 is literally 0.
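
    As a concrete illustration, here is the canonical single-document example from the BSON spec (bsonspec.org), with the length prefix and the trailing 0 byte annotated. The hex dump should be reproducible with bson.serialize:

    // {hello: "world"} serializes to 22 (0x16) bytes:
    // 16 00 00 00                    <- total size, little-endian int32
    // 02 68 65 6c 6c 6f 00           <- element of type 0x02 (string) with key "hello"
    // 06 00 00 00 77 6f 72 6c 64 00  <- string length 6, "world", NUL
    // 00                             <- the trailing 0 byte closing the document
    const BSON = require('bson');
    console.log(BSON.serialize({ hello: 'world' }).toString('hex'));
    // 160000000268656c6c6f0006000000776f726c640000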

    The algorithm

    Therefore, we can use a simple algorithm: read the length of the next document, call bson.deserialize with allowObjectSmallerThanBufferSize to parse the first document from the start of the buffer, then slice that document off and repeat.

    About encoding

    One extra thing I mentioned is encoding in the React Native context. The libraries dealing with React Native persistent storage all seem to lack support for reading a raw buffer from a file. The closest choice we have is base64, which is a string representation of arbitrary binary data. We then use Buffer to convert the base64 string to a buffer and feed it into the algorithm above.

    The code

    deserialize.js

    const BSON = require('bson');
    
    function _getNextObjectSize(buffer) {
        // BSON stores each document's size as a little-endian 32-bit integer in its first four bytes
        return buffer[0] | (buffer[1] << 8) | (buffer[2] << 16) | (buffer[3] << 24);
    }
    
    function deserialize(buffer, options) {
        let _buffer = buffer;
        let _result = [];
    
        while (_buffer.length > 0) {
            let nextSize = _getNextObjectSize(_buffer);
            if (_buffer.length < nextSize) {
                throw new Error("Corrupted BSON file: the last object is incomplete.");
            }
            else if (_buffer[nextSize - 1] !== 0) {
                throw new Error(`Corrupted BSON file: the ${_result.length + 1}-th object does not end with 0.`);
            }
    
            let obj = BSON.deserialize(_buffer, {
                ...options,
                allowObjectSmallerThanBufferSize: true,
                promoteBuffers: true // BSON supports raw binary as a data type; this option returns
                // such fields as Node.js Buffers, which are valid in a JS object but not in JSON
            });
            _result.push(obj);
            _buffer = _buffer.slice(nextSize);
        }
    
        return _result;
    }
    
    module.exports = deserialize;
    

    App.js

    import RNFetchBlob from 'rn-fetch-blob';
    const deserialize = require('./deserialize.js');
    const Buffer = require('buffer/').Buffer;
    
    RNFetchBlob.fs.readFile('...', 'base64')
        .then(b64Data => Buffer.from(b64Data, 'base64'))
        .then(bufferData => deserialize(bufferData))
        .then(jsData => {/* Do anything here */})
    

    Approach 2

    The above method reads the file as a whole. When the .bson file is very large, the app may crash. Of course, one could change readFile to readStream above and add various checks to determine whether the current chunk contains the end of a document. That would be troublesome, and we would effectively be re-writing the bson-stream library!

    So instead, we can create an RNFetchBlob file stream and a bson-stream parsing stream. This brings us back to attempt 4 in the question.

    After reading the source code, I found that the BSON parsing stream inherits from a Node.js Transform stream. Instead of piping, we can manually forward chunks and the end event from the file stream's onData and onEnd into the parsing stream, and listen for its 'data' and 'end' events, as sketched below.
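
    A minimal sketch of that wiring, assuming rn-fetch-blob's readStream API (open / onData / onEnd) and that bson-stream accepts raw BSON bytes through the standard writable-stream write() and end() calls (error handling omitted):

    import RNFetchBlob from 'rn-fetch-blob';
    const BSONStream = require('bson-stream');
    const Buffer = require('buffer/').Buffer;

    const parser = new BSONStream();
    parser.on('data', doc => { /* one deserialized document at a time */ });
    parser.on('end', () => { /* the whole file has been parsed */ });

    // Read base64 chunks; a buffer size that is a multiple of 3 keeps each
    // chunk independently decodable from base64.
    RNFetchBlob.fs.readStream(PATH_HERE, 'base64', 4095)
        .then(stream => {
            stream.open();
            stream.onData(chunk => parser.write(Buffer.from(chunk, 'base64')));
            stream.onEnd(() => parser.end());
        });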

    Since bson-stream does not support passing options to the underlying bson library calls, one may want to tweak the library's source code a little in their own project.