I have a BSON file that comes from a mongoexport of a database. Let's assume the database is todo and the collection is items. Now I want to load the data offline into my RN app. Since the collection may contain arbitrarily many documents (let's say 2 currently), I want a method that parses the file no matter how many documents it contains.
I have tried the following methods:
1. The bsondump executable. We can convert the file to JSON using an external command
bsondump --outFile items.json items.bson
But I am developing a mobile app, so invoking a third-party executable through a shell command is not ideal. Also, the output consists of several one-line JSON objects, so it is technically not a valid JSON file, and parsing it afterwards is not graceful.
2. deserialize in the js-bson library. According to the js-bson documentation, we can do
const bson = require('bson')
const fs = require('fs')
bson.deserialize(fs.readFileSync(PATH_HERE))
But this raises an error
Error: buffer length 173 must === bson size 94
and by adding this option,
bson.deserialize(fs.readFileSync(PATH_HERE), {
allowObjectSmallerThanBufferSize: true
})
the error is resolved, but only the first document is returned. Because the documentation doesn't mention that this function can only parse a single document, I wonder if there is some option that enables reading multiple documents.
3. deserializeStream in js-bson.
let docs = []
bson.deserializeStream(fs.readFileSync(PATH_HERE), 0, 2, docs, 0)
But this method requires the document count as a parameter (2 here).
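As a side note, the js-bson documentation describes the return value of deserializeStream as the next index in the buffer after the requested documents, so in principle one could loop over the buffer one document at a time without knowing the count in advance. A minimal sketch, assuming the same bson/fs setup as above:
let buffer = fs.readFileSync(PATH_HERE);
let docs = [];
let index = 0;
while (index < buffer.length) {
  // deserialize a single document, then jump to the offset returned by the call
  index = bson.deserializeStream(buffer, index, 1, docs, docs.length);
}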
4. The bson-stream library. I am actually using react-native-fetch-blob instead of fs, and according to its documentation, the stream object does not have a pipe method, which is the one and only method demonstrated in the bson-stream docs. So although this method does not require the number of documents, I am not sure how to use it.
// fs
const BSONStream = require('bson-stream');
fs.createReadStream(PATH_HERE).pipe(new BSONStream()).on('data', callback);
// RNFetchBlob
const RNFetchBlob = require('react-native-fetch-blob');
RNFetchBlob.fs.readStream(PATH_HERE, ENCODING)
.then(stream => {
stream.open();
stream.can_we_pipe_here(new BSONStream())
stream.onData(callback)
});
Also, I'm not sure what ENCODING to use above.
I have read the source code of js-bson and have figured out a way to solve the problem. I think it's better to keep a detailed record here:
Split the documents ourselves, and feed them to the parser one by one.
Let's say the .json dump of our todo/items.bson is
{_id: "someid#1", content: "Launch a manned rocket to the sun"}
{_id: "someid#2", content: "Wash my underwear"}
This clearly violates JSON syntax, because there isn't an outer object wrapping things together.
EDIT: I later encountered the term "JSON stream", which seems to describe such a format. This term is used by jq (the command-line tool for JSON manipulation), although the term may indicate that those JSON documents are intended to come one by one in a stream.
The BSON file is of a similar shape, but it seems BSON allows this kind of multi-document concatenation in one file.
Then for each document, the four leading bytes indicate the length of that document as a little-endian 32-bit integer; this length includes the four-byte prefix itself and the suffix. The suffix is simply a 0 byte.
The final BSON file resembles
LLLLDDDDDDD0LLLLDDD0LLLLDDDDDDDDDDDDDDDDDDDDDD0...
where L is length, D is binary data, and 0 is literally 0.
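For concreteness, here is one way to inspect this layout with the same bson package; the document {a: 1} is just an illustrative example of mine:
const BSON = require('bson');

// {a: 1} serializes to 12 bytes:
//   0c 00 00 00   little-endian int32: total document length (12), including these 4 bytes
//   10            element type 0x10 (32-bit integer)
//   61 00         element name "a", null-terminated
//   01 00 00 00   the value 1
//   00            the trailing 0 byte that closes the document
const bytes = BSON.serialize({ a: 1 });
console.log(bytes.length);              // 12
console.log(bytes[0]);                  // 12 -- the length prefix
console.log(bytes[bytes.length - 1]);   // 0  -- the terminator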
Therefore, we can use a simple algorithm: read the document length, call bson.deserialize with allowObjectSmallerThanBufferSize to parse the first document from the start of the buffer, then slice that document off and repeat.
One extra thing to mention is encoding in the React Native context. The libraries dealing with React Native file persistence all seem to lack support for reading a raw buffer from a file. The closest choice we have is base64, which is a string representation of any binary file. We then use Buffer to convert the base64 string into a buffer and feed it into the algorithm above.
deserialize.js
const BSON = require('bson');
function _getNextObjectSize(buffer) {
// BSON stores each document's length in its first four bytes, as a little-endian int32
return buffer[0] | (buffer[1] << 8) | (buffer[2] << 16) | (buffer[3] << 24);
}
function deserialize(buffer, options) {
let _buffer = buffer;
let _result = [];
while (_buffer.length > 0) {
let nextSize = _getNextObjectSize(_buffer);
if (_buffer.length < nextSize) {
throw new Error("Corrupted BSON file: the last object is incomplete.");
}
else if (_buffer[nextSize - 1] !== 0) {
throw new Error(`Corrupted BSON file: the ${_result.length + 1}-th object does not end with 0.`);
}
let obj = BSON.deserialize(_buffer, {
...options,
allowObjectSmallerThanBufferSize: true,
promoteBuffers: true // Since BSON supports raw buffers as a data type, this option keeps
// those buffers as-is, which is valid in a JS object but not in JSON
});
_result.push(obj);
_buffer = _buffer.slice(nextSize);
}
return _result;
}
module.exports = deserialize;
App.js
import RNFetchBlob from 'rn-fetch-blob';
const deserialize = require('./deserialize.js');
const Buffer = require('buffer/').Buffer;
RNFetchBlob.fs.readFile('...', 'base64')
.then(b64Data => Buffer.from(b64Data, 'base64'))
.then(bufferData => deserialize(bufferData))
.then(jsData => {/* Do anything here */})
The above method reads the file as a whole. Sometimes, when we have a very large .bson file, the app may crash. Of course, one can change the readFile above to readStream and add various checks to determine whether the current chunk contains the end of a document, but that is troublesome, and we would essentially be re-writing the bson-stream library!
So instead, we can create an RNFetchBlob file stream and a bson-stream parsing stream. This brings us back to attempt #4 in the question.
After reading the source code, it turns out the BSON parsing stream inherits from a Node.js Transform stream. So instead of piping, we can manually forward chunks and events: write the chunks received in onData into the parsing stream, call its end() when onEnd fires, and listen to its 'data' and 'end' events.
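A minimal sketch of that forwarding is below. The helper name readBsonDump and its callbacks are hypothetical, the open/onData/onEnd/onError calls are the rn-fetch-blob read-stream API from its documentation, and the 4095-byte buffer size (a multiple of 3) is an assumption so that each base64 chunk decodes cleanly on its own:
import RNFetchBlob from 'rn-fetch-blob';
const BSONStream = require('bson-stream');
const Buffer = require('buffer/').Buffer;

// Hypothetical helper: stream a .bson file and call onDocument once per parsed document.
function readBsonDump(path, onDocument, onDone) {
  const parser = new BSONStream();  // a Transform stream: raw BSON bytes in, JS objects out
  parser.on('data', onDocument);    // emitted once per complete document
  parser.on('end', onDone);

  RNFetchBlob.fs.readStream(path, 'base64', 4095)
    .then(stream => {
      stream.onData(chunk => parser.write(Buffer.from(chunk, 'base64'))); // forward chunks
      stream.onError(err => console.warn(err));
      stream.onEnd(() => parser.end());  // flush the parser when the file stream ends
      stream.open();
    });
}

// Usage:
// readBsonDump(PATH_HERE, doc => console.log(doc), () => console.log('done'));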
Since bson-stream does not support passing options through to the underlying bson library calls, one may want to tweak the library's source code a little in their own project.