Tags: mongodb, mongodump, mongorestore

BSON vs gzip dump of MongoDB


I have a database with:

  • On-disk size: 19.032 GB (from the show dbs command)

  • Data size: 56 GB (from db.collectionName.stats(1024*1024*1024).size)
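
For context on the gap between these two numbers (this note and the command below are an addition, not part of the original question): WiredTiger compresses data on disk, so the on-disk size reported by show dbs is usually much smaller than the uncompressed data size reported by stats().size; the storageSize field of the same stats() output shows the on-disk figure. A quick way to see both, assuming the legacy mongo shell bundled with 4.2 and placeholder database/collection names:

% mongo myDatabase --eval 'var s = db.collectionName.stats(1024*1024*1024); printjson({ dataSizeGB: s.size, onDiskGB: s.storageSize })'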

While taking a dump with mongodump, we can pass the --gzip flag. These are my observations with and without this flag (example invocations are sketched after the table).

| command      | time taken to dump | size of dump | restoration time | observation |
|--------------|--------------------|--------------|------------------|-------------|
| with gzip    | 30 min             | 7.5 GB       | 20 min           | in mongostat the insert rate ranged from 30k to 80k per sec |
| without gzip | 10 min             | 57 GB        | 50 min           | in mongostat the insert rate was very erratic, ranging from 8k to 20k per sec |
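
For reference, the two runs compared above were produced with invocations along these lines (the database name and output directories here are placeholders):

# plain BSON dump (the 57 GB case)
% mongodump --db=myDatabase --out=dump_plain
# gzip-compressed dump: each .bson file is written as .bson.gz (the 7.5 GB case)
% mongodump --db=myDatabase --gzip --out=dump_gzip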

The dump was taken from a machine with 8 cores and 40 GB RAM (Machine B) onto a 12-core, 48 GB RAM machine (Machine A), and then restored from Machine A onto another 12-core, 48 GB RAM machine (Machine C), to make sure there was no resource contention between the mongod, mongodump and mongorestore processes. Mongo version is 4.2.0.

I have a few questions:

  1. What is the functional difference between the two dumps?
  2. Can the BSON dump be compressed (zipped) after the fact?
  3. How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during the inserts?)
  4. Is there a way to make the overall process faster? From these results it seems we have to choose between dump speed and restore speed.
  5. Will having a bigger machine (more RAM) that reads the dump and restores it expedite the overall process?
  6. Will a smaller dump help the overall time?

Update: 2. Can the BSON dump be compressed (zipped) after the fact?

Yes:

% ./mongodump -d=test                                                                     
2022-11-16T21:02:24.100+0530    writing test.test to dump/test/test.bson
2022-11-16T21:02:24.119+0530    done dumping test.test (10000 documents)
% gzip dump/test/test.bson                                         
% ./mongorestore   --db=test8 --gzip dump/test/test.bson.gz
2022-11-16T21:02:51.076+0530    The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
2022-11-16T21:02:51.077+0530    checking for collection data in dump/test/test.bson.gz
2022-11-16T21:02:51.184+0530    restoring test8.test from dump/test/test.bson.gz
2022-11-16T21:02:51.337+0530    finished restoring test8.test (10000 documents, 0 failures)
2022-11-16T21:02:51.337+0530    10000 document(s) restored successfully. 0 document(s) failed to restore.
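
As a side note (an addition, not part of the original update): instead of gzipping the .bson file after the fact, mongodump can write compressed output directly, either per collection file with --gzip, or as a single compressed archive file with --archive together with --gzip; mongorestore reads it back with the same flags. A minimal sketch reusing the test database from above:

% ./mongodump -d=test --gzip --archive=test.archive.gz
% ./mongorestore --nsInclude='test.*' --gzip --archive=test.archive.gz

To restore into a different database, as with test8 above, the --nsFrom='test.*' --nsTo='test8.*' options can be added to the mongorestore call.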

Solution

  • I am no MongoDB expert, but I have good experience with MongoDB backup and restore activities, and I will answer to the best of my knowledge.

    1. What is the functional difference between the two dumps?
    2. Can the BSON dump be compressed (zipped) after the fact?
    3. How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during the inserts?)

    In the production backup environment at my company, indexing of the keys takes more time than the restoration of the data itself.
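
    As a side note (an addition, not part of the original answer): mongorestore can skip index builds with --noIndexRestore, and the indexes can be created afterwards, which separates the data-load phase from the index-build phase. A sketch with placeholder database, collection and field names:

    # load data only; skip building the indexes recorded in the dump's metadata
    % mongorestore --gzip --noIndexRestore dump_gzip/
    # then build the indexes afterwards from the shell (placeholder names)
    % mongo myDatabase --eval 'db.myCollection.createIndex({ userId: 1 }, { unique: true })'

    Note that a unique index built after the load still scans the whole collection and will fail if duplicates were inserted, so the uniqueness check is deferred rather than avoided.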

    4. Is there a way to make the overall process faster? From these results it seems we have to choose between dump speed and restore speed.

    The solution depends...

    Choose whichever scenario works best for you, but you have to find the right balance: primarily between time and space, and secondarily whether to apply indexes or not.

    At my company, we usually take non-compressed backups and restores for P-1, gzip compression for prod backups that are weeks old, and further manual compression for backups that are months old.
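
    One more knob worth mentioning (an addition, not part of the original answer): both tools process multiple collections in parallel, and mongorestore can also run several insertion workers per collection, which can matter more than machine size alone. A sketch with illustrative values and a placeholder database name:

    # dump up to 6 collections concurrently
    % mongodump --db=myDatabase --gzip --out=dump_gzip --numParallelCollections=6
    # restore 6 collections concurrently with 8 insertion workers each
    % mongorestore --gzip --numParallelCollections=6 --numInsertionWorkersPerCollection=8 dump_gzip/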

    5. Will having a bigger machine (more RAM) that reads the dump and restores it expedite the overall process?
    6. Will a smaller dump help the overall time?

    Hope this helped answer your questions. Let me know if you have any more.