I am writing a simulation which requires a backing database to store the results. The simulation writes a huge amount of data. For obvious performance reasons, I chose to try out a NoSQL database, specifically MongoDB. However, I'm a bit puzzled over my data model.
In relational world, the schema would translate to this:
The simulation work as the following. First we create configuration (maps to Simulation table) and specify scenarios and how many Realization to calculate. Then we start the simulation. The simulation creates realizations in a scenario (in parallel, so many realizations and calculated at the same time and inserted into the scenario the simulation is currently working on).
However, in NoSQL, specifically MongoDB, relations are bad and slow, so I should make use of embedded documents as much as possible. So I came up with this:
This model should give me the best performance when first calculating all realizations and THEN saving it to the database as a single insert (of Scenario).
However, for performance reasons, I want to insert a Realization into Scenario as soon as it is computed. Which would require updating the Scenario every time a realization is complited. Is this a bad idea ? It says on the MongoDB reference that when adding a embedded document into a parent document, the parent document is updated but there is a performance loss anyway.
Would it be faster not to embed Realization into Scenario but reference it ? How much performance would be lost when reading and aggregating the data later ? Any other pitfalls I should know ?
Thanks.
It depends how you will use the data - embedding can involve updating multiple documents, so is slow to write but reading is always one document only so will be fast. Referencing is the opposite - writing to a single document (fast) but reading multiple documents (slow).
Aside from potential limitations like reaching a maximum size for embedded documents, it just comes down to which type of performance is more important for your scenario.