Is there a way to estimate row size if I know what kind of data I'll be storing (with compression in mind)?
I'm looking at something like
bson_id | string (max 200 chars) | int32 | int32 | int32 | bool | bool | DateTime | DateTime | DateTime | int32
I am trying to find the best DB solution for about 2 trillion records like the one above, combined with roughly 20 times as many records like
bson_id | bson_id
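For reference, my rough back-of-envelope for the uncompressed sizes (assuming 12-byte ObjectIds, 4-byte int32s, 1-byte booleans, 8-byte timestamps, and ignoring any per-row storage overhead):

```python
# Back-of-envelope uncompressed sizing; assumed widths:
# ObjectId = 12 B, int32 = 4 B, bool = 1 B, DateTime = 8 B,
# string = 200 B worst case (single-byte characters).
FIELD_BYTES = {"bson_id": 12, "string200": 200, "int32": 4, "bool": 1, "datetime": 8}

main_row = ["bson_id", "string200", "int32", "int32", "int32",
            "bool", "bool", "datetime", "datetime", "datetime", "int32"]
link_row = ["bson_id", "bson_id"]

main_bytes = sum(FIELD_BYTES[f] for f in main_row)   # 254 B worst case
link_bytes = sum(FIELD_BYTES[f] for f in link_row)   # 24 B

print(f"main: {main_bytes} B/row -> ~{2e12 * main_bytes / 1e12:.0f} TB raw for 2 trillion rows")
print(f"link: {link_bytes} B/row -> ~{20 * 2e12 * link_bytes / 1e12:.0f} TB raw for ~20x as many rows")
```

I have no feel for how much compression and index overhead will change those numbers, hence the question.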
Any other recommendations are welcome.
Sorry for the very generic answer.
As far as I know, tests with dummy data are the only reliable way to measure this. “Dummy” here means fake but not repeated: strong repetition will flatter the compression ratio and spoil the estimate.
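A dummy-row generator for the schema above could look like the sketch below (field names and value ranges are made up; the point is that every row gets fresh random values rather than one repeated template):

```python
import random
import string
from datetime import datetime, timedelta
from bson import ObjectId  # ships with the pymongo package

EPOCH = datetime(2020, 1, 1)
ALPHABET = string.ascii_letters + string.digits

def rand_dt():
    """Random timestamp within ~3 years of EPOCH."""
    return EPOCH + timedelta(seconds=random.randint(0, 3 * 365 * 86400))

def dummy_row(max_str=200):
    """One fake row matching the schema; all values are random so the
    compression ratio is not flattered by repetition."""
    return {
        "_id": ObjectId(),
        "name": "".join(random.choices(ALPHABET, k=random.randint(1, max_str))),
        "a": random.randint(-2**31, 2**31 - 1),
        "b": random.randint(-2**31, 2**31 - 1),
        "c": random.randint(-2**31, 2**31 - 1),
        "flag1": random.random() < 0.5,
        "flag2": random.random() < 0.5,
        "created": rand_dt(),
        "updated": rand_dt(),
        "seen": rand_dt(),
        "counter": random.randint(0, 2**31 - 1),
    }
```

Note that fully random strings are the pessimistic extreme (they barely compress at all); if your real strings are names, URLs or similar, sampling them from a realistic vocabulary will give a fairer compression estimate.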
For example, you can load 1M, 2M, 4M, 8M, 32M, 128M and so on… records and check whether the on-disk size grows linearly with the row count. If it does, you can extrapolate, with some contingency, to billions and trillions of records.
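A minimal sketch of that loop, assuming a local MongoDB test instance reached through pymongo and the dummy_row() generator above (any store that can report its on-disk collection size would work the same way):

```python
import numpy as np
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local test instance
db, coll = client.sizing_test, client.sizing_test.rows

counts, disk_sizes = [], []
inserted, CHUNK = 0, 50_000
for target in (1_000_000, 2_000_000, 4_000_000, 8_000_000):
    while inserted < target:                         # top the collection up to the target count
        n = min(CHUNK, target - inserted)
        coll.insert_many([dummy_row() for _ in range(n)], ordered=False)
        inserted += n
    size = db.command("collstats", "rows")["storageSize"]   # compressed size on disk
    counts.append(inserted)
    disk_sizes.append(size)
    print(f"{inserted:>10} rows -> {size / 2**20:.1f} MiB on disk")

# Fit bytes ~= a * rows + b; the slope is the effective bytes per row after compression.
a, b = np.polyfit(counts, disk_sizes, 1)
for rows in (1e9, 2e12):
    print(f"{rows:.0e} rows -> ~{(a * rows + b) / 2**40:.1f} TiB (extrapolated)")
```

Only trust the extrapolation if the measured points really do sit on a straight line, and create the indexes that match your real workload before measuring, because they can add a significant share of the total.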
Such tests also let you check performance against your needs; for example, you can increase the HDFS replication factor to improve read performance.
And finally, the same tests show how well your data actually compresses.
Good luck with BigData!