I am looking at using Azure Databricks and Delta Live Tables to store and process financial order book data. This could grow into a very large table over time, with potentially billions of rows and terabytes of storage. The table will be queried, aggregated and processed using PySpark.

There will be numeric data, price and quantity, with the precision needed being 2 decimal places. I am wondering if there is much advantage to storing these values as integers with an offset versus as doubles.

Mostly I am thinking about the processing of the data and whether there might be a reduction in compute costs. If there are any resources or papers where such a comparison has been done, or if anyone has experience with this, it would be great to hear.
Compared to `double`, `int` has several advantages. The first and main one is that you save a lot of memory when you use `int` with an offset.
Here, I created two DataFrames with 50 million records each and cached both: one has a column of type `double` and the other a column of type `int`. Comparing the memory used to store each (visible in the Spark UI Storage tab), you can see that the `double` column takes more space than the `int` one.
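A minimal sketch of how that comparison could be reproduced (the row count and column names here are just assumptions for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 50 million rows with a double-typed price column
df_double = spark.range(50_000_000).select(
    (F.col("id") / 100.0).alias("price")           # DoubleType
)

# the same values stored as scaled integers (price * 100)
df_int = spark.range(50_000_000).select(
    F.col("id").cast("int").alias("price_scaled")  # IntegerType
)

df_double.cache().count()  # materialise both caches
df_int.cache().count()

# compare "Size in Memory" for the two cached DataFrames
# in the Spark UI under the Storage tab
```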
Next, applying math functions while aggregating on `double` values adds more computational overhead than on `int`, and you also get floating-point rounding errors, which the integer representation avoids.
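As a rough illustration of the rounding issue (the exact output will vary by platform), summing one cent ten million times as a double typically drifts slightly, while the scaled-integer sum stays exact:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rows = spark.range(10_000_000)

# sum 0.01 as a double vs. 1 as a scaled integer (cents)
double_sum = rows.agg(F.sum(F.lit(0.01))).first()[0]
cents_sum = rows.agg(F.sum(F.lit(1))).first()[0]

print(double_sum)       # often something like 100000.0000021..., not exact
print(cents_sum / 100)  # exactly 100000.0
```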
The problems you run into with the `int` type are: if you want to change the precision in the future, you have to reprocess everything; when displaying the data, you have to convert back to `double`, which creates its own computational overhead (a sketch of that conversion follows below); and you have to know how to properly scale up or down when using an offset, because with `double` you just do the direct calculation, while with `int` you need to rescale according to the precision.
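For example (a sketch; `orders_df` and `price_scaled` are hypothetical names), the display-time conversion in PySpark would be:

```python
from pyspark.sql import functions as F

# price_scaled holds price * 100 as an integer (hypothetical column)
display_df = orders_df.withColumn(
    "price", F.col("price_scaled") / 100  # back to a double for display
)
```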
Example: let's say we have a price (e.g., 123.45 dollars) and a quantity (e.g., 678.90 units), and we store them as 12345 and 67890 respectively (each scaled by 100). If you multiply the two:

Double result: 123.45 * 678.90 = 83810.205
Integer result: 12345 * 67890 = 838102050

Here the integer result carries a combined scale factor of 100 * 100 = 10,000, so to recover the true value with precision 2 you need to divide by 10,000 (or divide by 100 once to keep the result at scale factor 100).
So you need to be careful with these kinds of calculations.
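A small sketch of that bookkeeping in plain Python (the scale factor and variable names are assumptions):

```python
SCALE = 100  # 2 decimal places of precision

price_scaled = 12345  # 123.45 * 100
qty_scaled = 67890    # 678.90 * 100

# multiplying two scale-100 values gives a scale-10000 result
notional_scaled = price_scaled * qty_scaled  # 838102050

# divide by SCALE**2 to recover the real value for display ...
print(notional_scaled / SCALE**2)   # 83810.205

# ... or integer-divide by SCALE once to stay at scale 100
# (note this truncates: 8381020 represents 83810.20, not 83810.21)
notional_at_scale = notional_scaled // SCALE
print(notional_at_scale)            # 8381020
```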