pyspark, databricks, azure-databricks, delta-live-tables, data-lakehouse

Databricks Numeric Type Comparison (Int vs Double)


I am looking at using Azure Databricks and Delta Live Tables to store and process financial order book data.

This could grow to a very large table over time, with potentially billions of rows and terabytes of data.

This table will be queried, aggregated and processed using PySpark.

There will be numeric data (price and qty), with the precision needed being 2 decimal places. I am wondering if there is much advantage to storing these values as integers with an offset versus as doubles.
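For illustration, a sketch of what I mean by the two options (the column names here are just examples): a price of 123.45 stored either directly as a double, or as a long scaled by 100.

```python
from pyspark.sql.types import StructType, StructField, DoubleType, LongType

# Option A: store the value directly as a double
schema_double = StructType([StructField("price", DoubleType())])      # holds 123.45

# Option B: store price * 100 as a long ("integer with an offset")
schema_scaled = StructType([StructField("price_x100", LongType())])   # holds 12345
```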

Mostly I am thinking about the processing of the data, and whether there might be a reduction in compute costs. If there are any resources or papers where such a comparison has been done, or if anyone has experience with this, it would be great to hear.


Solution

  • Compared to double, int (with an offset) has several advantages.

    First and foremost, you save a lot of memory when you use int with an offset.

    Here, I have two DataFrames with 50 million records each, both cached; one has a column of type double and the other int.
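    A rough sketch of how that comparison can be set up (the column name and generated values are just examples); the cached sizes then appear in the Spark UI Storage tab:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    n = 50_000_000  # 50 million rows, as in the comparison above

    # price as a double, e.g. 12.34
    df_double = spark.range(n).withColumn("price", (F.col("id") % 10_000) / 100.0)

    # the same price as a scaled long, e.g. 1234 (scale factor 100)
    df_int = spark.range(n).withColumn("price", (F.col("id") % 10_000).cast("long"))

    # materialise both caches, then compare sizes in the Spark UI Storage tab
    df_double.cache().count()
    df_int.cache().count()
    ```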

    Below is the memory used to store them.

    (screenshot: cached storage size of the two DataFrames)

    You can see that the double column takes more storage than the int column.

    Next, applying math functions while aggregating doubles incurs more computational overhead than with the int type, and you also get floating-point rounding errors, which can be avoided with ints.
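    For example, a quick sketch of the rounding difference when summing (the values are made up): adding 0.01 a million times should give exactly 10000.00, but the double sum can drift while the scaled-int sum stays exact.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)

    # qty = 0.01 on every row, stored as a double vs. as a long scaled by 100
    double_sum = df.agg(F.sum(F.lit(0.01))).first()[0]
    scaled_sum = df.agg(F.sum(F.lit(1).cast("long"))).first()[0]

    print(double_sum)        # may show floating-point drift around 10000.0
    print(scaled_sum / 100)  # exactly 10000.0
    ```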

    The problems you get with the int type are: if in the future you want to change the precision, you need to reprocess everything.

    Also, when displaying the data you need to convert back to a decimal value, which creates computational overhead.
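    Something like this sketch (the column name is just an example), converting only when the data is presented:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # "qty_x100" holds qty * 100
    orders = spark.createDataFrame([(12345,), (250,)], ["qty_x100"])

    # divide back to a decimal value only at display time
    orders.withColumn("qty", F.col("qty_x100") / 100.0).show()
    ```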

    Next, you must know how to scale properly when using an offset: with double you just do the calculation directly, but with int you need to scale up or down according to the precision.

    Example: let's say we have a price and a qty, each with 2 decimal places, and we store them as integers scaled by 100 (scale factor 100).

    If you multiply both:

    Double result - 83810.20

    Integer result - 838102050

    Here you need to scale the integer result back down: each operand carries a scale factor of 100, so the product carries a scale of 100 × 100 = 10,000, and 838102050 / 10,000 = 83810.205.
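    A sketch of that bookkeeping in PySpark (the sample price and qty are made-up values, not the ones used above):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    SCALE = 100  # 2 decimal places

    # made-up sample values for illustration
    df = spark.createDataFrame([(123.45, 6.79)], ["price", "qty"])

    # double path: multiply directly
    doubles = df.withColumn("notional", F.col("price") * F.col("qty"))

    # scaled-int path: each operand carries a scale of 100,
    # so the product must be divided by 100 * 100 = 10,000
    scaled = (df
              .withColumn("price_x100", F.round(F.col("price") * SCALE).cast("long"))
              .withColumn("qty_x100",   F.round(F.col("qty") * SCALE).cast("long"))
              .withColumn("notional",
                          (F.col("price_x100") * F.col("qty_x100")) / (SCALE * SCALE)))

    doubles.select("notional").show()
    scaled.select("notional").show()
    ```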

    So you need to be careful with this kind of calculation.