apache-spark apache-spark-sql decimal

Controlling Decimal Precision Overflow in Spark


We are using Spark 2.4.x.

We are seeing precision loss in one of our division operations (69362.86 / 111862.86). Both values are defined as decimal(38,3) on the table. When run through beeline, the division produces 0.620070504187002, but when run through Spark it produces 0.6200710, so Spark's result is truncated. Reading further, we came across the Spark JIRA issue SPARK-29123. One comment there asks us to set the parameter spark.sql.decimalOperations.allowPrecisionLoss to false to avoid precision loss. However, another comment on the same issue warns that the result will be NULL when an exact representation of the decimal value is not possible. The Stack Overflow thread on this topic does not discuss the warning in that second comment. Setting spark.sql.decimalOperations.allowPrecisionLoss to false and running the computation (69362.86 / 111862.86) gives 0.620070504187002, which is good, but we are concerned about the warning in the second comment.

According to the rules laid out in the Spark source code, the precision and scale of a division result are determined by the formula below.

Operation    Result Precision                        Result Scale
e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)      max(6, s1 + p2 + 1)
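
The formula in the table can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not Spark code; the function name `divide_result_type` is made up for illustration, and the result is the raw type before Spark applies any cap at 38.

```python
def divide_result_type(p1, s1, p2, s2):
    """Raw (precision, scale) of e1 / e2 per the table above, before any cap at 38."""
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale


# decimal(38,3) / decimal(38,3), the case from the question
print(divide_result_type(38, 3, 38, 3))  # (80, 42)
```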

By these rules, our precision is 38 - 3 + 3 + max(6, 3 + 38 + 1) = 80 and our scale is max(6, 3 + 38 + 1) = 42. Since both exceed the maximum of 38, they are reduced to 38 and 6. One way to avoid this truncation is to declare tighter precision and scale on the input columns. Based on the data in our table, we could comfortably declare both columns involved in the division as decimal(18,5). The formula then gives precision 42 and scale 24, which the default adjustment reduces to precision 38 and scale 20, because the 18 integer digits have to be preserved. That is enough precision and scale to represent our data without any noticeable truncation. But we cannot do this by hand for every numeric column in our space, so we are considering setting spark.sql.decimalOperations.allowPrecisionLoss to false at the cluster level. We would like to understand in which situations the result will be NULL with this parameter set to false, where the default setting would instead have returned a value with some precision loss.
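
The reduction from (80, 42) to (38, 6) can be sketched as follows. This models the default behaviour (allowPrecisionLoss = true) as I understand Spark's `DecimalType.adjustPrecisionScale`; the Python names are illustrative, negative scales are not handled, and the allowPrecisionLoss = false path is not modelled here. Under this sketch, decimal(18,5) inputs (raw result 42, 24) come out as (38, 20).

```python
MAX_PRECISION = 38
MINIMUM_ADJUSTED_SCALE = 6  # fractional digits kept at minimum when precision is capped


def adjust_precision_scale(precision, scale):
    """Cap a raw (precision, scale) at 38 digits, assuming scale >= 0."""
    if precision <= MAX_PRECISION:
        return precision, scale  # nothing to adjust
    int_digits = precision - scale
    # Keep at least 6 fractional digits (or the original scale if it is smaller).
    min_scale = min(scale, MINIMUM_ADJUSTED_SCALE)
    # Give the integer part priority; the scale gets whatever room is left.
    adjusted_scale = max(MAX_PRECISION - int_digits, min_scale)
    return MAX_PRECISION, adjusted_scale


print(adjust_precision_scale(80, 42))  # (38, 6)
print(adjust_precision_scale(42, 24))  # (38, 20)
```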

So my question is: in which situations does setting spark.sql.decimalOperations.allowPrecisionLoss to false produce NULL, where leaving it at the default (true) would produce a value with some precision loss? Can you provide an example that I can reproduce on my end? If no such example can be found, is it safe to set this parameter to false at the cluster level so that arithmetic operations produce more precise results?


Solution

  • I found some examples where setting spark.sql.decimalOperations.allowPrecisionLoss to true or false produces different results. Two such examples are given below.

    From this analysis, I understood that when the parameter is set to false there is, as the name suggests, no tolerance for losing digits in the fractional portion of the decimal value. However, if the scale of the result of the arithmetic operation exceeds the maximum of 38, the scale is reduced to 38. The integer portion gets no such adjustment: if the integer part fits within (precision - scale) digits, the correct value is returned; otherwise the computation returns NULL.

    Given this, we have decided to leave the parameter at its default of true. Otherwise, any decimal column that is declared more loosely than its actual values require could cause an arithmetic operation to return NULL.

    Case 1: [image]

    Case 2: [image]
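
The NULL behaviour described in this answer can be modelled outside Spark with Python's `decimal` module. This is a rough sketch under stated assumptions, not Spark's implementation: the helper `fit_decimal` is hypothetical, it rounds half-up to the target scale, and it returns `None` where Spark would return NULL because the integer part needs more than (precision - scale) digits.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext


def fit_decimal(value, precision, scale):
    """Fit value into decimal(precision, scale); None stands in for NULL on overflow."""
    with localcontext() as ctx:
        ctx.prec = 76  # enough working digits for any decimal(38, s) value
        rounded = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    # The integer part may use at most (precision - scale) digits.
    limit = Decimal(1).scaleb(precision - scale)
    if abs(rounded) >= limit:
        return None
    return rounded


print(fit_decimal(Decimal("12345.678"), 8, 3))  # 12345.678
print(fit_decimal(Decimal("12345.678"), 5, 3))  # None
```

The second call fails because decimal(5,3) leaves only two digits for the integer part, which is the situation a loosely declared input column can create once the result type has been squeezed to fit within 38 digits.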