google-cloud-dataflow apache-beam beam-sql

BeamSQL Group By query problem with Float value


I tried to get unique values from a BigQuery table using BeamSQL in Google Dataflow. I implemented the condition with a Group By clause in BeamSQL (sample query below). One of the columns has a float data type. While executing the job, I got the exception below:

Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: org.apache.beam.sdk.coders.RowCoder@81d6d10 is not deterministic because: All fields must have deterministic encoding.
Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: FloatCoder is not deterministic because: Floating point encodings are not guaranteed to be deterministic.

BeamSQL Query:

PCollection<Row> st = mainColl.apply(SqlTransform.query("SELECT ID, ITEM, UNITPRICE FROM PCOLLECTION GROUP BY ID, ITEM, UNITPRICE"));

It would be great if someone could help me solve this issue.

Please note that if we remove the float column, the BeamSQL query works fine.


Solution

  • This indicates that you should not use floating point values (in this case, most likely the UNITPRICE column) in the aggregation (group by) key, because their encoding is non-deterministic (i.e. the value can change with precision). For instance, consider this example:

    WITH
      data AS (
      SELECT 100 AS id, 'abc' as item, 0.3448473362800000001 AS unitprice
      UNION ALL
      SELECT 200 AS id, 'xyz' as item, 0.49300013 AS unitprice
      UNION ALL
      SELECT 500 AS id, 'pqr' as item, 0.67322332200000212 AS unitprice
    )
    select id, item, unitprice from data
    group by id, item, unitprice
    

    Output for this comes to be:

    100 abc 0.34484733628    
    200 xyz 0.49300013   
    500 pqr 0.6732233220000021
    

    in which the unitprice values come out slightly different from the literals that went in.
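
    The Dataflow error is the same property surfacing through Beam: GroupByKey requires the key coder to be deterministic, and RowCoder delegates that check to each field's coder, where FloatCoder always fails it. As a minimal sketch (the class name is just illustrative), you can reproduce the check directly with Beam's coder API:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.FloatCoder;

    public class FloatCoderCheck {
      public static void main(String[] args) {
        try {
          // GroupByKey runs this check on the key coder; FloatCoder
          // unconditionally reports that float encodings are non-deterministic.
          FloatCoder.of().verifyDeterministic();
        } catch (Coder.NonDeterministicException e) {
          // Prints a message like the "FloatCoder is not deterministic" part
          // of the stack trace above.
          System.out.println(e.getMessage());
        }
      }
    }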

    To avoid this, you can go two routes:

    1. Keep the float out of the GROUP BY key: group only by the deterministic columns and aggregate UNITPRICE instead (e.g. MAX(UNITPRICE) or MIN(UNITPRICE)), as sketched below.
    2. Convert UNITPRICE to a type with a deterministic coder, such as DECIMAL/NUMERIC or a string, and group by the converted value.
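
    A minimal sketch of the first route, assuming mainColl is a PCollection<Row> whose schema has ID, ITEM and UNITPRICE fields (the class and variable names here are just illustrative):

    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class DedupByIdAndItem {
      // Returns one row per (ID, ITEM) pair. The float column is aggregated,
      // so it never becomes part of the GroupByKey key and its coder is
      // never required to be deterministic.
      public static PCollection<Row> apply(PCollection<Row> mainColl) {
        return mainColl.apply(
            SqlTransform.query(
                "SELECT ID, ITEM, MAX(UNITPRICE) AS UNITPRICE "
                    + "FROM PCOLLECTION "
                    + "GROUP BY ID, ITEM"));
      }
    }

    If every (ID, ITEM) pair only ever has one price, MAX simply returns that price; otherwise pick the aggregate that matches what "unique" should mean for your data.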

    Hope this helps.