sql-order-byprestorow-numberdeterministicnon-deterministic

how to do global sorting without a unique key in Presto


In my case, I have some hive tables, the partition column(dt) is the only column that every table contains.

I execute the sql below in hive

SELECT * FROM (
SELECT row_number() over(ORDER BY T.dt) as row_num,T.* FROM 
(select * from ods.test_table where dt='2021-09-06') as T) TT 
WHERE TT.row_num BETWEEN 1 AND 10

I get the same result every time.

But I execute the sql in Presto, the result is not the same. I think the root cause is my table lack of a unique key.

Is it possible to do a global query without unique key in Presto?


Solution

  • You are calculating row_number

    row_number() over(ORDER BY T.dt)
    

    and ORDER BY column is always the same dt='2021-09-06'. In this case row_number has non-deterministic behavior and can assign the same numbers to different rows from run to run.

    The fact that you are always getting the same results in Hive is a coincident, probably you always are running with exactly the same number of splits or even on single mapper, which runs single-threaded and producing results which look like deterministic. Presto may have different parallelism and it affects which rows are passed to the row_number first.

    You can try to change something in splits configuration to force more mappers or increase the data size and you will be able to reproduce non-deterministic behavior, many mappers running in parallel on heavy loaded cluster will execute with different speed and different rows will be passed to the row_number.

    To have deterministic results, you can add some columns to the ORDER BY which will determine the order of rows. If you have no such columns, then it means that you can have any number of full duplicates.

    Even if you do not have unique key, row_number will produce deterministic results if ALL columns are in the order by.

    Consider this dataset:

    Col1 Col2 Col3
    1    1    2
    1    1    2
    1    1    3
    1    1    3
    

    row_number() over(ORDER BY col1) as rn can produce all 4 rows ordered differently each run (let's suppose the dataset is very big one and there are many mappers are running concurrently, some mappers can finish faster, some can fail and restart). Of course, if you have such a small dataset and always processing it in single process, single threaded, the result will be the same, but in general, this is not how databases work.

    The same about row_number() over(ORDER BY col1, col2)

    But in case of row_number() over(ORDER BY col1, col2, col3) - you will always get the same dataset, guaranteed.

    So, the solution is to use as much order by columns as needed to determine the order of rows. In the worst case if you have full duplicates, all columns should be added to the ORDER BY, duplicates will be ordered together and the result will be deterministic.