sqlhashamazon-athenaprestotrino

Safe hash of arbitrary map in Trino SQL / Athena


Using Trino SQL (actually AWS Athena implementation of Trino), I want to compute safe hashs of arbitrary MAP columns. By "arbitrary" I mean MAP that may have other MAP as values for certain keys. Example where I SELECT 2 rows being the same MAP defined with different key ordering:

WITH test_data(some_map) AS (
    VALUES
        ( -- Map with key order: 'a', 'b'
            MAP(ARRAY [ 'a', 'b' ], ARRAY [ ARRAY [ 'x' ], ARRAY [ 'y' ] ])
        ),
        ( -- Map with key order: 'b', 'a'
            MAP(ARRAY [ 'b', 'a' ], ARRAY [ ARRAY [ 'y' ], ARRAY [ 'x' ] ])
        )
)

SELECT some_map
FROM test_data

Both entries represent the same object, therefore I want to compute the same hash for them.

Here is my current solution involving a first cast to JSON, then a serialization through json_format before starting the actual hashing task:

SELECT 
    some_map,
    xxhash64(
        CAST(
            json_format(
                CAST(
                    some_map AS JSON
                )
            )
            AS VARBINARY
        )
    ) as hashed
FROM test_data

Empirically this trick seems to work because the Trino's internal representation of a MAP seems to be key-ordered ... but I cannot be sure of the hash function remaining consistent with these empirical observations.

Question is twofold:


Solution

  • is it someway a guarantee of keys in a MAP to be ordered ?

    In general - no. Maps are unordered data structures (though there are special cases like SortedMap in Java or SortedDictionary in C#, but AFAIK there is no such concept in Trino).

    Moreover, there are cases when internal representation of data is translated as unordered (tough generally based on my experiments maps behave like ordered ones):

    -- sample data
    with dataset(map_data1, map_data2) as (
        values (map_from_entries(array[('z', 1), ('a', 2), ('b', 3), ('c', 4)]),
               map_from_entries(array[('c', 4), ('b', 3), ('a', 2), ('z', 1)]))
    )
    
    -- query
    select *,
           map_data1 = map_data2 are_equal
    from dataset;
    

    Which results in the following output for me (check out the z key position):

    map_data1 map_data2 are_equal
    {a=2, z=1, b=3, c=4} {a=2, b=3, z=1, c=4} true

    And there is no documentation guarantying the order of keys in the map (I even found this bit which states that "Maps are not ordered"), so ordering of the data in the map is an implementation detail which can differ depended on number of factors.

    There is one more note though, if you add cast to JSON to the previous code snippet:

    select *,
           CAST(map_data1 AS JSON) json_one,
           CAST(map_data2 AS JSON) json_two
    from dataset;
    

    The result for JSON becomes sorted:

    map_data1 map_data2 json_one json_two
    {a=2, z=1, b=3, c=4} {a=2, b=3, z=1, c=4} {"a":2,"b":3,"c":4,"z":1} {"a":2,"b":3,"c":4,"z":1}

    Which clearly is an implementation detail, but for current one seems to be guaranteed if I have read the code correctly (not Trino developer here 😊). If you check out the MapToJsonCast implementation you will see the following code:

    Map<String, Integer> orderedKeyToValuePosition = new TreeMap<>();
    for (int i = 0; i < map.getSize(); i++) {
        String objectKey = provider.getObjectKey(rawKeyBlock, rawOffset + i);
        orderedKeyToValuePosition.put(objectKey, i);
    }
    

    See the variable name - orderedKeyToValuePosition and usage of the java's TreeMap<> which is:

    A Red-Black tree based NavigableMap implementation

    And

    The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used.

    if the first answer is no, what would be a safe way of computing hash of such MAP ?

    In current Trino implementation it seems to be safe to use your query, but again - this is an implementation detail which might change without any notice. If you want to guarantee the behavior you might need to look into creating a user-defined function to calculate the hash (for example using Java, see the Create and deploy a UDF using Lambda doc or using Trino's UDFs).