Using Trino SQL (actually AWS Athena implementation of Trino), I want to compute safe hashs of arbitrary MAP columns. By "arbitrary" I mean MAP that may have other MAP as values for certain keys. Example where I SELECT 2 rows being the same MAP defined with different key ordering:
WITH test_data(some_map) AS (
VALUES
( -- Map with key order: 'a', 'b'
MAP(ARRAY [ 'a', 'b' ], ARRAY [ ARRAY [ 'x' ], ARRAY [ 'y' ] ])
),
( -- Map with key order: 'b', 'a'
MAP(ARRAY [ 'b', 'a' ], ARRAY [ ARRAY [ 'y' ], ARRAY [ 'x' ] ])
)
)
SELECT some_map
FROM test_data
Both entries represent the same object, therefore I want to compute the same hash for them.
Here is my current solution involving a first cast to JSON, then a serialization through json_format
before starting the actual hashing task:
SELECT
some_map,
xxhash64(
CAST(
json_format(
CAST(
some_map AS JSON
)
)
AS VARBINARY
)
) as hashed
FROM test_data
Empirically this trick seems to work because the Trino's internal representation of a MAP seems to be key-ordered ... but I cannot be sure of the hash function remaining consistent with these empirical observations.
Question is twofold:
In general - no. Maps are unordered data structures (though there are special cases like SortedMap
in Java or SortedDictionary
in C#, but AFAIK there is no such concept in Trino).
Moreover, there are cases when internal representation of data is translated as unordered (tough generally based on my experiments maps behave like ordered ones):
-- sample data
with dataset(map_data1, map_data2) as (
values (map_from_entries(array[('z', 1), ('a', 2), ('b', 3), ('c', 4)]),
map_from_entries(array[('c', 4), ('b', 3), ('a', 2), ('z', 1)]))
)
-- query
select *,
map_data1 = map_data2 are_equal
from dataset;
Which results in the following output for me (check out the z
key position):
map_data1 | map_data2 | are_equal |
---|---|---|
{a=2, z=1, b=3, c=4} | {a=2, b=3, z=1, c=4} | true |
And there is no documentation guarantying the order of keys in the map (I even found this bit which states that "Maps are not ordered"), so ordering of the data in the map is an implementation detail which can differ depended on number of factors.
There is one more note though, if you add cast to JSON to the previous code snippet:
select *,
CAST(map_data1 AS JSON) json_one,
CAST(map_data2 AS JSON) json_two
from dataset;
The result for JSON becomes sorted:
map_data1 | map_data2 | json_one | json_two |
---|---|---|---|
{a=2, z=1, b=3, c=4} | {a=2, b=3, z=1, c=4} | {"a":2,"b":3,"c":4,"z":1} | {"a":2,"b":3,"c":4,"z":1} |
Which clearly is an implementation detail, but for current one seems to be guaranteed if I have read the code correctly (not Trino developer here 😊). If you check out the MapToJsonCast implementation you will see the following code:
Map<String, Integer> orderedKeyToValuePosition = new TreeMap<>();
for (int i = 0; i < map.getSize(); i++) {
String objectKey = provider.getObjectKey(rawKeyBlock, rawOffset + i);
orderedKeyToValuePosition.put(objectKey, i);
}
See the variable name - orderedKeyToValuePosition
and usage of the java's TreeMap<>
which is:
A Red-Black tree based NavigableMap implementation
And
The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used.
In current Trino implementation it seems to be safe to use your query, but again - this is an implementation detail which might change without any notice. If you want to guarantee the behavior you might need to look into creating a user-defined function to calculate the hash (for example using Java, see the Create and deploy a UDF using Lambda doc or using Trino's UDFs).