Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take the texts as the inputs; it takes maps instead. I am not sure how to convert the texts into those maps in presto. I want the following, if we have a table like this:
id | text1 | text2 |
---|---|---|
1 | a b b | b c |
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
i.e., two texts combined has three words: 'a', 'b', and 'c'; text1 has 1 count of 'a', 2 counts of 'b', and 0 count of 'c', which goes as the first MAP; similarly, text2 has 0 count of 'a', 1 count of 'b', and 1 count of 'c', which goes as the second MAP.
The final table should look like this:
id | text1 | text2 | all_unique_words | map1 | map2 | similarity |
---|---|---|---|---|---|---|
1 | a b b | b c | [a b c] | [1, 2, 0] | [0, 1, 1] | 0.63 |
How can we convert two texts into two such maps in presto? Thanks in advance!
Use split
to transform string into array and then depended on Presto version either use unnest
+histogram
trick or array_frequency
:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;
Output:
id | _col1 |
---|---|
1 | 0.6324555320336759 |
UPD
For Trino in the latest version array_histogram
function should be added.