hivehiveqlclouderaapache-tezhive-query

Hive Joins on String Performance


We are using hive 3.1.3 and we're running Queries on Tez engine (Hive on Tez) on Cloudera platform.

We are still in the beginning phase and we have a situation where we want to join several tables together using string columns.

for example

table 1

table 1

table 2

enter image description here

and we want join table 1 with table 2 using table1.id = table2.id (note that the ids are hexadecimal values)

Now one of the discussion we're making is that joining using string values will affect the performance of the queries and it is better to change these ID columns from string to integer. but we really do not have the luxury to do that.

My question is, if we changed the ID columns to integers, will we notice that much of a difference in the query performance (Specially Joins), is it worth the change from String to integer ? (our largest table is almost 600 million records)

Thank you in advance.


Solution

  • There is a long debate on efficiency of join between INT=INT or STRING=STRING.

    Long answer - The main logic of int join is, int arent case sensitive, have no extra space around it, have no possibility of operating system induced extra character in it, and int normally takes less space to store so join condition require less memory to store and process them.
    But modern DBMS systems can do join in much efficient way so even if you join on string and string length is not abnormally huge, its a perfect candidate to join on.
    Main important point should be- if you can get a unique join column which can be used to create 1:1 or 1:m join, use it irrespective of int or string.
    Give INT a preference if you have both because a string can be uppercase in one system and lowercase in another, can have spaces/special characters etc. in one system and not in another.