hadoophiverlike

Hive RLIKE on a JOIN synthax


I have two tables on hive. The first, called 'access', contains apache logs, where the first field is an complete ip adress :

10.4.5.12 - - [26/Jun/2010:11:16:09 +1000] "GET /myportal/pageA HTTP/1.1"
10.4.41.2 - - [26/Jun/2010:11:18:09 +1000] "GET /myportal/pageB HTTP/1.1"
10.5.1.111 - - [26/Jun/2010:11:22:09 +1000] "GET /myportal/pageA HTTP/1.1"
192.10.4.177 - - [26/Jun/2010:11:22:41 +1000] "GET /myportal/pageC HTTP/1.1"

and an other, called 'client' that contains the beginning of an ip range and a string :

10.4 clientA
10.5 clientB
10.7 ClientC

I would like to find the total of hits, by client, and display their name. So, I try to join this two tables like this :

SELECT client.name, count(access.ip) FROM access JOIN client WHERE access.ip RLIKE client.ip GROUP BY client.name;

It works but for the clientA, I get a hit for the last entry (192.10.4.177) of my apache log too, what I don't want. I would like to comparre client.ip only with the beginning of the access.ip.

I suppose a particular regex ... or maybe my synthax is wrong... can someone have an idea ?

Thanks in advance


Solution

  • RLIKE uses Java regular expressions. So you can use "^" to express starting with something. For example, you can use 'CONCAT("^",client.ip)' to put the "^" before client.ip.

    SELECT client.name, count(access.ip)
    FROM access JOIN client
    WHERE access.ip RLIKE CONCAT("^",client.ip)
    GROUP BY client.name;
    

    However, since "." is also a special character in a regular expression that means any character. So the above solution is not perfect. For example, if the client ip is 1.3, it may match '103.2.3.4'. So a better solution is escaping the "." in the client ip. Here is the final solution:

    SELECT client.name, count(access.ip)
    FROM access JOIN client
    WHERE access.ip RLIKE CONCAT("^",REGEXP_REPLACE(client.ip, "\\.", "\\."))
    GROUP BY client.name;
    

    The first \\. means a regular expression \. (We need to add "\" to specify "\" in Hive). The second \\. means a string \.. If you are not familiar with Java regular expression, it may confuse you.