I have two tables on hive. The first, called 'access', contains apache logs, where the first field is an complete ip adress :
10.4.5.12 - - [26/Jun/2010:11:16:09 +1000] "GET /myportal/pageA HTTP/1.1"
10.4.41.2 - - [26/Jun/2010:11:18:09 +1000] "GET /myportal/pageB HTTP/1.1"
10.5.1.111 - - [26/Jun/2010:11:22:09 +1000] "GET /myportal/pageA HTTP/1.1"
192.10.4.177 - - [26/Jun/2010:11:22:41 +1000] "GET /myportal/pageC HTTP/1.1"
and an other, called 'client' that contains the beginning of an ip range and a string :
10.4 clientA
10.5 clientB
10.7 ClientC
I would like to find the total of hits, by client, and display their name. So, I try to join this two tables like this :
SELECT client.name, count(access.ip) FROM access JOIN client WHERE access.ip RLIKE client.ip GROUP BY client.name;
It works but for the clientA, I get a hit for the last entry (192.10.4.177) of my apache log too, what I don't want. I would like to comparre client.ip only with the beginning of the access.ip.
I suppose a particular regex ... or maybe my synthax is wrong... can someone have an idea ?
Thanks in advance
RLIKE uses Java regular expressions. So you can use "^" to express starting with something. For example, you can use 'CONCAT("^",client.ip)' to put the "^" before client.ip
.
SELECT client.name, count(access.ip)
FROM access JOIN client
WHERE access.ip RLIKE CONCAT("^",client.ip)
GROUP BY client.name;
However, since "." is also a special character in a regular expression that means any character. So the above solution is not perfect. For example, if the client ip is 1.3
, it may match '103.2.3.4'. So a better solution is escaping the "." in the client ip. Here is the final solution:
SELECT client.name, count(access.ip)
FROM access JOIN client
WHERE access.ip RLIKE CONCAT("^",REGEXP_REPLACE(client.ip, "\\.", "\\."))
GROUP BY client.name;
The first \\.
means a regular expression \.
(We need to add "\" to specify "\" in Hive). The second \\.
means a string \.
. If you are not familiar with Java regular expression, it may confuse you.