hadoop | join | hive | bigdata | apache-pig

HIVE/PIG JOIN Based on SUBSTRING match


I have a requirement where I need to JOIN a tweets table with person names, i.e., keep only the tweets that contain a person's name. I have the following data:

Tweets Table: (70 million records stored as a HIVE Table)

id tweet
1 Cristiano Ronaldo greatest of all time
2 Brad Pitt movies
3 Random tweet without any person name

Person Names: (1.6 million names stored on HDFS as .tsv file)

id person_name
1 Cristiano Ronaldo
2 Brad Pitt
3 Angelina Jolie

Expected Result:

id tweet person_name
1 Cristiano Ronaldo greatest of all time Cristiano Ronaldo
2 Brad Pitt movies Brad Pitt

What I've tried so far:

I have converted the person-names .tsv file into a HIVE table as well and then tried to join the two tables with the following HIVE query:

SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;

I tried it with some sample data and it works fine. But when I run it on the entire data (70M tweets joined with 1.6M person names), it takes forever. It definitely doesn't look very efficient.

I wanted to try the JOIN with PIG as well (as it is considered a little more efficient than a HIVE JOIN), where I can directly JOIN the person-names .tsv file with the tweets HIVE table, but I am not sure how to JOIN based on a substring in PIG.

Can someone please share the PIG JOIN syntax for this problem? Also, please suggest any alternatives that I could use.


Solution

  • The idea is to create buckets so that we don't have to compare a lot of records. We are going to increase the number of records/joins so that multiple nodes share the work, instead of running one large cross join filtered by WHERE instr(t.tweet, p.person_name) > 0.

    1. I'd suggest splitting the tweets into individual words. Yes, this multiplies your record count way up.
    2. Filter out 'stopwords', or some other list of common words that fits in memory.
    3. Split the person names into first name(s) and last name.
    4. Join tweets and names on the last name, and then apply instr(t.tweet, p.person_name) > 0. This should significantly reduce the amount of data that you compare via a function call, so it will run faster.

    If you are going to do this regularly, consider creating the tables with sort/bucket settings to really make things sizzle. (Make it faster, as it can hopefully be Sort Merge Join ready.)
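    The steps above could be sketched in HiveQL roughly as follows. This is an untested sketch, not the answerer's exact code: the table names (`tweets`, `people`), the `stopwords` table, and the assumption that the last token of `person_name` is the last name are all illustrative.

    ```sql
    -- Step 1: explode each tweet into one row per word (record count goes way up).
    WITH tweet_words AS (
      SELECT t.id, t.tweet, w.word
      FROM tweets t
      LATERAL VIEW explode(split(t.tweet, '\\s+')) w AS word
    ),
    -- Step 2: drop common words using a small stopword table (assumed to exist).
    filtered_words AS (
      SELECT tw.*
      FROM tweet_words tw
      LEFT JOIN stopwords s ON tw.word = s.word
      WHERE s.word IS NULL
    ),
    -- Step 3: derive the last name as the final token of person_name
    -- (assumes "First Last" style names).
    names AS (
      SELECT p.id AS person_id,
             p.person_name,
             split(p.person_name, ' ')[size(split(p.person_name, ' ')) - 1] AS last_name
      FROM people p
    )
    -- Step 4: cheap equi-join on the last-name word first, then confirm the
    -- full name with instr() on the much smaller matched set.
    SELECT DISTINCT fw.id, fw.tweet, n.person_name
    FROM filtered_words fw
    JOIN names n
      ON fw.word = n.last_name
    WHERE instr(fw.tweet, n.person_name) > 0;
    ```

    Because the join key is now an equality on a single word, Hive can execute it as an ordinary distributed join (or a map-side join if the names table fits in memory), and the expensive `instr()` check only runs on rows that already share a last name.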