We are analyzing the streaming twitter data to find users who are posting similar (almost same) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and the LAST() values correctly. Here is sample dataset:
I am using the following Query:
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
This correct identify the users that are posting same tweets over an over:
But I am not able to get the first tweet and the last tweet in the matching sequence.
There are multiple things at play.
The first is you only "have one row trigging a match" so first and last are the second row of you data. This can be seen by changing to ALL ROWS PER MATCH
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa')
)
SELECT
*
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ALL ROWS PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 2 | aaaa | 1 | aaaa | aaaa | 2 | 2 | 1 |
if you change to say a match that catches the first value and the lag values:
ALL ROWS PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
you now match both the first and latter rows..
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 1 | aaa | 1 | aaa | aaa | 1 | 1 | 1 |
elena | 2 | aaaa | 1 | aaa | aaaa | 1 | 2 | 2 |
now if we expand our test a little bit more with four rows of data:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('elena', 3, 'aaa'),
('elena', 4, 'aaaa')
)
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 1 | aaa | 1 | aaa | aaa | 1 | 1 | 1 |
elena | 2 | aaaa | 1 | aaa | aaaa | 1 | 2 | 2 |
elena | 3 | aaa | 1 | aaa | aaa | 1 | 3 | 3 |
elena | 4 | aaaa | 1 | aaa | aaaa | 1 | 4 | 4 |
we see those values are not double registering..
BUT we also see the first ID is correct for all rows, but the last is within the scope of the current matched row, so not after all matches as you are hoping.
If we flip back to one row per match
we do how ever get the results we are expecting.
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz')
)
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID | NUM_OF_TWEETS | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID |
---|---|---|---|---|---|
elena | 2 | aaa | aaaa | 1 | 2 |
eva | 2 | bbbb | bbbbb | 4 | 5 |