I'm trying to use Snowflake's match_recognize tool to match a particular pattern across rows. The pattern consists of any sequence of a's and b's, provided that long runs of b's are excluded. In my test case, I want to allow runs of up to 4 b's to be included in the match.
Using the handy https://regexr.com/?2tp0k website, I was able to build the desired regexp:
((ab{0,4})+a)|a+
Applying it to this string:
baabbbaaaaaaaababbabbabbabbbabbbab
I get this one match (marked with square brackets), which I am happy with:
b[aabbbaaaaaaaababbabbabbabbbabbba]b
As desired, this absorbs into the match any run of b's that is 4 or shorter. (It doesn't pick up the b at the beginning of the string or the b at the end, but that is expected.) Also note that while the match doesn't contain any long runs of b's, there are plenty of b's spread throughout it.
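As a cross-check outside of any SQL engine, here is a minimal sketch using Python's `re` module. It assumes that Python's regex engine treats this simple expression the same way the online tester does; the variable names are only for illustration. It prints the span of the first match and the positions left outside it, where position i corresponds to row i in the test data.

```python
import re

# Input string copied from the question; position i corresponds to row i.
s = "baabbbaaaaaaaababbabbabbabbbabbbab"

# Same expression as above, in Python's re syntax.
pattern = re.compile(r"((ab{0,4})+a)|a+")

# search() returns the leftmost match; the quantifiers are greedy.
m = pattern.search(s)
start, end = m.span()  # end is exclusive

print(f"match covers positions {start}..{end - 1}: {m.group(0)}")

# Positions that fall outside the match.
unmatched = [i for i in range(len(s)) if not (start <= i < end)]
print("unmatched positions:", unmatched)
```

If the match spans everything from the first a to the last a, only the leading and trailing b should show up as unmatched.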
For some reason, when I use this regular expression as the pattern in Snowflake's match_recognize, it doesn't absorb all of the short runs of b's.
Hence, instead of matching the entire sequence matched above, it matches these:
baabbbaaaaaaaababbabbabbabbbabbbab
Any suggestions?
Here's the query that illustrates the result:
-- column1 is the row position, column2 is the symbol ('a' or 'b')
WITH data AS (
    SELECT * FROM VALUES
        ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
        (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
        (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
        (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data
MATCH_RECOGNIZE(
    ORDER BY column1
    MEASURES
        MATCH_NUMBER() AS "MATCH_NUMBER",
        MATCH_SEQUENCE_NUMBER() AS msq,
        CLASSIFIER() AS cl
    -- keep every input row in the output, including rows outside any match
    ALL ROWS PER MATCH WITH UNMATCHED ROWS
    PATTERN ( ((a b{0,4})+ a) | a+ )
    DEFINE
        a AS column2 = 'a',
        b AS column2 = 'b'
)
ORDER BY 1;
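In the result, rows 25-27 are not included in the match, and a new match starts at row 28.
Here is a variant of the query that adds AFTER MATCH SKIP PAST LAST ROW and changes the pattern to ((a+ b{1,4})+ a) | a+: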
WITH data AS (
    SELECT * FROM VALUES
        ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
        (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
        (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
        (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data
MATCH_RECOGNIZE(
    ORDER BY column1
    MEASURES
        MATCH_NUMBER() AS "MATCH_NUMBER",
        MATCH_SEQUENCE_NUMBER() AS msq,
        CLASSIFIER() AS cl
    ALL ROWS PER MATCH WITH UNMATCHED ROWS
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN ( ((a+ b{1,4})+ a) | a+ )
    DEFINE
        a AS column2 = 'a',
        b AS column2 = 'b'
)
ORDER BY 1;