sqlregexgoogle-bigquery

BigQuery REGEXP_EXTRACT end of URL without unsupported lookback


I'm trying to extract the 'abc-def' part of the URLs below. I can do it with two different regex patterns. (see examples) Is it possible to write one regex that works for all cases below?

(BigQuery doesn't seem to support lookback)

SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def/', r'([^/]+)/?$')
UNION ALL

SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def', r'([^/]+)/?$')
UNION ALL

SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def?p=294', r'([^/]+)/?[$|\?]')
UNION ALL

SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def/?p=294', r'([^/]+)/?[$|\?]')
UNION ALL

SELECT REGEXP_EXTRACT('http://www.example.com/abc-def/?p=294', r'([^/]+)/?[$|\?]')

Expected output 'abc-def'


Solution

  • Note that [$|\?] matches either a $, | or ? chars since [...] specifies a character class.

    Using REGEXP_EXTRACT that only returns the first match from the given input string, you may use the ([^/?]+)/?(?:$|\?) regex:

    REGEXP_EXTRACT(col, r'([^/?]+)/?(?:$|\?)')
    

    Details

    If you want to test the pattern at regex101, make sure you test against each input individually, not a multiline string.

    Another solution is using the ^(?:.*/)?([^/?]+) pattern (add \n into the negated character class when testing at regex101.com).