I'm trying to extract the 'abc-def' part of the URLs below. I can do it with two different regex patterns. (see examples) Is it possible to write one regex that works for all cases below?
(BigQuery doesn't seem to support lookback)
SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def/', r'([^/]+)/?$')
UNION ALL
SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def', r'([^/]+)/?$')
UNION ALL
SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def?p=294', r'([^/]+)/?[$|\?]')
UNION ALL
SELECT REGEXP_EXTRACT('https://www.example.com/post/abc-def/?p=294', r'([^/]+)/?[$|\?]')
UNION ALL
SELECT REGEXP_EXTRACT('http://www.example.com/abc-def/?p=294', r'([^/]+)/?[$|\?]')
Expected output 'abc-def'
Note that [$|\?]
matches either a $
, |
or ?
chars since [...]
specifies a character class.
Using REGEXP_EXTRACT
that only returns the first match from the given input string, you may use the ([^/?]+)/?(?:$|\?)
regex:
REGEXP_EXTRACT(col, r'([^/?]+)/?(?:$|\?)')
Details
([^/?]+)
- Group 1:/?
- an optional /
symbol(?:$|\?)
- a non-capturing group matching either end of string or a ?
char.If you want to test the pattern at regex101, make sure you test against each input individually, not a multiline string.
Another solution is using the ^(?:.*/)?([^/?]+)
pattern (add \n
into the negated character class when testing at regex101.com).
^
- start of string(?:.*/)?
- an optional sequence of any zero or more chars as many as possible followed with a /
char([^/?]+)
- Group 1: any one or more chars other than /
and ?
.