sqlregexsnowflake-cloud-data-platformregexp-replaceregexp-substr

regex expression to include german characters


I have a string that looks like this:

{"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}

I want to do a REGEXP_SUBSTR_ALL such that I can extract all "name" values into a list.

Something like this allows me to extract some names but not all. This is because some names include german characters like "Ü" which are not included with '\w+'

SELECT REGEXP_SUBSTR_ALL(ARRAY_TO_STRING(REGEXP_SUBSTR_ALL((ARRAY_TO_STRING(REGEXP_SUBSTR_ALL(ARTICLE_TAGS, '"name\\W+\\w+"'),',')), ':"\\w+'),','),'\\w+') FROM TABLE

For example, the expression above would give me this output:

[
    "simpleword"
]

while my desired output is this:

[
    "simpleword", "f23ÜÜ"
]

Solution

  • Just match everything that is not a ":

    with data(s) as (
        select $${"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}$$
    )
    
    select regexp_substr_all(s, 'name":"([^"]*)"', 1, 1, '', 1)
    from data
    
    -- [   "simpleword",   "f23ÜÜ" ]
    

    Also an alternative (as discussed in the comments, with the 'e' regex_parameters:

    with data(s) as (
        select $${"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}$$
    )
    
    select regexp_substr_all(s, 'name":"([^"]*)"', 1, 1, 'e')
    from data