sqlarraysjsongoogle-bigquery

How to retrieve a fixed length string from json array


I have a json array that I need to pull a string from. We're storing this data in Bigquery and for the most part it's been pretty easy, except for extracting the id from the user_id field from the array. There are seven possible strings on how the user_id could appear. For instance, in the json array, user_id could be something like this: "user_id": "client-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or it could be "user_id": "operator-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or "user_id": "auto_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

Here is what the sample json looks like:

{
    "members": [
        {
            "channel_count": 0,
            "profile_url": "something.png",
            "push_enabled": true,
            "push_trigger_option": "default",
            "state": "joined",
            "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
        }
    ]
}

I thought the best way to handle this originally was to run a sql replace like so: replace(replace('client-us-operator-us-iddddddd','client-us-', '' ), 'operator-us-', '') But because there are a couple different user_id prefixes it's making it difficult. I am not sure if those are all the possible edge cases. If it's just client and operator it's 12345678-abcd-0926-zyxw-de07e3f9ce85 and that works as expected. When it's "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85" it pulls in the whole string, and sometime the user_id appears as "agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

I just need everything to the right of the prefix. So in this case it would be 12345678-abcd-0926-zyxw-de07e3f9ce85. However depending on if it's an automated message, or a user responding the prefix length changes, as well as the hyphen or underscore.

I'm not the strongest when using regex, so I'm on here to see if there a better way of handling the extraction of this id from the whole user_id string.


Solution

  • REGEXP is what you will want to use since it allows for a defined set of pattern matching both for including and excluding. Answer at bottom.


    Breakdown of what the functions are doing:

    JSON_EXTRACT_SCALAR(json_string, json_path)

    JSON_EXTRACT_SCALAR(your_json_object, '$.user_id') -From my json, find the root key value pair for user_id.


    REGEXP_EXTRACT(string, regex_pattern): for a string, match a pattern.


    WITH sample_data AS (
      SELECT '''
      {
        "members": [
            {
                "channel_count": 0,
                "profile_url": "something.png",
                "push_enabled": true,
                "push_trigger_option": "default",
                "state": "joined",
                "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
            },
            {
                "channel_count": 1,
                "profile_url": "another.png",
                "push_enabled": false,
                "push_trigger_option": "custom",
                "state": "joined",
                "user_id": "operator-id-87654321-zyxw-0926-abcd-ce85de07e3f9"
            }
        ]
      }
      ''' AS json_data
    )
    
    SELECT 
      REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(m, '$.user_id'), r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$') AS extracted_user_id
    FROM 
      sample_data,
      UNNEST(JSON_EXTRACT_ARRAY(json_data, "$.members")) AS m