regexazure-data-factorygoogle-cloud-dataflow

ADF - Dataflow expression builder cleaning/parsing columns


After successfully transforming the EPOCH timestamp using the following code:

toTimestamp(toLong(toString(published)),'yyyy-MM-dd')

I have a hard time extracting the right information from this column which has 100 rows.

From this:

[["nlp/f/businessEvent/new-deal","New Deals",[["The space exploration company got a huge contract award from NASA in September."],["On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years."]],"about",{}]]

To this:

New Deals - The space exploration company got a huge contract award from NASA in September. On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years.

It should remove:

[["nlp/f/businessEvent/new-deal","
",[["
"],["
"]],"about",{}]]

Additionally in the: [["nlp/f/businessEvent/new-deal"," the businessEvent and new-deal can contain ANY strings

I was looking at:

replace(<string> : string, <substring to find> : string, [<substring to replace> : string]) => string

and

regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string

but I'm having a hard time to come with the right formula.


Solution

  • Expecting your structure will always look like that:

    [["a","b",[["c"],["d"]],"e",{}]]
    

    This regex could do the job:

    \[\["[^"]+","([^"]+)",\[\["([^"]+)"\],\["([^"]+)"\]\],"[^"]+",\{\}\]\]
    

    It uses captures groups like ([^"]+) to get the values of "b", "c" and "d":

    So concatenating the groups :

    group1 group2 group3
    

    Will do the job.

    Remark: This regex is not bullet proof and will not working in cases you have value for a, b, c, d or e like something"with quotes "inside. And will not work if you have spaces between values like [ ["a", "b",[ ["c"],["d"]],"e",{ } ]].

    BUT

    But you string really looks like a json of lists, strings and dictionary. Here is a python example to extract it in a REALLY more secure way:

    More secure

    import json
    
    # Examples of string
    # json_str = '[["a","b",[["c"],["d"]],"e",{}]]'
    json_str = '[["nlp/f/businessEvent/new-deal","New Deals",[["The space exploration company got a huge contract award from NASA in September."],["On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years."]],"about",{}]]'
    
    # Parse the JSON string
    data = json.loads(json_str)
    
    # Extract the values
    # a = data[0][0]
    b = data[0][1]
    c = data[0][2][0][0]
    d = data[0][2][1][0]
    # e = data[0][3]
    
    # Output the extracted values
    # print(f"a: {a}")
    print(f"b: {b}")
    print(f"c: {c}")
    print(f"d: {d}")
    # print(f"e: {e}")