After successfully transforming the EPOCH timestamp using the following code:
toTimestamp(toLong(toString(published)),'yyyy-MM-dd')
I have a hard time extracting the right information from this column which has 100 rows.
From this:
[["nlp/f/businessEvent/new-deal","New Deals",[["The space exploration company got a huge contract award from NASA in September."],["On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years."]],"about",{}]]
To this:
New Deals - The space exploration company got a huge contract award from NASA in September. On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years.
It should remove:
[["nlp/f/businessEvent/new-deal","
",[["
"],["
"]],"about",{}]]
Additionally in the: [["nlp/f/businessEvent/new-deal"," the businessEvent and new-deal can contain ANY strings
I was looking at:
replace(<string> : string, <substring to find> : string, [<substring to replace> : string]) => string
and
regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string
but I'm having a hard time to come with the right formula.
Expecting your structure will always look like that:
[["a","b",[["c"],["d"]],"e",{}]]
This regex could do the job:
\[\["[^"]+","([^"]+)",\[\["([^"]+)"\],\["([^"]+)"\]\],"[^"]+",\{\}\]\]
It uses captures groups like ([^"]+)
to get the values of "b", "c" and "d":
group1
group2
group3
So concatenating the groups :
group1 group2 group3
Will do the job.
Remark: This regex is not bullet proof and will not working in cases you have value for a, b, c, d or e like something"with quotes "inside
. And will not work if you have spaces between values like [ ["a", "b",[ ["c"],["d"]],"e",{ } ]]
.
But you string really looks like a json of lists, strings and dictionary. Here is a python example to extract it in a REALLY more secure way:
import json
# Examples of string
# json_str = '[["a","b",[["c"],["d"]],"e",{}]]'
json_str = '[["nlp/f/businessEvent/new-deal","New Deals",[["The space exploration company got a huge contract award from NASA in September."],["On Sept. 17, NASA awarded Intuitive Machines a Near Space Network contract with a maximum potential value of $4.82 billion over the next 10 years."]],"about",{}]]'
# Parse the JSON string
data = json.loads(json_str)
# Extract the values
# a = data[0][0]
b = data[0][1]
c = data[0][2][0][0]
d = data[0][2][1][0]
# e = data[0][3]
# Output the extracted values
# print(f"a: {a}")
print(f"b: {b}")
print(f"c: {c}")
print(f"d: {d}")
# print(f"e: {e}")