Is there a way to convert a string like `[R55, B66]` back to `array<string>` without using regexp?
In this output, the `codes` column is StringType. StringType is required for the output, but it makes consuming the output more challenging for users.
from pyspark.sql import functions as F, types as T
# sc is the SparkContext (spark.sparkContext outside a notebook)

input_data = """[{
"id":"1234",
"code":"R55"
},{
"id":"1234",
"code":"B66"
}]"""
input_df = spark.read.json(sc.parallelize([input_data]))
output_df = input_df.groupBy('id').agg(F.array_agg('code').alias('codes'))
output_df = output_df.withColumn('codes', F.col('codes').cast("string"))
id | codes |
---|---|
1234 | [R55, B66] |
# schema
root
|-- id: string (nullable = true)
|-- codes: string (nullable = false)
regexp seems to be the only option to convert the string column back into `array<string>`:
df = output_df.withColumn('codes_array', F.expr(r"regexp_extract_all(codes, '(\\w+)', 1)"))
id | codes | codes_array |
---|---|---|
1234 | [R55, B66] | [R55, B66] |
# schema
root
|-- id: string (nullable = true)
|-- codes: string (nullable = false)
|-- codes_array: array (nullable = false)
| |-- element: string (containsNull = true)
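For reference, the `(\w+)` pattern above behaves the same way as Python's `re.findall` outside Spark, which is an easy way to sanity-check it against a sample value (standalone illustration, no Spark needed):

```python
import re

# The same pattern regexp_extract_all uses: every run of word characters
codes_string = "[R55, B66]"
print(re.findall(r"\w+", codes_string))  # ['R55', 'B66']
```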
Example 1 - returns NULL
schema = T.ArrayType(T.StringType())
df = output_df.withColumn('codes_array', F.from_json(F.col('codes'), schema))
id | codes | codes_array |
---|---|---|
1234 | [R55, B66] | NULL |
# schema
root
 |-- id: string (nullable = true)
 |-- codes: string (nullable = false)
|-- codes_array: array (nullable = false)
| |-- element: string (containsNull = true)
Example 2 - returns an error message
schema = T.ArrayType(T.StringType())
df = output_df.withColumn('codes_array', F.from_json(F.col('codes'), schema, {'mode': 'FAILFAST'}))
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'R55': was expecting (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false')
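The error makes sense once you compare the two string shapes in plain Python: the output of `cast('string')` is not valid JSON because the elements are unquoted, while the quoted form parses fine. A quick standalone check (no Spark needed):

```python
import json

# What cast('string') produces: unquoted elements, not valid JSON
try:
    json.loads("[R55, B66]")
except json.JSONDecodeError as e:
    print(f"parse failed: {e}")  # same class of failure as Spark's Jackson error

# A standard JSON array of strings parses without issue
print(json.loads('["R55", "B66"]'))  # ['R55', 'B66']
```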
Is there a way to read the `codes` column back to `array<string>` without using regexp? Or is there a way to write the `codes` column to a StringType format that works with a non-regexp function (like `from_json`)?
(With my current approach, it seems that Spark is happy to convert types in one direction, but does not write them in a format that can easily be converted back. Is there a better approach to take for this table read/write given the StringType constraint?)
The string generated by directly using `cast('string')` is not a standard JSON string. You can use the `to_json` function to generate a standard JSON string, and then parse it back with `from_json`.
# Note: 'codes' must still be the array column here
# (apply to_json in place of cast('string'), not after it)
df = df.withColumn('codes', F.to_json('codes'))
df = df.withColumn('codes_array', F.from_json('codes', 'array<string>'))