Tags: arrays, pyspark

Convert string to array<string> without using regexp


Is there a way to convert a string like [R55, B66] back to array<string> without using regexp?

The Set-up

In the output below, the codes column is StringType. StringType is required for the output, but it makes consuming the output more challenging for users.

Sample Input Data
input_data = """[{
   "id":"1234",
   "code":"R55"
},{
   "id":"123",
   "code":"B66"
}]"""
input_df = spark.read.json(sc.parallelize([input_data]))
How the Table was created
from pyspark.sql import functions as F

output_df = input_df.groupBy('id').agg(F.array_agg('code').alias('codes'))
output_df = output_df.withColumn('codes', F.col('codes').cast("string"))  # StringType is required for the output
id    codes
1234  [R55, B66]
# schema
root
 |-- id: string (nullable = true)
 |-- codes: string (nullable = false)
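
For reference, the cast renders the array in Spark's display format rather than JSON: the elements come out unquoted, which is what makes the round trip hard. A quick peek at the raw value (a sketch, using the output_df above):

# Inspect the literal string produced by cast("string")
print(output_df.select('codes').first()[0])
# [R55, B66]   <- bracketed and comma-separated, but unquoted: not valid JSON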

The Concern

regexp seems to be the only option for converting the string column back into array<string>.

What Works
df = output_df.withColumn('codes_array', F.expr(r"regexp_extract_all(codes, '(\\w+)', 1)"))
id    codes       codes_array
1234  [R55, B66]  [R55, B66]
# schema
root
 |-- id: string (nullable = true)
 |-- codes: string (nullable = false)
 |-- codes_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
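One caveat with the regexp approach: '\\w+' only matches word characters, so any code containing punctuation gets split apart. A quick illustration (with a hypothetical hyphenated code, same session assumed):

from pyspark.sql import functions as F

demo = spark.createDataFrame([('[R-55, B66]',)], ['codes'])
demo.select(F.expr(r"regexp_extract_all(codes, '(\\w+)', 1)").alias('codes_array')).show(truncate=False)
# +-------------+
# |codes_array  |
# +-------------+
# |[R, 55, B66] |
# +-------------+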
What Does Not Work

Example 1 - returns NULL

from pyspark.sql import types as T

schema = T.ArrayType(T.StringType())
df = output_df.withColumn('codes_array', F.from_json(F.col('codes'), schema))
id    codes       codes_array
1234  [R55, B66]  NULL
# schema
root
 |-- id: string (nullable = true)
 |-- codes: string (nullable = false)
 |-- codes_array: array (nullable = false)
 |    |-- element: string (containsNull = true)

Example 2 - returns an error message

schema = T.ArrayType(T.StringType())
df = output_df.withColumn('codes_array', F.from_json(F.col('codes'), schema, {'mode': 'FAILFAST'}))

Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'R55': was expecting (JSON String, Number (or 'NaN'/'INF'/'+INF'), Array, Object or token 'null', 'true' or 'false')
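
Both failures have the same root cause: from_json defaults to PERMISSIVE mode, which turns the unparseable string into NULL (Example 1), while FAILFAST surfaces the underlying parse error (Example 2): the parser chokes on the unquoted R55 token. The same call succeeds once the elements are quoted (a minimal check, same session assumed):

from pyspark.sql import functions as F

demo = spark.createDataFrame([('["R55", "B66"]',)], ['codes'])
demo.select(F.from_json('codes', 'array<string>').alias('codes_array')).show()
# +-----------+
# |codes_array|
# +-----------+
# |[R55, B66] |
# +-----------+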

The Question(s)

Is there a way to read the codes column back to array<string> without using regexp?

Or is there a way to write the codes column to a StringType format that works with a non-regexp function (like from_json)?

(With my current approach, it seems that Spark is happy to convert types in one direction, but does not write them in a format that allows them to be easily converted back. Is there a better approach to take for this table read/write, given the StringType constraint?)


Solution

  • The string generated by directly using cast('string') is not a standard JSON string. Use the to_json function to generate a standard JSON string instead, and then parse it back with from_json.

    # 'codes' here is still the array<string> column (i.e., before the cast in the question)
    df = df.withColumn('codes', F.to_json('codes'))
    df = df.withColumn('codes_array', F.from_json('codes', 'array<string>'))
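
    Putting it together with the set-up from the question (a sketch, assuming the input_df defined above): replacing cast("string") with to_json at write time keeps codes a StringType column while making it round-trippable.

    from pyspark.sql import functions as F

    # Write side: to_json yields '["R55","B66"]' instead of '[R55, B66]'
    output_df = input_df.groupBy('id').agg(F.array_agg('code').alias('codes'))
    output_df = output_df.withColumn('codes', F.to_json('codes'))

    # Read side: from_json restores the original array<string>
    restored = output_df.withColumn('codes_array', F.from_json('codes', 'array<string>'))
    restored.show(truncate=False)
    # +----+-------------+-----------+
    # |id  |codes        |codes_array|
    # +----+-------------+-----------+
    # |1234|["R55","B66"]|[R55, B66] |
    # +----+-------------+-----------+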