I have a column in my dataframe data that contains date and value information in a long string. The column, which we will call x for this purpose, is formatted as follows:
x = "{date1:val1, date2:val2, date3:val3, ...}"
Ultimately, I want to explode this data to create two new columns, one for date and one for val. To use the explode function, I understand that the column must be formatted as an array, not a string. So far, I have removed the braces at the start and end of the string:
from pyspark.sql import functions as F

# Strip the opening and closing braces
data = data.withColumn('x_1', F.regexp_replace('x', r'\{', ''))
data = data.withColumn('x_1', F.regexp_replace('x_1', r'\}', ''))
I then split the string into an array column:
data = data.withColumn('x_list', F.split('x_1', ', '))
I now have that x_list = [date1:val1, date2:val2, date3:val3, ...]
What I now need is to add quotes around each list element such that I ultimately get ['date1':'val1', 'date2':'val2', 'date3':'val3', ...]
I believe it may be possible to iterate through the list and use regex to add the quotes, using the colon (:) as a split point, but I am struggling with how to do that. I imagine it would look something like:
for l in x_list:
    # some regex expression
Alternatively, I have considered creating a sublist of each list element, but I am not sure how I would then use those sublists to create two new columns.
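For reference, the closest I can get to expressing that sublist idea at the column level is a sketch like the following (assuming pyspark.sql.functions.transform, available since Spark 3.1, is the right tool; x_sublists is just a placeholder name):

from pyspark.sql import functions as F

# Split each 'date:val' element on the colon, so x_sublists becomes
# [[date1, val1], [date2, val2], [date3, val3], ...]
data = data.withColumn('x_sublists', F.transform('x_list', lambda s: F.split(s, ':')))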
This way you can avoid using a udf :)
import pandas as pd
from pyspark.sql import functions as f

date_val_string = "{date1:val1, date2:val2, date3:val3}"
(
    spark.createDataFrame(pd.DataFrame({"col": [date_val_string]}))
    # strip the braces, then split on ', ' -> [date1:val1, date2:val2, date3:val3]
    .withColumn("array1", f.expr("split(regexp_replace(col, '[{}]', ''), ', ')"))
    # split each element on ':' -> [[date1, val1], [date2, val2], [date3, val3]]
    .withColumn("array2", f.expr("transform(array1, x -> split(x, ':'))"))
    # one row per [date, val] pair
    .selectExpr("explode(array2) as date_val")
    .selectExpr("date_val[0] as date", "date_val[1] as val")
    .show(truncate=False)
)
+-----+----+
|date |val |
+-----+----+
|date1|val1|
|date2|val2|
|date3|val3|
+-----+----+
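If your Spark version has the built-in str_to_map (whose delimiters are treated as regular expressions), the same result can be sketched in a single step; this is an alternative to the transform approach above, reusing the same date_val_string:

(
    spark.createDataFrame(pd.DataFrame({"col": [date_val_string]}))
    # str_to_map parses 'key:value' pairs; exploding a map yields key/value columns
    .selectExpr("explode(str_to_map(regexp_replace(col, '[{}]', ''), ', ', ':')) as (date, val)")
    .show(truncate=False)
)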