pythonpysparkpandas-explode

Adding quotes to list objects to format as a dictionary pyspark


I have a column of my dataframe data that contains date and value information in a long string. The column, we will call x for this purpose, is formatted as such:

x = "{date1:val1, date2:val2, date3:val3, ...}" 

I want to ultimately explode this data such that I create two new columns, one for date and one for val. In order to utilize the explode function, I understand that the column must be formatted as an array, not a string. So far, to handle this issue, I have removed the braces at the start and end of the string:

from pyspark.sql import functions as F

data = data.withColumn('x_1', F.regexp.replace('x', r'\{', ''))
data = data.withColumn('x_1', F.regexp.replace('x_1', r'\}', '') 

I then created a list variable:

data = data.withColumn('x_list', F.split('x_1', ', '))

I now have that x_list = [date1:val1, date2:val2, date3:val3, ...]

What I now need is to add quotes around each list element such that I ultimately get ['date1':'val1', 'date2':'val2', 'date3':'val3', ...]

I believe that it may be possible to iterate through the list and use regex to add quotes using the colon (:) as a split point, but I am struggling with how to do that. I believe that it would look something like:

for l in x_list:
   #some regex expression

Alternatively, I have considered creating a sublist of each list element, but I am not sure how I would then use those sublists to create two new columns.


Solution

  • This way you could avoid using udf :)

    date_val_string = "{date1:val1, date2:val2, date3:val3}"
    (
        spark.createDataFrame(pd.DataFrame({"col": [date_val_string]}))
        .withColumn("array1", f.expr("split(regexp_replace(col, '[{}]', ''), ', ')"))
        .withColumn("array2", f.expr("transform(array1, x->split(x, ':'))"))
        .selectExpr("explode(array2) as date_val")
        .selectExpr("date_val[0] as date", "date_val[1] as val")
        .show(truncate = False)
    )
    
    
    +-----+----+
    |date |val |
    +-----+----+
    |date1|val1|
    |date2|val2|
    |date3|val3|
    +-----+----+