dataframe pyspark

PySpark: lit a dictionary into a DataFrame


I have a Python dict as shown below:

cam = {"emp_id":1234, "emp_acct": [6784, 8901], "start_date":"2002-05-06"}

I want to create a new DataFrame by adding this dict as a new column to an existing DataFrame. I tried:

df = input_df.withColumn("cam_details", lit(cam))

This throws an error, apparently because lit does not accept a dict.


Solution

  • Since you haven't shared the expected output schema, here is what I think you want:

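    import json

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the existing DataFrame
    input_df = spark.createDataFrame([(1,)], "id int")

    cam = {"emp_id":1234, "emp_acct": [6784, 8901], "start_date":"2002-05-06"}

    # Target type of the new column (note that emp_id is read as a string here)
    cam_schema = "struct<emp_id:string, emp_acct:array<int>, start_date:date>"

    # Serialize the dict to JSON, wrap it in a string literal, then parse it into a struct
    df = input_df.withColumn("cam_details", F.from_json(F.lit(json.dumps(cam)), cam_schema))

    df.printSchema()
    df.show(truncate=False)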
    

    Output:

    root
     |-- id: integer (nullable = true)
     |-- cam_details: struct (nullable = true)
     |    |-- emp_id: string (nullable = true)
     |    |-- emp_acct: array (nullable = true)
     |    |    |-- element: integer (containsNull = true)
     |    |-- start_date: date (nullable = true)
    
    +---+--------------------------------+
    |id |cam_details                     |
    +---+--------------------------------+
    |1  |{1234, [6784, 8901], 2002-05-06}|
    +---+--------------------------------+
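
One detail worth checking when a dict is passed to from_json as text: str(cam) produces the Python repr of the dict, which uses single quotes, whereas json.dumps(cam) produces strict JSON. As far as I know, Spark's JSON reader tolerates single quotes by default (the allowSingleQuotes option), so both can work for this particular dict, but the repr stops being parseable as soon as the dict holds values such as True or None. A minimal sketch comparing the two in plain Python, no Spark needed (the variable names as_repr, as_json and repr_is_valid_json are only for illustration):

```python
import json

# Same dictionary as in the question.
cam = {"emp_id": 1234, "emp_acct": [6784, 8901], "start_date": "2002-05-06"}

# str() returns the Python repr: single-quoted keys and strings, not strict JSON.
as_repr = str(cam)

# json.dumps() returns strict JSON with double-quoted keys and strings.
as_json = json.dumps(cam)

print(as_repr)
print(as_json)

# The json.dumps() output round-trips through a strict JSON parser.
assert json.loads(as_json) == cam

# The repr is rejected by a strict JSON parser.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(repr_is_valid_json)
```

Using json.dumps keeps the string literal valid regardless of how lenient the parser on the Spark side is configured to be.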