python · apache-spark · pyspark · rdd

How to extract an element from an array in PySpark


I have a data frame of the following type:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want my output to be of the following type:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to split its elements into separate columns. What needs to be done?

I saw many answers using flatMap, but they increase the number of rows. I want the tuple's elements to be put into separate columns of the same row.

The following is my current schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone please explain the difference between DataFrames and RDDs?


Solution

  • Create sample data:

    from pyspark.sql import Row

    # One sample row whose col4 is an array, matching the question's structure.
    x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
    rdd = sc.parallelize(x)
    df = spark.createDataFrame(rdd)
    df.show()
    #+----+----+----+----------+
    #|col1|col2|col3|      col4|
    #+----+----+----+----------+
    #|  xx|  yy|  zz|[123, 234]|
    #+----+----+----+----------+
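
    As an aside, the intermediate RDD is not strictly needed here; spark.createDataFrame also accepts the list of Rows directly:

    # Equivalent construction without the RDD step.
    df = spark.createDataFrame(x)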
    

    Use getItem to extract each element of the array column into its own column, as shown below. In your actual case, replace col4 with collect_set(TIMESTAMP):

    df = df.withColumn("col5", df["col4"].getItem(1)) \
           .withColumn("col4", df["col4"].getItem(0))
    df.show()
    #+----+----+----+----+----+
    #|col1|col2|col3|col4|col5|
    #+----+----+----+----+----+
    #|  xx|  yy|  zz| 123| 234|
    #+----+----+----+----+----+
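
    If the array holds more than two elements, the same getItem pattern generalizes with a list comprehension. A minimal sketch, assuming a hypothetical known length n:

    from pyspark.sql import functions as F

    n = 2  # hypothetical known length of the array column
    df = spark.createDataFrame(x)  # rebuild from the sample data above
    new_cols = [F.col("col4").getItem(i).alias("col{}".format(4 + i)) for i in range(n)]
    df.select("col1", "col2", "col3", *new_cols).show()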
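
    On DataFrames vs. RDDs: a DataFrame is a distributed collection of rows with a named schema, which lets Spark's Catalyst optimizer plan and optimize queries; an RDD is the lower-level, schema-less collection underneath. The same reshaping can be done at the RDD level with map, though the DataFrame approach above is usually preferable. A sketch using the rdd built earlier:

    # RDD-level equivalent: map each Row to a new Row with the array unpacked.
    unpacked = rdd.map(lambda r: Row(col1=r.col1, col2=r.col2, col3=r.col3,
                                     col4=r.col4[0], col5=r.col4[1]))
    spark.createDataFrame(unpacked).show()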