I have two Spark DataFrames with different numbers of columns, where the first column of both is the ID. For each ID, I want to collect the remaining column values as a dictionary (the visuals should make clearer what I am trying to achieve).
What I have:
The result I am trying to get:
What I tried: [row.asDict() for row in df.collect()] gives me a list of dictionaries, but I could not find a way to add each dictionary as a column value.
PS: I understand my question could be an XY problem, but since I do not know all the concepts available in PySpark/Python, this is the best way I could think of to present the issue.
Assuming that each table only has one row per ID, you can create a MapType column for each table's fields using create_map, then inner join the two tables. Here is a smaller example:
from pyspark.sql import functions as F

# `spark` is an existing SparkSession
table1 = spark.createDataFrame(
    [("1", 34, 45), ("2", 78, 89)],
    ["ID", "col1", "col2"]
)
table2 = spark.createDataFrame(
    [("1", 43, 54), ("2", 11, 12)],
    ["ID", "col1", "col2"]
)

# Pack each table's non-ID columns into a single MapType column
table1 = table1.withColumn("table1_cols", F.create_map(
    F.lit("col1"), F.col("col1"), F.lit("col2"), F.col("col2")
))
table2 = table2.withColumn("table2_cols", F.create_map(
    F.lit("col1"), F.col("col1"), F.lit("col2"), F.col("col2")
))

result = table1.join(table2, on="ID", how="inner").select(
    "ID", "table1_cols", "table2_cols"
)
result.show(truncate=False)
result.printSchema()
Here is the resulting table and schema:
+---+------------------------+------------------------+
|ID |table1_cols |table2_cols |
+---+------------------------+------------------------+
|1 |{col1 -> 34, col2 -> 45}|{col1 -> 43, col2 -> 54}|
|2 |{col1 -> 78, col2 -> 89}|{col1 -> 11, col2 -> 12}|
+---+------------------------+------------------------+
root
|-- ID: string (nullable = true)
|-- table1_cols: map (nullable = false)
| |-- key: string
| |-- value: long (valueContainsNull = true)
|-- table2_cols: map (nullable = false)
| |-- key: string
| |-- value: long (valueContainsNull = true)
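If your real tables have many columns, listing every key/value pair by hand gets tedious. As a rough sketch (the helper name to_map_column is mine, not part of the PySpark API), you can build the create_map arguments from every non-ID column dynamically:

from itertools import chain
from pyspark.sql import functions as F

def to_map_column(df, map_name, id_col="ID"):
    # Interleave F.lit(name), F.col(name) pairs for every non-ID column
    kv_pairs = list(chain.from_iterable(
        (F.lit(c), F.col(c)) for c in df.columns if c != id_col
    ))
    return df.withColumn(map_name, F.create_map(*kv_pairs))

table1 = to_map_column(table1, "table1_cols")
table2 = to_map_column(table2, "table2_cols")

Individual values can then be read back from the map, e.g. F.col("table1_cols").getItem("col1").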