I have a PySpark DataFrame that contains two columns, each an array of strings. How can I make a new column that is the cartesian product of them, without splitting them into two DataFrames and joining them, and without a UDF?
Example:
In df:
+------+---------+
|a1    |a2       |
+------+---------+
|[1, 2]|[3, 4, 5]|
|[1, 2]|[7, 8]   |
+------+---------+
Out df:
+------+---------+------------------------------------------------+
|a1    |a2       |a3                                              |
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|[7, 8]   |[{1, 7}, {1, 8}, {2, 7}, {2, 8}]                |
+------+---------+------------------------------------------------+
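For reference, a minimal sketch that builds the example input (column names a1/a2 as above; the question says arrays of strings, so the values are kept as strings even though they look numeric):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# two array<string> columns matching the question's example
df = spark.createDataFrame(
    [(["1", "2"], ["3", "4", "5"]),
     (["1", "2"], ["7", "8"])],
    ["a1", "a2"],
)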
You can try nesting transform to create the cartesian product. This will result in a nested array, which you can then flatten to get the final single array.
from pyspark.sql import functions as F

# nested transform builds array<array<struct>>; flatten collapses it to one level
df = df.withColumn('a3', F.flatten(F.expr('transform(a1, x -> transform(a2, y -> (x, y)))')))
Result
+------+---------+------------------------------------------------+
|a1 |a2 |a3 |
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|[7, 8] |[{1, 7}, {1, 8}, {2, 7}, {2, 8}] |
+------+---------+------------------------------------------------+
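If you are on Spark 3.1+, the same idea can also be expressed with the Python-level transform helper instead of a SQL expression string; a sketch under that assumption (the x/y struct field names are my own choice):

from pyspark.sql import functions as F

# same nesting via column-level lambdas (requires Spark 3.1+)
df = df.withColumn(
    'a3',
    F.flatten(
        F.transform('a1', lambda x: F.transform('a2', lambda y: F.struct(x.alias('x'), y.alias('y'))))
    ),
)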