Tags: pyspark, cartesian-product

Pyspark Cartesian product of two columns in a dataframe


I have a pyspark DataFrame that contains two columns, each an array of strings. How can I make a new column that is the Cartesian product of the two arrays, without splitting them into two DataFrames and joining them, and without a UDF?

Example:

Input df:
+------+---------+
|a1    |a2       |
+------+---------+
|[1, 2]|[3, 4, 5]|
|[1, 2]|[7, 8]   |
+------+---------+

Output df:
+------+---------+------------------------------------------------+
|a1    |a2       |a3                                              |
+------+---------+------------------------------------------------+
|[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
|[1, 2]|[7, 8]   |[{1, 7}, {1, 8}, {2, 7}, {2, 8}]                |
+------+---------+------------------------------------------------+

Solution

  • You can nest transform to create the Cartesian product: the outer transform iterates over a1 and, for each element, the inner transform pairs it with every element of a2 as a struct.

    This yields a nested array (one inner array per element of a1), so wrap it in flatten to collapse it into a single flat array of structs.

    from pyspark.sql import functions as F

    df = df.withColumn('a3', F.flatten(F.expr('transform(a1, x -> transform(a2, y -> struct(x, y)))')))
    

    Result

    +------+---------+------------------------------------------------+
    |a1    |a2       |a3                                              |
    +------+---------+------------------------------------------------+
    |[1, 2]|[3, 4, 5]|[{1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}]|
    |[1, 2]|[7, 8]   |[{1, 7}, {1, 8}, {2, 7}, {2, 8}]                |
    +------+---------+------------------------------------------------+