apache-spark pyspark azure-databricks

How to convert grouped names into distinct person entries with country preserved


I'm working with a PySpark DataFrame that looks something like this:

+--------------+-----------+
|customer_names|country    |
+--------------+-----------+
|jan,marek     |Poland     |
|anna,kasia    |Poland     |
|john,emma     |New Zealand|
|oliver,ava    |New Zealand|
|tomasz,ewa    |Poland     |
|liam,amelia   |New Zealand|
+--------------+-----------+

Each row has a string of first names separated by commas, along with the country they belong to. I want each name in its own row, with the country preserved. For example, john,emma from New Zealand should turn into two separate rows.

The expected output looks like this:

+-------+-----------+
|cx_name|    country|
+-------+-----------+
|    jan|     Poland|
|  marek|     Poland|
|   anna|     Poland|
|  kasia|     Poland|
|   john|New Zealand|
|   emma|New Zealand|
+-------+-----------+

Thanks!


Solution

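  • Use the split and explode functions together: split turns the comma-separated string into an array, and explode produces one row per array element, repeating the country for each.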

    import pyspark.sql.functions as F
    
    ...
    df = df.select(F.explode(F.split('customer_names', ',')).alias('cx_name'), 'country')