I'm working with a PySpark DataFrame that looks something like this:
+--------------+-----------+
|customer_names|country    |
+--------------+-----------+
|jan,marek     |Poland     |
|anna,kasia    |Poland     |
|john,emma     |New Zealand|
|oliver,ava    |New Zealand|
|tomasz,ewa    |Poland     |
|liam,amelia   |New Zealand|
+--------------+-----------+
Each row has a string of first names separated by commas, along with the country they belong to. I want one row per name. For example, john,emma from New Zealand should turn into two separate rows.
Expected output:
+-------+-----------+
|cx_name| country|
+-------+-----------+
| jan| Poland|
| marek| Poland|
| anna| Poland|
| kasia| Poland|
| john|New Zealand|
| emma|New Zealand|
+-------+-----------+
Thanks!
Use the split and explode functions together.
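import pyspark.sql.functions as F
...
# split turns the comma-separated string into an array column;
# explode then emits one output row per array element.
df = df.select(F.explode(F.split('customer_names', ',')).alias('cx_name'), 'country')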
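If it helps to see what the two functions do row by row, here is a rough plain-Python sketch of the same transformation. It does not use Spark at all; the helper name split_and_explode and the two sample rows are made up for illustration, and it only mimics the behaviour of the one-liner above.

```python
# Plain-Python model of split + explode (illustration only, no Spark involved).
# The rows are a small sample taken from the question's DataFrame.
rows = [
    ("jan,marek", "Poland"),
    ("john,emma", "New Zealand"),
]


def split_and_explode(rows, sep=","):
    """Yield one (cx_name, country) pair per name in each separated string."""
    for customer_names, country in rows:
        # "split": turn the string into a list of names
        for name in customer_names.split(sep):
            # "explode": emit one output row per list element,
            # repeating the other column (country) each time
            yield name, country


result = list(split_and_explode(rows))
for cx_name, country in result:
    print(cx_name, country)
```

As in the Spark version, nothing is trimmed here, so if the real data has spaces after the commas (e.g. "john, emma") the names will keep the leading space; in PySpark you could wrap the exploded column in F.trim to remove it.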