Tags: dataframe, apache-spark, pyspark, group-by, concatenation

Concatenating strings by rows in PySpark


I have this PySpark dataframe:

df = spark.createDataFrame(
    [('JOHN', 'SAM'),
     ('JOHN', 'PETER'),
     ('JOHN', 'ROBIN'),
     ('BEN', 'ROSE'),
     ('BEN', 'GRAY')],
    ['DOCTOR', 'PATIENT'])

which looks like this:

DOCTOR | PATIENT
JOHN   | SAM
JOHN   | PETER
JOHN   | ROBIN
BEN    | ROSE
BEN    | GRAY

I need to concatenate the patient names across rows so that the output looks like this:

DOCTOR | PATIENT
JOHN   | SAM, PETER, ROBIN
BEN    | ROSE, GRAY

How can I do this?


Solution

  • The simplest way I can think of is to use collect_list to gather the
    patient names per doctor, then concat_ws to join them into one string.
    (The original snippet referenced col1 and col2, which don't exist in
    this dataframe; the columns are DOCTOR and PATIENT.)

    import pyspark.sql.functions as f

    df.groupby("DOCTOR").agg(
        f.concat_ws(", ", f.collect_list(df.PATIENT)).alias("PATIENT")
    )