I have this PySpark dataframe:
df = spark.createDataFrame(
    [('JOHN', 'SAM'),
     ('JOHN', 'PETER'),
     ('JOHN', 'ROBIN'),
     ('BEN', 'ROSE'),
     ('BEN', 'GRAY')],
    ['DOCTOR', 'PATIENT'])
DOCTOR | PATIENT
JOHN | SAM
JOHN | PETER
JOHN | ROBIN
BEN | ROSE
BEN | GRAY
I need to concatenate the patient names for each doctor into a single row, so that the output looks like this:
DOCTOR | PATIENT
JOHN | SAM, PETER, ROBIN
BEN | ROSE, GRAY
How can I do this?
The simplest way I can think of is to use collect_list to gather the values per group, then concat_ws to join them into one string. Note that your columns are named DOCTOR and PATIENT, not col1 and col2:

import pyspark.sql.functions as f
df.groupBy("DOCTOR").agg(f.concat_ws(", ", f.collect_list("PATIENT")).alias("PATIENT"))