dataframe scala apache-spark apache-kafka spark-streaming

How to convert column rows into a string variable using Spark Dataframe


I need to convert the rows of a single column into a string variable so I can use it in a where condition when loading from a DB table, instead of loading the entire table.

A sample dataframe is shown below.

depName    emp_name
develop    Astrid
develop    Freja
develop    Wilma
sales      Maja
sales      Alice
personnel  John
personnel  Marsh

I am expecting output like the below; please help.

val data='develop','develop','develop','sales','sales','personnel','personnel'

I tried the logic below, but the collect method takes too much time.

val result = df.select("depName").collect().map(_.getString(0)).mkString(",")


Solution

  • You need to select the column and collect it, which returns an array of Row. Map over the rows and use getString to convert each value to a string. Finally, mkString joins them into a single string with "," as the delimiter.

    import sparkSession.implicits._

    val df = List(
      ("develop", "Astrid"),
      ("develop", "Freja"),
      ("develop", "Wilma"),
      ("sales", "Maja"),
      ("sales", "Alice"),
      ("personnel", "John"),
      ("personnel", "Marsh")
    ).toDF("depName", "emp_name")

    // Collect the depName column, convert each Row value to a string,
    // and join them with "," as the delimiter
    val result = df.select("depName").collect().map(_.getString(0)).mkString(",")
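
    Since your expected output wraps each value in single quotes and the goal is to push a where condition down to the DB table, here is a minimal sketch along those lines. The connection URL, table name ("employees"), and the JDBC "query" option are assumptions for illustration, not part of your original setup; calling distinct() first is optional and only appropriate if the filter does not need duplicate values.

    // Build a quoted, comma-separated IN-list from the depName column.
    // distinct() (optional) reduces how many rows are collected on the driver.
    val inClause = df.select("depName")
      .distinct()
      .collect()
      .map(row => s"'${row.getString(0)}'")
      .mkString(",")

    // Hypothetical JDBC load that pushes the filter down to the database,
    // so only the matching rows are read instead of the whole table.
    val filtered = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // hypothetical connection
      .option("query", s"SELECT * FROM employees WHERE depName IN ($inClause)")
      .load()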