regex, scala, apache-spark, apache-spark-sql, extract

Extract words from a string column in a Spark DataFrame


I have a column in a Spark DataFrame that contains text.

I want to extract all the words that start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it only returns the first one.

I am looking to extract all of the words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9_]+)", 1)).show()

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
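
For reference, the sample input can be reproduced with a one-row DataFrame like the sketch below (this assumes a SparkSession named spark with spark.implicits._ in scope):

import spark.implicits._

// One-row DataFrame holding the sample tweet text in a column named "text"
val data_frame = Seq(
  "@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking"
).toDF("text")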


Solution

  • You can create a UDF in Spark as below:

    import java.util.regex.Pattern
    import org.apache.spark.sql.functions.{col, lit, udf}
    
    // UDF that finds every match of the given regex and returns the requested
    // group from each match, joined into a comma-separated string
    def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
      val pattern = Pattern.compile(exp)
      val matcher = pattern.matcher(job)
      var result = Seq[String]()
      while (matcher.find) {
        result = result :+ matcher.group(groupIdx)
      }
      result.mkString(",")
    })
    

    And then call the UDF as below:

    data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()
    

    The above will give you output as below:

    +--------------------+
    |               Names|
    +--------------------+
    |@always_nidhi,@Yo...|
    +--------------------+
    

    I have used that regex as per the output you posted in the question. You can modify it to suit your needs.
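
    As an alternative, if you are on Spark 3.1 or later, the built-in regexp_extract_all SQL function returns every match as an array<string> without a UDF. A minimal sketch using expr (the doubled backslashes cover both Scala and SQL string escaping; adjust if your parser settings differ):

    import org.apache.spark.sql.functions.expr
    
    // Spark 3.1+ only: group index 0 keeps the whole match, e.g. "@YouTube"
    data_frame
      .withColumn("Names", expr("regexp_extract_all(text, '@\\\\w+', 0)"))
      .show(false)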