I have a text column in a Spark DataFrame.
I want to extract all the words that start with the special character '@'
from each row of that column, and I am using regexp_extract to do it.
If the text contains multiple words starting with '@',
it returns only the first one.
How can I extract every word that matches my pattern in Spark?
data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9_]+)",1)).show
Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: @always_nidhi,@YouTube
You can define a UDF in Spark as follows:
import java.util.regex.Pattern
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  // A null cell has nothing to match, so pass the null through
  if (job == null) null
  else {
    // Walk over every match and collect the requested capture group
    val m = Pattern.compile(exp).matcher(job)
    var result = Seq[String]()
    while (m.find) {
      result = result :+ m.group(groupIdx)
    }
    // Join all matches into one comma-separated string
    result.mkString(",")
  }
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()
This gives the following output:
+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
I chose the regex "@\\w+" to match the output you posted in the question. You can modify it to suit your needs.
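The matching loop inside the UDF can also be checked without a Spark session. The following is a minimal sketch in plain Scala; extractAll is a hypothetical helper name used only for illustration (it is not part of Spark), and the sample string is a shortened version of the input from the question.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical helper (not a Spark API): the same find loop as the UDF body,
// returning the matches as a sequence instead of a joined string.
def extractAll(text: String, exp: String, groupIdx: Int): Seq[String] = {
  if (text == null) Seq.empty
  else {
    val m = Pattern.compile(exp).matcher(text)
    val matches = ListBuffer[String]()
    // find() advances to the next match on every call
    while (m.find()) {
      matches += m.group(groupIdx)
    }
    matches.toList
  }
}

val sample = "@always_nidhi @YouTube no i dnt understand bt i loved the music"
// Group 0 is the whole match, so the leading '@' is kept
val names = extractAll(sample, "@\\w+", 0).mkString(",")
println(names) // prints "@always_nidhi,@YouTube"
```

Passing 0 as the group index returns the full match including the '@'. To drop the '@', wrap the word in a capture group, for example "@(\\w+)", and pass 1 instead.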