Tags: scala, apache-spark, apache-spark-sql, rlike

Using rlike in org.apache.spark.sql.Column


I am trying to implement a query in my Scala code that uses a regexp on a Spark Column to find all the rows in the column that contain a certain value, like:

 column.rlike(".*" + str + ".*")

str is a String that can be anything (except null or empty).

This works fine for the basic queries I am testing. However, being new to Spark/Scala, I am unsure whether there are any special cases that could break this code. Are there any characters I need to escape, or other special cases I should worry about?


Solution

  • This can be broken by any invalid regexp. You don't even have to try hard:

    Seq("[", "foo", " ba.r ").toDF.filter($"value".rlike(".*" + "[ " + ".*")).show
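
    The failure has nothing to do with Spark itself: `rlike` is backed by Java's regex engine, and `"["` starts an unclosed character class. A minimal stdlib sketch (no Spark needed, `str` is a hypothetical user input) shows the pattern failing to compile at all:

    ```scala
    import java.util.regex.{Pattern, PatternSyntaxException}

    // Hypothetical user input containing an unclosed character class
    val str = "["

    // rlike compiles its argument with java.util.regex, so an invalid
    // pattern raises an exception instead of simply matching nothing
    val compiles =
      try { Pattern.compile(".*" + str + ".*"); true }
      catch { case _: PatternSyntaxException => false }

    println(compiles)
    ```
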
    

    or it can give unexpected results if str is a non-trivial pattern itself. For simple cases like this you'll be better off with Column.contains:

    Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("[")).show
    Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("a.r")).show
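
    If you genuinely need `rlike` with arbitrary user-supplied text, one standard option (not specific to Spark, but it applies because `rlike` compiles its pattern with Java's regex engine) is `java.util.regex.Pattern.quote`, which wraps the string so every character is treated literally. A small sketch, using plain string matching to stand in for `rlike`:

    ```scala
    import java.util.regex.Pattern

    val str = "a.r"                  // "." would otherwise match any character
    val quoted = Pattern.quote(str)  // yields "\Qa.r\E" — matched literally

    // Unquoted, "a.r" matches "akr" because "." matches the "k"
    val unquotedFalsePositive = "bakr".matches(".*" + str + ".*")

    // Quoted, only the literal substring "a.r" matches
    val quotedFalsePositive = "bakr".matches(".*" + quoted + ".*")
    val quotedLiteralHit    = " ba.r ".matches(".*" + quoted + ".*")

    println((unquotedFalsePositive, quotedFalsePositive, quotedLiteralHit))
    ```

    This keeps the `.rlike(".*" + Pattern.quote(str) + ".*")` form safe for any non-null input, though for a plain substring check `Column.contains` remains the simpler choice.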