I am trying to implement a query in my Scala code that uses a regexp on a Spark Column
to find all the rows whose value in that column contains a certain string, like:
column.rlike(".*" + str + ".*")
str is a String that can be anything (except null or empty).
This works fine for the basic queries that I am testing. However, being new to Spark / Scala, I am unsure whether there are any special cases that could break this code. Are there any characters that I need to escape, or other special cases that I need to worry about?
This can be broken by any invalid regexp. You don't even have to try hard:
Seq("[", "foo", " ba.r ").toDF.filter($"value".rlike(".*" + "[ " + ".*")).show
or can give unexpected results if str is a non-trivial pattern itself. For simple cases like this you'll be better off with Column.contains:
Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("[")).show
Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("a.r")).show
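If you do need to stay with rlike (for example because str is only one part of a larger pattern), one option is to escape the input before building the regexp. The sketch below uses java.util.regex.Pattern.quote on plain Scala strings, on the assumption that rlike is backed by Java regular expressions with "find" semantics; LiteralMatch and literalPattern are hypothetical names used only for illustration, and no Spark session is involved.

```scala
import java.util.regex.Pattern
import scala.util.Try

object LiteralMatch {
  // Hypothetical helper: Pattern.quote wraps the input in \Q...\E,
  // so every character in it is matched literally.
  def literalPattern(str: String): String = Pattern.quote(str)

  def main(args: Array[String]): Unit = {
    val values = Seq("[", "foo", " ba.r ", "baxr")

    // Raw input "[" produces an invalid regexp (unclosed character class).
    println(Try(Pattern.compile(".*" + "[" + ".*")).isFailure)

    // Quoted, "[" is just a literal bracket.
    val bracket = Pattern.compile(literalPattern("["))
    println(values.filter(v => bracket.matcher(v).find()))

    // Unquoted, the '.' in "a.r" is a wildcard, so "baxr" matches as well.
    val raw = Pattern.compile("a.r")
    println(values.filter(v => raw.matcher(v).find()))

    // Quoted, only values containing the literal text "a.r" match.
    val quoted = Pattern.compile(literalPattern("a.r"))
    println(values.filter(v => quoted.matcher(v).find()))
  }
}
```

Under the same assumption, the Spark call would look like column.rlike(Pattern.quote(str)), without the surrounding ".*". Treat that as something to verify against your Spark version rather than a guarantee.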