Tags: scala, apache-spark, three-valued-logic

Spark Scala: Check if a string isn't null or empty


First, due to three-valued logic, this isn't just the negation of any valid implementation of a null-or-empty check.

I want to make a function isNotNullish, which is as close as possible to isNotNull but also filters out empty strings. I'm running into some oddities involving how columns and column types work, as well as three-valued logic. So far the best I have is:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

def isNotNullish(questionable: Column): Column = {
  val oddish = questionable === lit("")  // null if questionable is null (or, sometimes, a non-string type)
  questionable.isNotNull && (!oddish || oddish.isNull)  // (!null || true) = true, so this is never null
}

If that looks strange, it's because it is. As far as I can tell, questionable === lit("") will return null for a given row if that row holds either null or (some!) non-string types. In three-valued logic, true && null = null, which would cause questionable.isNotNull && (questionable =!= lit("")) to return null in some cases where I'd want it to return true. The questionable.isNotNull && (!oddish || oddish.isNull) version should never yield null, only true or false.
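
To see the truth table concretely, it's easy to check in a spark-shell (a quick sketch, assuming the usual spark session):

// SQL three-valued logic: AND/OR with NULL operands
spark.sql(
  "SELECT true AND NULL AS t_and_n, false AND NULL AS f_and_n, " +
  "true OR NULL AS t_or_n, false OR NULL AS f_or_n"
).show()
// +-------+-------+------+------+
// |t_and_n|f_and_n|t_or_n|f_or_n|
// +-------+-------+------+------+
// |   null|  false|  true|  null|
// +-------+-------+------+------+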

This almost, but not quite, works: for some reason I don't understand, the === comparison is happy to return null for numeric types, but fails outright for complex types. (Even if what I'm trying to do is impossible or inadvisable, I'd love an explanation for that.)
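
For illustration, here's roughly the asymmetry I mean, on a made-up frame where num is numeric and arr is an array (names hypothetical):

import spark.implicits._
import org.apache.spark.sql.functions.{col, lit}

val df = Seq((1, Seq("a")), (2, Seq("b"))).toDF("num", "arr")

// numeric vs. string literal: Spark inserts an implicit cast, and casting
// "" to a numeric yields null, so the comparison is null for every row
df.select((col("num") === lit("")).alias("numEq")).show()

// array vs. string literal: no implicit cast exists, so this fails at
// analysis time with a "data type mismatch" AnalysisException
// df.select(col("arr") === lit(""))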

It might be more responsible to check the schema and simply do a different test for string types than for others, but as far as I can tell that requires the DataFrame to be passed as a parameter (sketched below). I'm trying to avoid that, especially for long sequences of transformations where said parameter might be anonymous.
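
For reference, the schema-aware version I'm trying to avoid would look something like this (a sketch only):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

// needs the whole DataFrame, not just the Column, so it can't sit in the
// middle of a transformation chain the way isNotNull can
def isNotNullish(df: DataFrame, name: String): Column =
  df.schema(name).dataType match {
    case StringType => col(name).isNotNull && col(name) =!= lit("")
    case _          => col(name).isNotNull
  }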

(Before anyone asks: I know that dealing with null and types this sloppily would be horrible in general Scala, but I think it's different in the context of Spark/SQL/huge dataframes of varying schema. The specific case is automatic exploration of sketchy data, so being able to answer questions like "I don't know what the columns are, but tell me how often they hold actual values" is useful.)


Solution

  • Does it work for you to use the null-safe equality operator <=>?

        import org.apache.spark.sql.Column
        import org.apache.spark.sql.functions.lit
        import spark.implicits._  // for the $"colName" syntax

        // <=> is null-safe equality: it yields false, never null, when c is null
        def notNullAndEmpty(c: Column): Column =
          c.isNotNull and !(c <=> lit(""))

        val employees = spark.createDataFrame(Seq(
            ("E1", "100.0"), ("E2", "200.0"), ("E3", null), ("E4", "")
          )).toDF("employee", "salary")
        employees.show()
        employees.filter(notNullAndEmpty($"salary")).show()
    
    

    Data -

    +--------+------+
    |employee|salary|
    +--------+------+
    |      E1| 100.0|
    |      E2| 200.0|
    |      E3|  null|
    |      E4|      |
    +--------+------+
    

    Result

    +--------+------+
    |employee|salary|
    +--------+------+
    |      E1| 100.0|
    |      E2| 200.0|
    +--------+------+
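
    Why this works: <=> is Spark's null-safe equality operator, which always returns true or false (never null, even when one side is null), so the three-valued-logic trap from the question can't occur. A quick check, using the same spark session:

        spark.sql("SELECT NULL = '' AS eq, NULL <=> '' AS null_safe_eq").show()
        // +----+------------+
        // |  eq|null_safe_eq|
        // +----+------------+
        // |null|       false|
        // +----+------------+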