scalaapache-sparkdataframeapache-spark-sql

How do I detect if a Spark DataFrame has a column


When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select

Example JSON schema:

{
  "a": {
    "b": 1,
    "c": 2
  }
}

This is what I want to do:

potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))

but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:

scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)

Solution

  • Just assume it exists and let it fail with Try. Plain and simple and supports an arbitrary nesting:

    import scala.util.Try
    import org.apache.spark.sql.DataFrame
    
    def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess
    
    val df = sqlContext.read.json(sc.parallelize(
      """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))
    
    hasColumn(df, "foobar")
    // Boolean = false
    
    hasColumn(df, "foo")
    // Boolean = true
    
    hasColumn(df, "foo.bar")
    // Boolean = true
    
    hasColumn(df, "foo.bar.foobar")
    // Boolean = true
    
    hasColumn(df, "foo.bar.foobaz")
    // Boolean = false
    

    Or even simpler:

    val columns = Seq(
      "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")
    
    columns.flatMap(c => Try(df(c)).toOption)
    // Seq[org.apache.spark.sql.Column] = List(
    //   foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)
    

    Python equivalent:

    from pyspark.sql.utils import AnalysisException
    from pyspark.sql import Row
    
    
    def has_column(df, col):
        try:
            df[col]
            return True
        except AnalysisException:
            return False
    
    df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()
    
    has_column(df, "foobar")
    ## False
    
    has_column(df, "foo")
    ## True
    
    has_column(df, "foo.bar")
    ## True
    
    has_column(df, "foo.bar.foobar")
    ## True
    
    has_column(df, "foo.bar.foobaz")
    ## False