sql, scala, apache-spark, dataframe, apache-spark-sql

Select field only if it exists (SQL or Scala)


The input dataframe may not always have all the columns. In SQL or Scala, I want to create a select statement where, even if the dataframe does not have a given column, it won't error out and will only output the columns that do exist.

For example, this statement will work.

Select store, prod, distance from table

+-----+------+--------+
|store|prod  |distance|
+-----+------+--------+
|51   |42    |2       |
|51   |42    |5       |
|89   |44    |9       |
+-----+------+--------+

If the dataframe looks like the one below, I want the same statement to work: ignore what's not there and output only the existing columns (in this case 'store' and 'prod').

+-----+------+
|store|prod  |
+-----+------+
|51   |42    |
|51   |42    |
|89   |44    |
+-----+------+

Solution

  • You can keep a list of all expected columns, either hard-coded or prepared from other metadata, and intersect it with the dataframe's actual columns:

    import org.apache.spark.sql.functions.col
    
    val columnNames = Seq("c1", "c2", "c3", "c4")
    
    // Select only the requested columns that actually exist in df
    df.select(df.columns.intersect(columnNames).map(col): _*).show()
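
  • For a quick end-to-end check, here is a minimal, self-contained sketch of the same idea (assumptions: a local SparkSession and the question's second example data; the app name and the wanted column list are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    
    val spark = SparkSession.builder().master("local[*]").appName("select-if-exists").getOrCreate()
    import spark.implicits._
    
    // DataFrame that is missing the "distance" column, as in the second example
    val df = Seq((51, 42), (51, 42), (89, 44)).toDF("store", "prod")
    
    // Columns the query would like to select
    val wanted = Seq("store", "prod", "distance")
    
    // Keep only the requested columns that actually exist; "distance" is silently dropped
    df.select(df.columns.intersect(wanted).map(col): _*).show()
    // +-----+----+
    // |store|prod|
    // +-----+----+
    // |   51|  42|
    // |   51|  42|
    // |   89|  44|
    // +-----+----+

    For the SQL side, the same intersected list can be used to build the statement string (e.g. df.columns.intersect(wanted).mkString(", ")) before passing it to spark.sql against a temp view.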