apache-spark | hive | spark-shell

How to filter out all null values from all the columns of a table in one go using Spark-shell?


I am using Spark shell 1.6. I want to separate all the rows that contain null values from the ones that don't; more precisely, I have to segregate them into two different tables (data and error). The problem is that I have too many columns (42), so checking them one at a time is out of the question. Some points which may help:

  1. The schema of the data contains bigint, int, string and date types.
  2. Tables used are all Hive tables (a loading sketch follows this list).
  3. I have the data as CSV as well (but using spark-shell is a must).
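
For context, the table is read into the shell roughly like this; mydb.mytable below is only a placeholder for the real database.table name:

    // spark-shell built with Hive support exposes a HiveContext as sqlContext
    val data = sqlContext.table("mydb.mytable")

    // inspect the 42 columns and their bigint/int/string/date types
    data.printSchema()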

Solution

  • scala> val data = Seq(("1","one",null),(null,"2",null),("3",null,null),(null,null,null),("4","four","IV")).toDF
    scala> data.show
    +----+----+----+
    |  _1|  _2|  _3|
    +----+----+----+
    |   1| one|null|
    |null|   2|null|
    |   3|null|null|
    |null|null|null|
    |   4|four|  IV|
    +----+----+----+
    scala> val error = data.filter(x => x.anyNull)
    scala> error.show
    +----+----+----+
    |  _1|  _2|  _3|
    +----+----+----+
    |   1| one|null|
    |null|   2|null|
    |   3|null|null|
    |null|null|null|
    +----+----+----+
    scala> val correct_data = data.except(error)
    scala> correct_data.show
    +---+----+---+
    | _1|  _2| _3|
    +---+----+---+
    |  4|four| IV|
    +---+----+---+
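
  • A note on Spark 1.6: filter(x => x.anyNull) is the typed (Dataset-style) filter and may not compile against a plain 1.6 DataFrame, which only offers Column/SQL-string filters, and except follows SQL EXCEPT semantics, so it also collapses duplicate rows while computing the set difference. A sketch of an alternative that stays in the Column API and builds one predicate over all 42 columns at once (nothing here is column-name specific; data.columns supplies the names):

    import org.apache.spark.sql.functions.col

    // a single predicate: true when at least one column in the row is null
    val hasNull = data.columns.map(c => col(c).isNull).reduce(_ || _)

    val error        = data.filter(hasNull)     // rows with at least one null
    val correct_data = data.filter(!hasNull)    // rows with no nulls at all

    Splitting with two complementary filters keeps duplicate rows intact, and correct_data could equivalently be obtained with data.na.drop(), which drops every row containing any null value.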