scalaapache-sparkapache-spark-mllibapache-spark-mlspark-csv

Spark DataFrame handing empty String in OneHotEncoder


I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applied the OneHotEncoder, the application crashes with error requirement failed: Cannot have an empty string for name.. Is there a way I can get around this?

I could reproduce the error in the example provided on Spark ml page:

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),         //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.show()

It is annoying since missing/empty values is a highly generic case.

Thanks in advance, Nikhil


Solution

  • Since the OneHotEncoder/OneHotEncoderEstimator does not accept empty string for name, or you'll get the following error :

    java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33) at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32) [...]

    This is how I will do it : (There is other way to do it, rf. @Anthony 's answer)

    I'll create an UDF to process the empty category :

    import org.apache.spark.sql.functions._
    
    def processMissingCategory = udf[String, String] { s => if (s == "") "NA"  else s }
    

    Then, I'll apply the UDF on the column :

    val df = sqlContext.createDataFrame(Seq(
       (0, "a"),
       (1, "b"),
       (2, "c"),
       (3, ""),         //<- original example has "a" here
       (4, "a"),
       (5, "c")
    )).toDF("id", "category")
      .withColumn("category",processMissingCategory('category))
    
    df.show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      NA|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+
    

    Now, you can go back to your transformations

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
    val indexed = indexer.transform(df)
    indexed.show
    // +---+--------+-------------+
    // | id|category|categoryIndex|
    // +---+--------+-------------+
    // |  0|       a|          0.0|
    // |  1|       b|          2.0|
    // |  2|       c|          1.0|
    // |  3|      NA|          3.0|
    // |  4|       a|          0.0|
    // |  5|       c|          1.0|
    // +---+--------+-------------+
    
    // Spark <2.3
    // val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    // Spark +2.3
    val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("category2Vec"))
    val encoded = encoder.transform(indexed)
    
    encoded.show
    // +---+--------+-------------+-------------+
    // | id|category|categoryIndex|  categoryVec|
    // +---+--------+-------------+-------------+
    // |  0|       a|          0.0|(3,[0],[1.0])|
    // |  1|       b|          2.0|(3,[2],[1.0])|
    // |  2|       c|          1.0|(3,[1],[1.0])|
    // |  3|      NA|          3.0|    (3,[],[])|
    // |  4|       a|          0.0|(3,[0],[1.0])|
    // |  5|       c|          1.0|(3,[1],[1.0])|
    // +---+--------+-------------+-------------+
    

    EDIT:

    @Anthony 's solution in Scala :

    df.na.replace("category", Map( "" -> "NA")).show
    // +---+--------+
    // | id|category|
    // +---+--------+
    // |  0|       a|
    // |  1|       b|
    // |  2|       c|
    // |  3|      NA|
    // |  4|       a|
    // |  5|       c|
    // +---+--------+
    

    I hope this helps!