Tags: r, na, sparkr

Why is SparkR's dropna() not giving me the desired output?


I have applied the following code to the airquality dataset available in R, which contains some missing values. I want to drop the rows that have NAs:

library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')

sc <- sparkR.init("local",sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")

sqlContext <- sparkRSQL.init(sc)

path <- "/Users/devesh/work/airquality/"

aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv", header="true", inferSchema="true")

head(dropna(aq,how="any"))
Ozone Solar_R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The NAs still exist in the output. Am I missing something here?


Solution

  • Missing values in native R are represented by the logical constant <NA>. SparkR DataFrames represent missing values with NULL. If you use createDataFrame() to turn a local R data.frame into a distributed SparkR DataFrame, SparkR automatically converts <NA> to NULL. However, if you create a SparkR DataFrame by reading data from a file with read.df(), the file may contain the literal string "NA" rather than a true <NA> missing-value marker. The string "NA" is not automatically converted to NULL, so dropna() does not treat it as a missing value.
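
    The createDataFrame() path can be sketched as follows. This is a minimal illustration, assuming an active SparkR session with a `sqlContext` as in the question; the sample values are made up:

    ```r
    # A local R data.frame with a genuine <NA> missing value
    local_df <- data.frame(Ozone = c(41, NA, 12),
                           Wind  = c(7.4, 14.3, 12.6))

    # createDataFrame() converts <NA> to NULL in the SparkR DataFrame
    sdf <- createDataFrame(sqlContext, local_df)

    # Now dropna() recognizes and drops the row with the missing value
    head(dropna(sdf, how = "any"))
    ```

    With data read via read.df() from a CSV, the "NA" cells arrive as plain strings instead, which is why dropna() leaves them in place.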

    If you have "NA" strings in your CSV, you can filter them out rather than relying on dropna():

    filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")

    head(filtered_aq)
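
    Alternatively, newer versions of the spark-csv package (per its README; this option may not exist in 1.2.0, so check your version) accept a nullValue option that tells the reader which string to treat as null. With it, the "NA" cells become real NULLs at read time and dropna() works as expected:

    ```r
    # Sketch assuming a spark-csv version that supports the nullValue option:
    # cells matching "NA" are loaded as NULL instead of as strings.
    aq <- read.df(sqlContext, path,
                  source = "com.databricks.spark.csv",
                  header = "true", inferSchema = "true",
                  nullValue = "NA")

    # NULLs are now genuine missing values, so dropna() removes those rows
    head(dropna(aq, how = "any"))
    ```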