[SOLVED] Pyspark converting rdd to dataframe with nulls

I am using pyspark (1.6) and elasticsearch-hadoop (5.1.1). I am reading my data from elasticsearch into an RDD via:

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)

Here es_read_conf is just a dictionary describing my ES cluster, and sc is the SparkContext object. This works fine and I get the RDD back without problems.

I'd like to convert this to a dataframe using

df = es_rdd.toDF()

but I get the error:

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Giving the toDF method a sampleSize results in the same error. From what I understand, this is occurring because pyspark is unable to determine the type of each field. I know that there are fields in my elasticsearch cluster that are all null.

What is the best way to convert this to a dataframe?

Solution

The best way is to tell Spark the types of the data you are converting, so it does not have to infer them from rows where a field is always null. See the documentation for createDataFrame, in particular the fifth example (the one with a StructType inside).
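
As a rough sketch of what that could look like for the RDD in the question: the field names and types below (name, age, comment) are made up for illustration and need to be replaced with the fields of your own index, and the helper names doc_to_tuple and es_rdd_to_df are hypothetical too. It assumes each element of es_rdd is a (document id, dict of fields) pair, which is what EsInputFormat normally gives you in pyspark.

```python
def doc_to_tuple(pair, field_names):
    """Turn one (doc_id, fields_dict) pair into a tuple ordered like the schema.

    Fields that are missing from the document become None.
    """
    _doc_id, doc = pair
    return tuple(doc.get(name) for name in field_names)


def es_rdd_to_df(sql_context, es_rdd):
    """Build a DataFrame from the ES RDD using an explicit schema."""
    # Imported here so doc_to_tuple can be tried out without Spark installed.
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical fields: replace with the real fields and types of your index.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", LongType(), True),
        StructField("comment", StringType(), True),  # may be null in every row
    ])
    field_names = [field.name for field in schema.fields]

    # Drop the document id and line the values up with the schema order.
    rows = es_rdd.map(lambda pair: doc_to_tuple(pair, field_names))

    # With an explicit schema, Spark does not need to infer types from the data.
    return sql_context.createDataFrame(rows, schema)
```

You would then call something like df = es_rdd_to_df(sqlContext, es_rdd). Because every field is declared nullable, a column that is null in every row is no longer a problem. The values in each document do have to match the declared types, otherwise createDataFrame will complain when it checks the rows.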