I am using pyspark (1.6) and elasticsearch-hadoop (5.1.1). I am getting my data from Elasticsearch into an RDD via:
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
Here es_read_conf is just a dictionary of connection settings for my ES cluster, and sc is the SparkContext object. This works fine and I get the RDD objects back as expected.
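For reference, es_read_conf looks roughly like this (the host, port, and "index/type" resource are placeholders for my cluster's actual values):

es_read_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "my_index/my_type",
}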
I'd like to convert this to a dataframe using
df = es_rdd.toDF()
but I get the error:
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
Giving the toDF method a sampleRatio results in the same error. From what I understand, this occurs because pyspark is unable to determine the type of each field. I know that there are fields in my Elasticsearch cluster that are all null.
What is the best way to convert this to a dataframe?
The best way is to tell Spark the types of the data you are converting. Please see the documentation for createDataFrame, in particular the fifth example (the one with a StructType inside).
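As a minimal sketch of that approach, assuming the es_rdd from your question (the field names title, body, and always_null_field are hypothetical; substitute the fields your ES documents actually contain):

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sqlContext = SQLContext(sc)  # sc is the SparkContext from the question

# Declare the schema up front so Spark never has to infer types --
# this is what keeps all-null fields from breaking inference.
schema = StructType([
    StructField("title", StringType(), True),
    StructField("body", StringType(), True),
    StructField("always_null_field", StringType(), True),  # typed explicitly even if every value is null
])

# es_rdd yields (doc_id, doc) pairs, where doc behaves like a dict.
# Project the fields out in the same order as the schema.
rows = es_rdd.map(lambda kv: (
    kv[1].get("title"),
    kv[1].get("body"),
    kv[1].get("always_null_field"),
))

df = sqlContext.createDataFrame(rows, schema)

With an explicit StructType, the all-null fields are simply null values of the declared type, so no sampling or inference is needed.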