java python apache-spark pyspark spark-hive

HiveContext createDataFrame not working on pySpark (jupyter)


I am doing an analysis in pySpark using Jupyter notebooks. My code originally built dataframes using sqlContext = SQLContext(sc), but I have now switched to HiveContext since I will be using window functions.

My problem is that now I'm getting a Java error when trying to create the dataframe:

## Create new SQL Context.
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import DataFrame
from pyspark.sql import Window
from pyspark.sql.types import *
import pyspark.sql.functions as func

sqlContext = HiveContext(sc)

After this I read my data into an RDD, and create the schema for my DF.
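The loading step looks roughly like this (the file path and delimiter here are simplified placeholders, not my actual values):

## Read the raw text file, keep the header for column names, and split the remaining rows.
raw = sc.textFile("data/input.csv")
header_line = raw.first()
data_header = header_line.split(",")
data_tmp = (raw
            .filter(lambda line: line != header_line)
            .map(lambda line: line.split(",")))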

## After loading the data we define the schema.
fields = [StructField(field_name, StringType(), True) for field_name in data_header]
schema = StructType(fields)

Now, when I try to build the DF, this is the error I get:

## Build the DF.
data_df = sqlContext.createDataFrame(data_tmp, schema)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
/home/scala/spark-1.6.1/python/pyspark/sql/context.pyc in _get_hive_ctx(self)
    690 
    691     def _get_hive_ctx(self):
--> 692         return self._jvm.HiveContext(self._jsc.sc())
    693 
    694     def refreshTable(self, tableName):

TypeError: 'JavaPackage' object is not callable

I have been googling it without luck so far. Any advice is greatly appreciated.


Solution

  • HiveContext requires binaries built with Hive support. This means you have to enable the Hive profile. Since you use sbt assembly, you need at least:

    sbt -Phive assembly
    

    The same is required when building with Maven, for example:

    mvn -Phive -DskipTests clean package
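
    Once Spark is rebuilt with the Hive profile and the notebook points at that build, the snippet from the question should work, and the window functions that motivated the switch become available. A minimal sanity check looks something like this (the sample data and column names are just for illustration):

    from pyspark.sql import HiveContext, Window
    import pyspark.sql.functions as func

    ## Should no longer raise "'JavaPackage' object is not callable".
    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame(
        [("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

    ## A simple window function to confirm Hive support is available.
    w = Window.partitionBy("group").orderBy("value")
    df.select("group", "value",
              func.row_number().over(w).alias("rank")).show()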