java python apache-spark pyspark spark-hive

HiveContext createDataFrame not working on pySpark (jupyter)


I am doing an analysis in pySpark using Jupyter notebooks. My code originally built dataframes using sqlContext = SQLContext(sc), but I have now switched to HiveContext since I will be using window functions.

My problem is that now I'm getting a Java error when trying to create the dataframe:

## Create new SQL Context.
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import DataFrame
from pyspark.sql import Window
from pyspark.sql.types import *
import pyspark.sql.functions as func

sqlContext = HiveContext(sc)

After this I read my data into an RDD, and create the schema for my DF.
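The loading step looks roughly like this (the file path and delimiter here are simplified placeholders, not my actual values):

## Read the raw text file, keep the header for column names, and split the remaining rows.
raw = sc.textFile("data/input.csv")
header_line = raw.first()
data_header = header_line.split(",")
data_tmp = (raw
            .filter(lambda line: line != header_line)
            .map(lambda line: line.split(",")))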

## After loading the data we define the schema.
fields = [StructField(field_name, StringType(), True) for field_name in data_header]
schema = StructType(fields)

Now, when I try to build the DF, this is the error I get:

## Build the DF.
data_df = sqlContext.createDataFrame(data_tmp, schema)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
/home/scala/spark-1.6.1/python/pyspark/sql/context.pyc in _get_hive_ctx(self)
    690 
    691     def _get_hive_ctx(self):
--> 692         return self._jvm.HiveContext(self._jsc.sc())
    693 
    694     def refreshTable(self, tableName):

TypeError: 'JavaPackage' object is not callable

I have been googling it without luck so far. Any advice is greatly appreciated.


Solution

  • HiveContext requires binaries built with Hive support. This means you have to enable the Hive profile. Since you use sbt assembly, you need at least:

    sbt -Phive assembly
    

    The same is required when building with Maven, for example:

    mvn -Phive -DskipTests clean package
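
    Once Spark is rebuilt with the Hive profile and the notebook points at that build, the snippet from the question should work, and the window functions that motivated the switch become available. A minimal sanity check looks something like this (the sample data and column names are just for illustration):

    from pyspark.sql import HiveContext, Window
    import pyspark.sql.functions as func

    ## Should no longer raise "'JavaPackage' object is not callable".
    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame(
        [("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

    ## A simple window function to confirm Hive support is available.
    w = Window.partitionBy("group").orderBy("value")
    df.select("group", "value",
              func.row_number().over(w).alias("rank")).show()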