I am doing an analysis in pySpark using Jupyter notebooks. My code originally built dataframes using sqlContext = SQLContext(sc), but I've now switched to HiveContext since I will be using window functions.
My problem is that now I'm getting a Java error when trying to create the dataframe:
## Create new SQL context (now a HiveContext, needed for window functions).
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import DataFrame
from pyspark.sql import Window
from pyspark.sql.types import *
import pyspark.sql.functions as func
sqlContext = HiveContext(sc)
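For context, the window-function usage that motivated the switch is along these lines (a hypothetical sketch; the column names are placeholders, not my actual data):

## Hypothetical: rank rows within each group_col partition by value_col,
## using data_df, the dataframe I am trying to build below.
w = Window.partitionBy("group_col").orderBy(func.col("value_col").desc())
ranked_df = data_df.withColumn("rank", func.rank().over(w))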
After this I read my data into an RDD, and create the schema for my DF.
## After loading the data we define the schema.
fields = [StructField(field_name, StringType(), True) for field_name in data_header]
schema = StructType(fields)
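For completeness, data_header and data_tmp are built roughly like this (simplified; the file path and delimiter here are placeholders, not my real values):

## Hypothetical loading step: split a delimited text file into header and rows.
raw = sc.textFile("data.csv")
header_line = raw.first()
data_header = header_line.split(",")
data_tmp = (raw.filter(lambda line: line != header_line)
               .map(lambda line: line.split(",")))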
Now, when I try to build the DF, this is the error I get:
## Build the DF.
data_df = sqlContext.createDataFrame(data_tmp, schema)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
/home/scala/spark-1.6.1/python/pyspark/sql/context.pyc in _get_hive_ctx(self)
690
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693
694 def refreshTable(self, tableName):
TypeError: 'JavaPackage' object is not callable
I have been googling it without luck so far. Any advice is greatly appreciated.
HiveContext requires binaries built with Hive support, which means you have to enable the Hive profile when building Spark. Since you build with sbt assembly, you need at least:
sbt -Phive assembly
The same is required when building with Maven, for example:
mvn -Phive -DskipTests clean package
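(If you also need the JDBC server and CLI, add the -Phive-thriftserver profile as well.)

Once Spark has been rebuilt with Hive support, your original snippet should work. A quick sanity check (a minimal sketch, assuming sc is your existing SparkContext) is:

from pyspark.sql import HiveContext

## With Hive support compiled in, this no longer raises
## "TypeError: 'JavaPackage' object is not callable".
sqlContext = HiveContext(sc)
sqlContext.sql("SHOW TABLES").show()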