I am using Hive 1.2 and Spark 1.4.1. The following query runs perfectly fine via the Hive CLI:
hive> select row_number() over (partition by one.id order by two.id) as sk,
two.id, two.name, one.name, current_date()
from avant_source.one one
inner join avant_source.two two
on one.id = two.one_id;
but when I try to use it via HiveContext in a pyspark job it gives me an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql.
: java.lang.RuntimeException: Couldn't find function current_date
Code snippet:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName('DFtest')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.sql("select row_number() over (partition by one.id order by two.id) as sk, two.id, two.name, one.name, current_date() from avant_source.one one inner join avant_source.two two on one.id = two.one_id")
df.show()
sc.stop()
Is there a way to get the current date or timestamp in pyspark? I tried importing date and datetime, but it always throws an error saying the function was not found.
I also tried to use current_date with DataFrames in a pyspark 1.5 Sandbox, but there I get a different error.
from pyspark.sql.functions import current_date, date_sub

df = sqlContext.createDataFrame([(current_date,)], ['d'])
df.select(date_sub(df.d, 1).alias('d')).collect()
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/mapr/spark/spark-1.5.2/python/pyspark/sql/dataframe.py", line 769, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/opt/mapr/spark/spark-1.5.2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/mapr/spark/spark-1.5.2/python/pyspark/sql/utils.py", line 40, in deco
raise AnalysisException(s.split(': ', 1)[1])
pyspark.sql.utils.AnalysisException: cannot resolve 'datesub(d,1)' due to data type mismatch: argument 1 requires date type, however, 'd' is of struct<> type.;
Please advise.
For my scenario, I used the following
import datetime
from pyspark.sql.functions import lit

now = datetime.datetime.now()
df = df.withColumn('eff_start', lit(now.strftime("%Y-%m-%d")))
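Note that this workaround stamps the date as a constant string computed once on the driver, not as a per-row Spark expression. A quick plain-Python check of the format (no Spark needed, using a fixed timestamp purely for illustration):

```python
import datetime

# lit(now.strftime("%Y-%m-%d")) embeds a constant ISO-format date string
# evaluated once on the driver when the plan is built.
now = datetime.datetime(2016, 1, 15, 10, 30, 0)  # fixed value for illustration
eff_start = now.strftime("%Y-%m-%d")
print(eff_start)  # → 2016-01-15
```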
As for the error about HiveContext not resolving Hive functions via HiveQL: it turned out to be a cluster issue. One of the nodes on which HiveServer2 was running had raised too many alarms due to memory pressure, and that was causing the failure. The same query ran successfully on a MapR Sandbox with Spark 1.5 and Hive 1.2.
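As a side note on the AnalysisException from the createDataFrame attempt: current_date was passed without parentheses, so Spark received the Python function object itself rather than a value, and schema inference produced the empty struct<> type the error complains about. The same mistake can be seen in plain Python with the standard datetime module:

```python
import datetime

# A function reference is not the same thing as the value it returns;
# passing the uncalled reference is what confused Spark's type inference.
wrong = datetime.date.today    # a function object, not a date
right = datetime.date.today()  # an actual date value

print(callable(wrong), isinstance(right, datetime.date))  # → True True
```

On Spark 1.5+ the cleaner route is the built-in column expression pyspark.sql.functions.current_date(), called with parentheses, which evaluates per-query instead of baking in a driver-side constant.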