I processed data using pySpark
and sqlContext
using the following query:
(sqlContext.sql("select LastUpdate,Count(1) as Count" from temp_t)
.rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))
It is stored in the following format:
Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
But I want to store the data in a Hive table as
LastUpdate Count
2016-03-14 12:27:55.01 1
. .
. .
Here is how I create the table in Hive:
CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';
I tried many options but was not successful. Please help me on this.
Why not load the data into Hive itself, without going through the process of saving the file and then loading it to hive.
from datetime import datetime, date, time, timedelta
hiveCtx = HiveContext(sc)
#Create sample data
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
delta = timedelta(days=1)
futureTime = currTime + delta
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]
#parallelize the list and convert to dataframe
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")
aggRDD = hiveCtx.sql("select LastUpdate,Count(1) as Count from temp_t group by LastUpdate")
aggRDD.saveAsTable("Data_Count")