I have a Spark DataFrame that contains Timestamps and Machine Ids. I wish to remove the lowest timestamp value from each group. I tried the following code:
sqlC <- sparkRHive.init(sc)
ts_df2<- sql(sqlC,"SELECT ts,Machine FROM sdf2 EXCEPT SELECT MIN(ts),Machine FROM sdf2 GROUP BY Machine")
But it fails with the following error:
16/04/06 06:47:52 ERROR RBackendHandler: sql on 35 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: missing EOF at 'SELECT' near 'EXCEPT'; line 1 pos 35
What is the problem? If HiveContext does not support the EXCEPT keyword, what would be an equivalent way of doing the same thing in HiveContext?
The programming guide for Spark 1.6.1 lists the supported and unsupported Hive features: http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
I don't see EXCEPT in either category, and I have seen elsewhere that Hive QL doesn't support EXCEPT, or at least did not at that time (see the question "Hive QL Except clause").
Perhaps build a table of the per-machine minimums (e.g. register SELECT Machine, MIN(ts) AS ts FROM sdf2 GROUP BY Machine as mins) and then do a left outer join as in that answer, keeping only the rows that found no match:

SELECT sdf2.ts, sdf2.Machine
FROM sdf2
LEFT OUTER JOIN mins
  ON (sdf2.Machine = mins.Machine AND sdf2.ts = mins.ts)
WHERE mins.ts IS NULL;
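The anti-join pattern (left outer join against the per-group minimums, keeping rows with no match) is plain SQL, so it can be checked outside Spark. Here is a minimal sketch using Python's sqlite3 as a stand-in for HiveQL; the table and column names (sdf2, ts, Machine) follow the question, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sdf2 (ts INTEGER, Machine TEXT)")
cur.executemany("INSERT INTO sdf2 VALUES (?, ?)",
                [(1, "A"), (2, "A"), (3, "A"), (5, "B"), (7, "B")])

# Derive the per-machine minimums as a subquery, anti-join against
# them, and keep only rows that found no match (mins.ts IS NULL),
# i.e. every row except the lowest ts of each Machine.
rows = cur.execute("""
    SELECT sdf2.ts, sdf2.Machine
    FROM sdf2
    LEFT OUTER JOIN (SELECT Machine, MIN(ts) AS ts
                     FROM sdf2
                     GROUP BY Machine) mins
      ON sdf2.Machine = mins.Machine AND sdf2.ts = mins.ts
    WHERE mins.ts IS NULL
    ORDER BY sdf2.Machine, sdf2.ts
""").fetchall()
print(rows)  # -> [(2, 'A'), (3, 'A'), (7, 'B')]
```

The lowest timestamp per machine (1 for A, 5 for B) is dropped, which is exactly what the EXCEPT query was trying to express.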
You can also use the SparkR built-in function except(), though I think you would need to create your mins DataFrame first:
exceptDF <- except(df, df2)
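As a sanity check on the semantics except(df, df2) should produce, the same set difference can be run in an engine that does support EXCEPT. This sketch uses Python's sqlite3 with made-up sample data (not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sdf2 (ts INTEGER, Machine TEXT)")
cur.executemany("INSERT INTO sdf2 VALUES (?, ?)",
                [(1, "A"), (2, "A"), (5, "B"), (7, "B")])

# All rows EXCEPT the per-machine minimum rows -- the result that
# except(df, mins) would return in SparkR.
rows = cur.execute("""
    SELECT ts, Machine FROM sdf2
    EXCEPT
    SELECT MIN(ts), Machine FROM sdf2 GROUP BY Machine
    ORDER BY Machine, ts
""").fetchall()
print(rows)  # -> [(2, 'A'), (7, 'B')]
```

Note that, like SparkR's except(), SQL EXCEPT compares whole rows and removes duplicates, so it matches the anti-join only when (ts, Machine) pairs are unique.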