I'm experimenting with Spark on Hive. In my code I create a new DataFrame
and fill it with custom data using the HiveContext.createDataFrame
method:
JavaSparkContext sc = ...;
HiveContext hiveCtx = new HiveContext(sc);
StructField f1 = new StructField("columnA", DataTypes.StringType, false, null);
StructField f2 = new StructField("columnB", DataTypes.StringType, false, null);
StructType st = new StructType(new StructField[] {f1, f2});
Row r1 = RowFactory.create("A", "B");
Row r2 = RowFactory.create("C", "D");
List<Row> allRows = new ArrayList<Row>();
allRows.add(r1);
allRows.add(r2);
DataFrame testDF = hiveCtx.createDataFrame(allRows, st);
testDF.explain(); // print the physical plan of the DF
for (String col : testDF.columns()) { // list the columns; all seems to be OK here
    System.out.println(col);
}
Column columnA = testDF.col("columnA"); // get the column --> exception!!!
...
When I run the code above with spark-submit, I get the following output:
=== APP RUNNING ===
17/03/13 12:20:29 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/03/13 12:20:29 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/03/13 12:20:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
17/03/13 12:20:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
17/03/13 12:20:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/03/13 12:20:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/03/13 12:20:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/03/13 12:20:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/03/13 12:20:33 INFO metastore: Trying to connect to metastore with URI thrift://my-server-url:9083
17/03/13 12:20:33 INFO metastore: Connected to metastore.
== Physical Plan ==
LocalTableScan [columnA#0,columnB#1], [[A,B],[C,D]]
columnA
columnB
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:218)
at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
at scala.Tuple2.hashCode(Tuple2.scala:19)
at scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:391)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at scala.collection.mutable.FlatHashTable$class.findEntryImpl(FlatHashTable.scala:123)
at scala.collection.mutable.FlatHashTable$class.containsEntry(FlatHashTable.scala:119)
at scala.collection.mutable.HashSet.containsEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.contains(HashSet.scala:58)
at scala.collection.GenSetLike$class.apply(GenSetLike.scala:43)
at scala.collection.mutable.AbstractSet.apply(Set.scala:45)
at scala.collection.SeqLike$$anonfun$distinct$1.apply(SeqLike.scala:494)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
at scala.collection.AbstractSeq.distinct(Seq.scala:40)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at temp.HiveTest.main(HiveTest.java:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Here is my spark-submit call:
spark-submit --class temp.HiveTest --master yarn --deploy-mode client /home/daniel/application.jar
Why does the call to DataFrame.col(...) throw a NullPointerException?
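The stack trace shows the exception is thrown in AttributeReference.hashCode, which suggests that the null passed as the StructField metadata is dereferenced there. Try changing null to Metadata.empty():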
StructField f1 = new StructField("columnA", DataTypes.StringType, false, Metadata.empty());
StructField f2 = new StructField("columnB", DataTypes.StringType, false, Metadata.empty());
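
To illustrate why listing the columns works while col(...) fails, here is a minimal sketch in plain Java. It does not use Spark: the Field class is a hypothetical stand-in, and the assumption that hashCode dereferences the metadata without a null check is inferred from the stack trace, not taken from Spark's source. Reading the name never touches the metadata, but putting the object into a hash-based set (as distinct does in the trace) calls hashCode and fails.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class NullMetadataSketch {

    // Hypothetical stand-in for a schema field; this is NOT Spark's StructField.
    static final class Field {
        final String name;
        final Object metadata;

        Field(String name, Object metadata) {
            this.name = name;
            this.metadata = metadata;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Field)) {
                return false;
            }
            Field other = (Field) o;
            return name.equals(other.name) && Objects.equals(metadata, other.metadata);
        }

        // Assumption: metadata is dereferenced without a null check,
        // so a null metadata makes hashing throw a NullPointerException.
        @Override
        public int hashCode() {
            return 37 * name.hashCode() + metadata.hashCode();
        }
    }

    // Returns true if adding the field to a hash-based set throws a NullPointerException.
    static boolean hashingFails(Field f) {
        Set<Field> seen = new HashSet<>();
        try {
            seen.add(f); // HashSet calls f.hashCode() here
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Field broken = new Field("columnA", null);
        Field fixed = new Field("columnA", new Object());

        System.out.println(broken.name);          // reading the name never touches metadata
        System.out.println(hashingFails(broken)); // true: null metadata breaks hashing
        System.out.println(hashingFails(fixed));  // false: non-null metadata hashes fine
    }
}
```

In Spark itself, the three-argument factory method DataTypes.createStructField(name, dataType, nullable) fills in empty metadata for you, so you can avoid passing the fourth argument altogether.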