I have an Ammonite script which creates a Spark context:
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.0.1`
import org.apache.spark.{SparkConf, SparkContext}
@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
When I run this script, it throws an error:
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: org.apache.spark.SparkException: Error while locating file spark-version-info.properties
...
Caused by: java.lang.NullPointerException
at java.util.Properties$LineReader.readLine(Properties.java:434)
at java.util.Properties.load0(Properties.java:353)
The script isn't being run from the Spark installation directory and has no knowledge of it or of the resources where this version information is packaged - it only knows about the ivy dependencies. So perhaps the issue is that this resource isn't on the classpath provided by the ivy dependencies. I have seen other Spark "standalone scripts", so I was hoping I could do the same here.
I poked around a bit to try and understand what was happening. I was hoping I could programmatically hack some build information into the system properties at runtime.
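A quick way to sanity-check that theory (just a diagnostic sketch of mine, nothing Spark-specific) is to ask the context class loader for the resource directly and see whether it comes back null:

// Diagnostic sketch: is spark-version-info.properties visible to the context class loader?
val versionInfoUrl = Thread.currentThread().getContextClassLoader
  .getResource("spark-version-info.properties")
println(s"spark-version-info.properties found: ${versionInfoUrl != null}")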
The source of the exception is package.scala in the Spark library. The relevant bits of code are:
val resourceStream = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("spark-version-info.properties")

try {
  val unknownProp = "<unknown>"
  val props = new Properties()
  props.load(resourceStream)   // <--- causing a NPE?
  (
    props.getProperty("version", unknownProp),
    // Load some other properties
  )
} catch {
  case npe: NullPointerException =>
    throw new SparkException("Error while locating file spark-version-info.properties", npe)
It seems the implicit assumption is that props.load will fail with an NPE if the version information can't be found in the resources. (That's not so clear to the reader!)
The NPE itself looks like it's coming from this code in java.util.Properties.java:
class LineReader {
    public LineReader(InputStream inStream) {
        this.inStream = inStream;
        inByteBuf = new byte[8192];
    }

    ...

    InputStream inStream;
    Reader reader;

    int readLine() throws IOException {
        ...
        inLimit = (inStream == null) ? reader.read(inCharBuf)
                                     : inStream.read(inByteBuf);
The LineReader is constructed with a null InputStream, which the class internally interprets as meaning that the reader is non-null and should be used instead - but the reader is also null. (Is this kind of stuff really in the standard library? Seems very unsafe...)
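That failure mode is easy to reproduce in isolation (a minimal sketch of my own, not taken from the Spark or JDK sources): passing a null InputStream straight to Properties.load is exactly what happens when getResourceAsStream finds nothing, and it fails with the same NPE:

import java.io.InputStream
import java.util.Properties

// Simulate a missing resource: getResourceAsStream would return null here,
// and Properties.load(null) fails inside LineReader.readLine with an NPE.
val missingResource: InputStream = null
try {
  new Properties().load(missingResource)
} catch {
  case npe: NullPointerException => println(s"same NPE as above: $npe")
}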
From looking at the bin/spark-shell that comes with Spark, it adds -Dscala.usejavacp=true when it launches spark-submit. Is this the right direction?
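If it is, one thing I could try (purely speculative on my part and untested - the property name comes from spark-shell, not from anything I know about Ammonite) is setting that property programmatically at the top of main, before the context is created:

import org.apache.spark.{SparkConf, SparkContext}

// Speculative sketch: mirror spark-shell's -Dscala.usejavacp=true from inside the script.
sys.props("scala.usejavacp") = "true"
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))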
Thanks for your help!
The following seems to work on 2.11 with the 1.0.1 version, but not the experimental one. It could just be that this is better implemented in Spark 2.2:
#!/usr/local/bin/amm
import ammonite.ops._
import $ivy.`org.apache.spark:spark-core_2.11:2.2.0`
import $ivy.`org.apache.spark:spark-sql_2.11:2.2.0`
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession
@main
def main(): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}
Or, a more expanded answer:
@main
def main(): Unit = {
  val spark = SparkSession.builder()
    .appName("testings")
    .master("local")
    .config("configuration key", "configuration value")
    .getOrCreate

  val sqlContext = spark.sqlContext

  val tdf2 = spark.read.option("delimiter", "|").option("header", true).csv("./tst.dat")
  tdf2.show()
}
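One small follow-up on the SparkSession route (my addition, not part of the original answer): the session already wraps a SparkContext, so there is no need to build one separately as in the first snippet, and stopping the session at the end keeps the script from leaving the local cluster running. These lines would go at the end of main above:

// The session owns the context; reuse it instead of newing up a SparkContext.
val sc = spark.sparkContext
println(s"Spark version: ${spark.version}")

// Shut down cleanly when the script is done.
spark.stop()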