scala, apache-spark, ammonite

Spark unable to find "spark-version-info.properties" when run from ammonite script


I have an ammonite script which creates a spark context:

#!/usr/local/bin/amm

import ammonite.ops._

import $ivy.`org.apache.spark:spark-core_2.11:2.0.1`

import org.apache.spark.{SparkConf, SparkContext}

@main
def main(): Unit = {
  val sc =  new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
}

When I run this script, it throws an error:

Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: org.apache.spark.SparkException: Error while locating file spark-version-info.properties
...
Caused by: java.lang.NullPointerException
    at java.util.Properties$LineReader.readLine(Properties.java:434)
    at java.util.Properties.load0(Properties.java:353)

The script isn't being run from the Spark installation directory and has no knowledge of it, or of the resources where this version information is packaged; it only knows about the ivy dependencies. So perhaps the issue is that this resource isn't on the classpath that the ivy dependencies provide. I have seen other Spark "standalone scripts", so I was hoping I could do the same thing here.
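A quick way to test that theory is to ask the same classloader for the resource directly; if the call returns null, the file simply isn't on the classpath Ammonite assembled from the ivy dependencies. This is just a probe one could drop into the script, nothing Spark-specific:

// Classpath probe: getResourceAsStream returns null when the resource is not
// visible to the context classloader that Spark queries in package.scala.
val probe = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("spark-version-info.properties")

if (probe == null) println("spark-version-info.properties is NOT on the classpath")
else { println("spark-version-info.properties found"); probe.close() }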


I poked around a bit to try to understand what was happening. I was hoping I could programmatically hack some build information into the system properties at runtime.
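For concreteness, here is a rough, untested sketch of that idea (the directory name and version value are mine). Spark actually reads a classpath resource rather than a system property, so the sketch fabricates the properties file and prepends its directory to the context classloader before the SparkContext is created:

import java.io.{File, PrintWriter}
import java.net.URLClassLoader

// Untested workaround sketch: write a stand-in spark-version-info.properties
// and make its directory visible to the context classloader Spark queries.
val dir = new File(System.getProperty("java.io.tmpdir"), "spark-version-hack")
dir.mkdirs()
val out = new PrintWriter(new File(dir, "spark-version-info.properties"))
out.println("version=2.0.1") // a guess, matching the spark-core dependency above
out.close()

Thread.currentThread().setContextClassLoader(
  new URLClassLoader(Array(dir.toURI.toURL), Thread.currentThread().getContextClassLoader))
// ...then create the SparkContext as before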

The exception originates from package.scala in the Spark library. The relevant bits of code are:

val resourceStream = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("spark-version-info.properties")

try {
  val unknownProp = "<unknown>"
  val props = new Properties()
  props.load(resourceStream) // <--- causing an NPE?
  (
    props.getProperty("version", unknownProp),
    // Load some other properties
  )
} catch {
  case npe: NullPointerException =>
    throw new SparkException("Error while locating file spark-version-info.properties", npe)
}

It seems that the implicit assumption is that props.load will fail with an NPE if the version information can't be found in the resources. (That's not so clear to the reader!)
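That assumption is easy to reproduce in isolation: getResourceAsStream returns null for a missing resource, and Properties.load then trips over the null stream. A minimal demonstration:

import java.io.InputStream
import java.util.Properties

// A missing resource yields a null InputStream, and Properties.load(null)
// throws the NullPointerException seen in the stack trace above.
val missing: InputStream = Thread.currentThread().getContextClassLoader.
  getResourceAsStream("definitely-not-there.properties") // null

try new Properties().load(missing)
catch { case e: NullPointerException => println(s"load threw: $e") }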

The NPE itself looks like it's coming from this code in java.util.Properties.java:

class LineReader {
    public LineReader(InputStream inStream) {
        this.inStream = inStream;
        inByteBuf = new byte[8192];
    }

    ...
    InputStream inStream;
    Reader reader;

    int readLine() throws IOException {
      ...
      inLimit = (inStream==null)?reader.read(inCharBuf)
                                :inStream.read(inByteBuf);

The LineReader is constructed with a null InputStream which the class internally interprets as meaning that the reader is non-null and should be used instead - but it's also null. (Is this kind of stuff really in the standard library? Seems very unsafe...)


Looking at the bin/spark-shell script that ships with Spark, it adds -Dscala.usejavacp=true when it launches spark-submit. Is this the right direction?
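I'm not sure that flag is even relevant here (as I understand it, it tells the Scala REPL to use the JVM's classpath), but if it helps to experiment with it from inside the Ammonite script, one untested option is to set the property before anything Spark-related runs:

// Untested experiment: mirror the -Dscala.usejavacp=true that bin/spark-shell
// passes to spark-submit; whether this changes anything under Ammonite is
// exactly the open question.
System.setProperty("scala.usejavacp", "true")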

Thanks for your help!


Solution

  • The following seems to work on Scala 2.11 with Ammonite 1.0.1, but not with the experimental build.

    This may simply be handled better in Spark 2.2.

    #!/usr/local/bin/amm
    
    import ammonite.ops._
    import $ivy.`org.apache.spark:spark-core_2.11:2.2.0` 
    import $ivy.`org.apache.spark:spark-sql_2.11:2.2.0`
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql._ 
    import org.apache.spark.sql.SparkSession
    
    @main
    def main(): Unit = {
        val sc =  new SparkContext(new SparkConf().setMaster("local[2]").setAppName("Demo"))
    }
    

    Or, a more expanded version using SparkSession:

    @main
    def main(): Unit = {
      val spark = SparkSession.builder()
        .appName("testings")
        .master("local")
        .config("configuration key", "configuration value")
        .getOrCreate()
      val sqlContext = spark.sqlContext
      val tdf2 = spark.read.option("delimiter", "|").option("header", true).csv("./tst.dat")
      tdf2.show()
    }
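
    Either version runs directly with Ammonite, for example (assuming the script is saved as spark-demo.sc):

    amm spark-demo.sc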