java scala maven apache-spark hadoop

Spark unit tests fail under Maven but pass in IntelliJ


I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire.

I have a shared test session setup like this:

SharedSparkSession trait (base)

SharedUnitTestSparkSession (no Hive support, lightweight)

SharedHiveSparkSession (enables Hive, uses Derby metastore)

Some tests need Hive support (e.g., creating temp views, running SQL queries) and some don’t.

The Issue:

The error is always the same:

Hadoop home dir / filesystem implementation errors (HADOOP_HOME / winutils.exe issues)

I've tried:

spark.sparkContext.hadoopConfiguration
  .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])

This is set in both the unit-test and Hive traits:

trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>
  @transient protected var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()

    spark = SparkSession.builder()
      .appName("Unit Test Session")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.hadoopConfiguration
      .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
  }

  override def afterAll(): Unit = {
    if (spark != null) {
      spark.stop() // also clears the active/default sessions
      spark = null
    }
    super.afterAll()
  }
}

The Hive one is set up as follows:

trait SharedHiveSparkSession extends SharedSparkSession { this: Suite =>
  override def beforeAll(): Unit = {
    super.beforeAll()

    val tempDir = java.nio.file.Files.createTempDirectory("spark_metastore").toAbsolutePath.toString
    spark = SparkSession.builder()
      .appName("Hive Test Session")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", s"$tempDir/warehouse")
      .config("javax.jdo.option.ConnectionURL", s"jdbc:derby:metastore_db_${java.util.UUID.randomUUID()};create=true")
      .enableHiveSupport()
      .getOrCreate()

    spark.sparkContext.hadoopConfiguration
      .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
  }
}

And the unit-test one looks like this:

trait SharedUnitTestSparkSession extends SharedSparkSession { this: Suite =>
  override def beforeAll(): Unit = {
    super.beforeAll()

    spark = SparkSession.builder()
      .appName("Unit Test Spark Session")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.hadoopConfiguration
      .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
  }
}

Then the test:

class MyIntegrationTest extends AnyFlatSpec with SharedHiveSparkSession {

  "A Hive-backed DataFrame" should "perform test operation" in {
    implicit val implicitSpark: SparkSession = spark

    import spark.implicits._

    val df = Seq((1, "foo"), (2, "bar")).toDF("id", "value")
    df.createOrReplaceTempView("my_table")

    val result = spark.sql("SELECT COUNT(*) as total FROM my_table").collect().head.getLong(0)
    assert(result == 2)
  }
}

The test doesn't run at all: it fails with "Hadoop home directory is unset" errors, which I attempted to bypass with the BareLocalFileSystem implementation shown above.

What I’m looking for:

Some code suggestions, or a pointer to what I am missing in Maven or in the configuration in general, would really help :)

Versions:


Solution

  • Hadoop’s FileSystem cache was hanging around in the JVM across test suites, even after SparkContext.stop(), because by design Hadoop doesn’t automatically evict cached FileSystem instances per scheme when you shut down Spark.

    Adding FileSystem.closeAll() in beforeAll() of SharedHiveSparkSession worked for me.

    This wipes out any lingering FileSystem handles (such as LocalFileSystem) that may have been instantiated by a previous test suite's context. Since Surefire by default runs all test suites in a single forked JVM, the stale cache surfaces under Maven even when running a single suite in IntelliJ works fine.
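
The fix can be sketched as follows, reusing the `SharedHiveSparkSession` trait from the question. This is a minimal sketch: calling `FileSystem.closeAll()` before `super.beforeAll()` is one reasonable placement, not the only one.

```scala
import org.apache.hadoop.fs.FileSystem
import org.scalatest.Suite

trait SharedHiveSparkSession extends SharedSparkSession { this: Suite =>
  override def beforeAll(): Unit = {
    // Evict every cached FileSystem instance left over from a previous
    // suite in the same JVM; Hadoop does not clear this cache when
    // spark.stop() is called.
    FileSystem.closeAll()

    super.beforeAll()

    // ... build the Hive-enabled SparkSession exactly as shown in the
    // question (warehouse dir, Derby metastore, enableHiveSupport) ...
  }
}
```

After this, each suite starts from a clean FileSystem cache, so the `fs.file.impl` override set on the new session's Hadoop configuration actually takes effect instead of being shadowed by a cached instance.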