I'm working on a Scala project using Spark (with Hive support in some tests) and running unit and integration tests via both IntelliJ and Maven Surefire.
I have a shared test session setup like this:
- SharedSparkSession (base trait)
- SharedUnitTestSparkSession (no Hive support, lightweight)
- SharedHiveSparkSession (enables Hive, uses a Derby metastore)
Some tests need Hive support (e.g., creating temp views, running SQL queries) and some don’t.
The issue:

- Tests pass individually in IntelliJ
- Tests pass when running all tests in IntelliJ
- Tests pass on the GitLab CI server (which presumably runs on Unix)
- Tests fail when running mvn clean test from the command line locally, or from IntelliJ's Maven panel

The errors are all of one kind: Hadoop home directory / filesystem implementation errors (HADOOP_HOME / winutils.exe issues).
I've tried:

    spark.sparkContext.hadoopConfiguration
      .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])

This is set in both the unit-test and Hive traits:
    trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>

      @transient protected var spark: SparkSession = _

      override def beforeAll(): Unit = {
        super.beforeAll()
        spark = SparkSession.builder()
          .appName("Unit Test Session")
          .master("local[*]")
          .getOrCreate()
        spark.sparkContext.hadoopConfiguration
          .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
      }

      override def afterAll(): Unit = {
        if (spark != null) {
          spark.stop() // also clears the active/default sessions
          spark = null
        }
        super.afterAll()
      }
    }
The Hive one is set up as follows:

    trait SharedHiveSparkSession extends SharedSparkSession { this: Suite =>

      override def beforeAll(): Unit = {
        super.beforeAll()
        val tempDir = java.nio.file.Files.createTempDirectory("spark_metastore").toAbsolutePath.toString
        spark = SparkSession.builder()
          .appName("Hive Test Session")
          .master("local[*]")
          .config("spark.sql.warehouse.dir", s"$tempDir/warehouse")
          .config("javax.jdo.option.ConnectionURL", s"jdbc:derby:metastore_db_${UUID.randomUUID().toString};create=true")
          .enableHiveSupport()
          .getOrCreate()
        spark.sparkContext.hadoopConfiguration
          .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
      }
    }
And the unit-test one looks like this:

    trait SharedUnitTestSparkSession extends SharedSparkSession { this: Suite =>

      override def beforeAll(): Unit = {
        super.beforeAll()
        spark = SparkSession.builder()
          .appName("Unit Test Spark Session")
          .master("local[*]")
          .getOrCreate()
        spark.sparkContext.hadoopConfiguration
          .setClass("fs.file.impl", classOf[org.apache.hadoop.fs.BareLocalFileSystem], classOf[FileSystem])
      }
    }
Then the test:

    class MyIntegrationTest extends AnyFlatSpec with SharedHiveSparkSession {

      "A Hive-backed DataFrame" should "perform test operation" in {
        implicit val implicitSpark: SparkSession = spark
        import spark.implicits._

        val df = Seq((1, "foo"), (2, "bar")).toDF("id", "value")
        df.createOrReplaceTempView("my_table")

        val result = spark.sql("SELECT COUNT(*) as total FROM my_table").collect().head.getLong(0)
        assert(result == 2)
      }
    }
The test doesn't run; it fails with "hadoop home directory is unset" errors, which I've tried to bypass with the BareLocalFileSystem implementation.
What I'm looking for:

- A reliable way to isolate Spark Hive sessions during Maven test runs
- Clean strategies to avoid Derby metastore lock issues and filesystem config conflicts
- Code suggestions, or a pointer to what I'm missing in Maven or the configuration in general, would really help :)
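On the Maven side, one knob worth checking is Surefire's forking model: when every suite shares one long-lived JVM, Derby lock files and JVM-wide Hadoop state can leak between suites, whereas a fresh fork per suite avoids both. A hedged pom.xml sketch (the plugin version and memory value are assumptions, not taken from this project):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <!-- assumed version; use whatever your build already pins -->
  <version>3.2.5</version>
  <configuration>
    <!-- one forked JVM at a time -->
    <forkCount>1</forkCount>
    <!-- fresh JVM per suite: no shared Derby metastore or FileSystem cache -->
    <reuseForks>false</reuseForks>
    <!-- Spark in local mode needs some heap headroom; value is an assumption -->
    <argLine>-Xmx2g</argLine>
  </configuration>
</plugin>
```

The trade-off is slower runs (one JVM startup per suite), so some projects only disable fork reuse for the Hive-backed integration suites via a separate Surefire execution.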
Versions:
It turned out that Hadoop's FileSystem cache was hanging around in the JVM across test suites, even after SparkContext.stop(), because by design Hadoop doesn't automatically evict cached FileSystem instances per scheme when you shut down Spark.

So adding:

    FileSystem.closeAll()

in beforeAll() of SharedHiveSparkSession worked for me. It wipes out any lingering FileSystem handles (such as LocalFileSystem) that might have been instantiated in a previous test suite's context.
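Concretely, the change is a one-liner at the top of beforeAll(). A minimal sketch, assuming the SharedSparkSession trait from the question (the session-building code is elided, not changed):

```scala
import org.apache.hadoop.fs.FileSystem
import org.scalatest.Suite

trait SharedHiveSparkSession extends SharedSparkSession { this: Suite =>

  override def beforeAll(): Unit = {
    // Evict all FileSystem instances cached JVM-wide by a previous suite;
    // SparkContext.stop() deliberately leaves this cache alone, so a stale
    // LocalFileSystem from an earlier suite can otherwise survive here.
    FileSystem.closeAll()

    super.beforeAll()
    // ... build the Hive-enabled session exactly as in the question ...
  }
}
```

Calling closeAll() before super.beforeAll() matters: the goal is to clear handles left by the previous suite before this suite's session (and its Hadoop configuration) is created.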