Tags: pyspark, hive, spark-2.4.4

spark not downloading hive_metastore jars


Environment

I am using Spark v2.4.4 via the Python API.

Problem

According to the Spark documentation, I can force Spark to download all the Hive jars needed to interact with my Hive metastore by setting spark.sql.hive.metastore.jars to maven (together with spark.sql.hive.metastore.version).

However, when I run the following Python code, no jar files are downloaded from Maven.

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   # Target a Hive 2.3.3 metastore and ask Spark to fetch the matching
   # client jars from Maven instead of using the built-in Hive classes.
   conf = (
       SparkConf()
       .setAppName("myapp")
       .set("spark.sql.hive.metastore.version", "2.3.3")
       .set("spark.sql.hive.metastore.jars", "maven")
   )
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()
   )

How do I know that no jar files are downloaded?

  1. I have configured logLevel=INFO as the default by setting log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO in $SPARK_HOME/conf/log4j.properties. I can see no logging that says Spark is interacting with Maven, although according to this I should see an INFO-level log line.
  2. Even if my logging were somehow broken, the SparkSession object builds far too quickly to be pulling large jars from Maven: it returns in under 5 seconds. When I instead add the Maven coordinates of the hive metastore jars to "spark.jars.packages" (see the sketch after this list), the download takes minutes.
  3. I have deleted the ~/.ivy2 and ~/.m2 directories, so cached artifacts from previous downloads cannot explain the missing activity.
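
For reference, the comparison from point 2 looked roughly like this. The hive-metastore coordinate below is my own illustration of a manually specified dependency, not the exact set of jars Spark's maven mode would resolve:

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   # Illustrative coordinate only: pulling the Hive 2.3.3 metastore client
   # explicitly via spark.jars.packages. This resolution happens during
   # session startup, which is why this version takes minutes to return.
   conf = (
       SparkConf()
       .setAppName("manual-jars-test")
       .set("spark.jars.packages", "org.apache.hive:hive-metastore:2.3.3")
   )
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()
   )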

Other tests


Solution

  • For anyone else trying to solve this: the jars are not fetched while the SparkSession is being built. Spark constructs its Hive metastore client lazily, so with spark.sql.hive.metastore.jars=maven the download from Maven only starts the first time something actually touches the metastore (a query, listing databases, and so on).
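
A minimal sketch of how to trigger the download, assuming the lazy behaviour described above; the show databases call is just one example of a statement that forces the metastore client to be created:

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   conf = (
       SparkConf()
       .setAppName("myapp")
       .set("spark.sql.hive.metastore.version", "2.3.3")
       .set("spark.sql.hive.metastore.jars", "maven")
   )
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()
   )

   # Nothing has been downloaded yet: the session itself builds in seconds.
   # The first statement that needs the metastore forces Spark to create
   # its isolated Hive client, and that is when the Maven resolution runs.
   spark.sql("show databases").show()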