Tags: pyspark, jar, snowflake-cloud-data-platform, aws-glue, aws-glue-connection

'spark.jars.packages' not working as expected in AWS Glue and Spark


I want to use some JAR files from a Maven repository in my Spark session, so I am creating the session with 'spark.jars.packages', which should download the JARs automatically. This is not working as expected, even though the session config is set correctly: (('spark.jars.packages', 'net.snowflake:snowflake-jdbc:3.13.6,net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1'),).

I still get the error: "Failed to find data source: net.snowflake.spark.snowflake. Please find packages at https://spark.apache.org/third-party-projects.html". The error goes away if I upload the JARs manually.

I am using Glue v4.

Uploading the JARs manually works, but I need them to be downloaded automatically.

What can I try next?


Solution

  • Glue doesn't allow dynamic loading of packages using "spark.jars.packages".

    To add dependencies, you need to use the magics %additional_python_modules and %extra_jars. For Python, you can reference pip modules directly. For JARs, unfortunately, %extra_jars doesn't accept Maven coordinates: you need to get the JARs, put them on S3, and then reference them using %extra_jars.
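
    The "get the JARs, put them on S3" step could be scripted roughly as below. This is only a sketch under a few assumptions: the JARs are published on Maven Central under its standard directory layout, boto3 is installed with credentials that can write to the bucket, and the bucket name and prefix ("my-bucket", "jars") are placeholders rather than anything from the question. The helper names are made up for illustration.

```python
import os
import tempfile
import urllib.request

# Assumption: both artifacts are hosted on Maven Central.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# Coordinates taken from the question's 'spark.jars.packages' value.
COORDINATES = [
    "net.snowflake:snowflake-jdbc:3.13.6",
    "net.snowflake:spark-snowflake_2.12:2.9.0-spark_3.1",
]


def jar_name(coordinate):
    """File name of the JAR for a 'group:artifact:version' coordinate."""
    _group, artifact, version = coordinate.split(":")
    return f"{artifact}-{version}.jar"


def maven_jar_url(coordinate):
    """Download URL following the standard Maven repository layout."""
    group, artifact, version = coordinate.split(":")
    group_path = group.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact}/{version}/{jar_name(coordinate)}"


def s3_uri(coordinate, bucket, prefix):
    """S3 location where the JAR will be staged (hypothetical bucket/prefix)."""
    return f"s3://{bucket}/{prefix.strip('/')}/{jar_name(coordinate)}"


def extra_jars_magic(coordinates, bucket, prefix):
    """Build the %extra_jars line: a comma-separated list of S3 paths."""
    return "%extra_jars " + ",".join(s3_uri(c, bucket, prefix) for c in coordinates)


def stage_jars(coordinates, bucket, prefix):
    """Download each JAR and upload it to S3. Needs network access and boto3."""
    import boto3  # third-party; imported here so the helpers above work without it

    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp:
        for c in coordinates:
            local_path = os.path.join(tmp, jar_name(c))
            urllib.request.urlretrieve(maven_jar_url(c), local_path)
            s3.upload_file(local_path, bucket, f"{prefix.strip('/')}/{jar_name(c)}")


# Print the magic to paste at the top of the Glue notebook.
print(extra_jars_magic(COORDINATES, "my-bucket", "jars"))
```

    You would run stage_jars(COORDINATES, "my-bucket", "jars") once from a machine with S3 write access, then paste the printed %extra_jars line into the notebook before the session starts. It is still a manual upload in the sense that Glue itself never resolves the Maven coordinates, but the download and upload no longer have to be done by hand.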