I'm trying to consume, in Spark, a package published to GitHub Packages here. The package is meant to be used in Spark with a command such as this:
spark-submit --master "local[*]" \
--class com.target.data_validator.Main \
--packages com.target:data-validator_2.11:0.13.2 \
--repositories "https://${GITHUB_USER}:${GITHUB_PACKAGES_TOKEN}@maven.pkg.github.com/target/data-validator" \
empty.file \
--config local_validators.yaml \
--jsonReport report.json \
--htmlReport report.html
When I run this, I get the ol' not found error:
[unresolved dependency: com.target#data-validator_2.11;0.13.2: not found]
Looking at the URLs that Spark tried for resolution, I can see the repository I've specified in the command, along with the correct credentials specified above as variables. If I copy-paste the URL that spark-submit claims to have tried, it works in a browser, with wget, and with curl. I suspect that Spark is having trouble handling either the credentials or the redirect(s) that GitHub Packages issues before serving the final artifact URL.
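For reference, a fetch like the following works from the shell. The artifact path here is illustrative, following the standard Maven repository layout; the actual URL is whatever spark-submit reports trying:

# Illustrative POM fetch; substitute the URL from spark-submit's resolution output.
curl -fsSL -u "${GITHUB_USER}:${GITHUB_PACKAGES_TOKEN}" \
  -o data-validator_2.11-0.13.2.pom \
  "https://maven.pkg.github.com/target/data-validator/com/target/data-validator_2.11/0.13.2/data-validator_2.11-0.13.2.pom"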
I also tried creating an Ivy settings file and passing it to the spark-submit command with --conf spark.jars.ivySettings=dv-ivy.xml:
<ivysettings>
  <settings defaultResolver="ghp-dv"/>
  <credentials
    host="maven.pkg.github.com"
    realm="GitHub Package Registry"
    username="${GITHUB_USER}"
    passwd="${GITHUB_PACKAGES_TOKEN}"/>
  <!-- real credentials hardcoded -->
  <resolvers>
    <ibiblio
      name="ghp-dv"
      m2compatible="true"
      root="https://maven.pkg.github.com/target/data-validator"/>
  </resolvers>
</ivysettings>
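(As an aside: the real file has the credentials hardcoded, per the comment above. One way to keep them out of the file itself, assuming the ${...} placeholders shown, is to render it from a template at launch time with envsubst from GNU gettext:)

# Hypothetical template workflow: dv-ivy.xml.tmpl is the XML above with
# ${GITHUB_USER} and ${GITHUB_PACKAGES_TOKEN} left as placeholders.
envsubst '${GITHUB_USER} ${GITHUB_PACKAGES_TOKEN}' < dv-ivy.xml.tmpl > dv-ivy.xml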
Passing that settings file works, but then none of my cluster's Ivy settings are respected, since the file given via spark.jars.ivySettings replaces them outright rather than merging with them. I really don't want to copy, insert, and manage a fork of that configuration in this job; it isn't ergonomic, so I'm still in search of a solution.
How can I properly configure Spark to resolve and retrieve the package from GitHub Packages without having to duplicate Ivy settings?
I am using Spark 2.3.1 and cannot use anything newer right now, for reasons outside of my control.
I solved this by creating a resolver chain inside the ivysettings XML file:
<ivysettings>
  <settings defaultResolver="thechain">
    <credentials
      host="maven.pkg.github.com"
      username="${GITHUB_USER}"
      passwd="${GITHUB_USER_READPACKAGES_TOKEN}"
      realm="GitHub Package Registry"/>
  </settings>
  <resolvers>
    <chain name="thechain">
      <ibiblio name="central" m2compatible="true"
        root="https://repo1.maven.org/maven2/"/>
      <ibiblio name="ghp-dv" m2compatible="true"
        root="https://maven.pkg.github.com/target/data-validator"/>
    </chain>
  </resolvers>
</ivysettings>
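The chain is what makes this work: Ivy tries the resolvers in order, so Maven Central serves the transitive dependencies and the GitHub Packages resolver only has to supply the data-validator artifact itself. If you want to sanity-check the settings file outside of Spark, a resolve through Ivy's standalone jar should work (the jar path below is whatever copy of Ivy you have locally):

# Resolve the artifact directly through Ivy, bypassing Spark entirely.
java -jar ivy-2.4.0.jar \
  -settings dv-ivy.xml \
  -dependency com.target data-validator_2.11 0.14.1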
Ensure that the token has the read:packages permission, and consider granting it only that permission if it'll live on a shared system.
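For a classic personal access token, you can confirm the granted scopes from the X-OAuth-Scopes header that the GitHub API returns (a quick check, assuming the token is in GITHUB_USER_READPACKAGES_TOKEN):

# Print the scopes GitHub reports for the token.
curl -fsS -I -H "Authorization: token ${GITHUB_USER_READPACKAGES_TOKEN}" \
  https://api.github.com/user | grep -i '^x-oauth-scopes'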
Invoke spark-submit with:
spark-submit --master "local[*]" \
--class com.target.data_validator.Main \
--packages com.target:data-validator_2.11:0.14.1 \
--conf spark.jars.ivySettings=$(pwd)/dv-ivy.xml \
empty.file \
--config local_validators.yaml \
--jsonReport report.json \
--htmlReport report.html
N.b. the need for an empty.file (touch empty.file): spark-submit requires an application file argument even though the main class here comes from the --packages jar, so an empty placeholder satisfies it. You'll also need to adjust the path to the dv-ivy.xml file containing the XML above.