Tags: maven, apache-spark, ivy, github-package-registry

How do I add a GitHub Package repository when executing spark-submit --repositories?


I'm trying to use, in Spark, a package published to GitHub Packages. The package is meant to be run with a command such as this:

spark-submit --master "local[*]" \
  --class com.target.data_validator.Main \
  --packages com.target:data-validator_2.11:0.13.2 \
  --repositories "https://${GITHUB_USER}:${GITHUB_PACKAGES_TOKEN}@maven.pkg.github.com/target/data-validator" \
  empty.file \
  --config local_validators.yaml \
  --jsonReport report.json \
  --htmlReport report.html

When I run this, I get the ol' not found error:

[unresolved dependency: com.target#data-validator_2.11;0.13.2: not found]

Looking at the URLs that Spark tried for resolution, I can see the repository I've specified in the command, along with the correct credentials specified above as variables. If I copy-paste the URL that spark-submit claims to have tried, it works in a browser, with wget, and with curl. I speculate that Spark is having trouble handling the credentials or the redirect(s) that GHP provides before giving the final data URL.
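For anyone who wants to repeat the manual check described above, here is a minimal sketch of how the POM URL can be derived from the --packages coordinates. It assumes the repository serves the standard Maven layout (group ID dots become slashes); maven_pom_url is a hypothetical helper name, and the curl line is left commented out because it needs network access and a valid token.

```shell
#!/bin/sh
# maven_pom_url is a hypothetical helper, not part of Spark or Ivy.
# Assumption: the repository uses the standard Maven directory layout.
maven_pom_url() {
  repo_root=$1
  group=$2
  artifact=$3
  version=$4
  # Maven layout turns the dots in the group ID into path separators.
  group_path=$(printf '%s' "$group" | tr '.' '/')
  # ${repo_root%/} drops one trailing slash so the URL has no "//".
  printf '%s/%s/%s/%s/%s-%s.pom\n' \
    "${repo_root%/}" "$group_path" "$artifact" "$version" "$artifact" "$version"
}

url=$(maven_pom_url "https://maven.pkg.github.com/target/data-validator" \
  com.target data-validator_2.11 0.13.2)
echo "$url"

# Manual check with credentials (requires network and a token):
# curl -fsSL -u "${GITHUB_USER}:${GITHUB_PACKAGES_TOKEN}" -o /dev/null -w '%{http_code}\n' "$url"
```

If the curl check succeeds while spark-submit still reports "not found", that supports the suspicion that the problem lies in how the resolver handles the credentials rather than in the URL itself.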

I also tried creating an Ivy settings file and passing it to the spark-submit command with --conf spark.jars.ivySettings=dv-ivy.xml:

<ivysettings>
    <settings defaultResolver="ghp-dv"/>
    <credentials 
        host="maven.pkg.github.com"
        realm="GitHub Package Registry"
        username="${GITHUB_USER}"
        passwd="${GITHUB_PACKAGES_TOKEN}"/>
        <!-- real credentials hardcoded -->
    <resolvers>
        <ibiblio
            name="ghp-dv"
            m2compatible="true"
            root="https://maven.pkg.github.com/target/data-validator"/>
    </resolvers>
</ivysettings>

This works, but then none of my cluster's Ivy settings are respected. I really don't want to have to copy, insert, and manage a fork of that configuration in this job. It's not ergonomic, so I'm still in search of a solution.

How can I properly configure Spark to resolve and retrieve the package from GitHub Packages without having to duplicate Ivy settings?

I am using Spark 2.3.1 and cannot use anything newer right now for reasons outside of my control.


Solution

  • I solved this by creating a chain inside of the ivysettings XML file:

    <ivysettings>
        <settings defaultResolver="thechain">
            <credentials
                host="maven.pkg.github.com"
                username="${GITHUB_USER}"
                passwd="${GITHUB_USER_READPACKAGES_TOKEN}"
                realm="GitHub Package Registry"/>
        </settings>
        <resolvers>
            <chain name="thechain">
                <ibiblio name="central" m2compatible="true" 
                    root="https://repo1.maven.org/maven2/" />
                <ibiblio name="ghp-dv" m2compatible="true" 
                    root="https://maven.pkg.github.com/target/data-validator"/>
            </chain>
        </resolvers>
    </ivysettings>
    

    Ensure that the token has the read:packages scope, and consider granting it only that scope if it will be stored on a shared system.

    Invoke spark-submit with:

    spark-submit --master "local[*]" \
      --class com.target.data_validator.Main \
      --packages com.target:data-validator_2.11:0.14.1 \
      --conf spark.jars.ivySettings=$(pwd)/dv-ivy.xml \
      empty.file \
      --config local_validators.yaml \
      --jsonReport report.json \
      --htmlReport report.html
    

    N.b. the command needs empty.file to exist (create it with touch empty.file), and you will need to adjust the path to the dv-ivy.xml file containing the XML above.
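The preparation steps above can be scripted. This is only a sketch: it assumes the ${GITHUB_USER} and ${GITHUB_USER_READPACKAGES_TOKEN} placeholders in the XML are meant to be replaced with real values before Ivy reads the file (the question mentions hardcoding them), and it lets the shell do that substitution through a heredoc. write_ivy_settings is a hypothetical helper name; the XML body is copied from the answer.

```shell
#!/bin/sh
# write_ivy_settings is a hypothetical helper. The body runs in a subshell
# (note the parentheses) so the umask change and any failure stay local.
write_ivy_settings() (
  out=$1
  # Stop with an error if either credential variable is unset or empty.
  : "${GITHUB_USER:?set GITHUB_USER}"
  : "${GITHUB_USER_READPACKAGES_TOKEN:?set GITHUB_USER_READPACKAGES_TOKEN}"
  # The file will contain a token, so create it readable by the owner only.
  umask 077
  # Unquoted EOF: the shell expands the two variables while writing the file.
  cat > "$out" <<EOF
<ivysettings>
    <settings defaultResolver="thechain">
        <credentials
            host="maven.pkg.github.com"
            username="${GITHUB_USER}"
            passwd="${GITHUB_USER_READPACKAGES_TOKEN}"
            realm="GitHub Package Registry"/>
    </settings>
    <resolvers>
        <chain name="thechain">
            <ibiblio name="central" m2compatible="true"
                root="https://repo1.maven.org/maven2/" />
            <ibiblio name="ghp-dv" m2compatible="true"
                root="https://maven.pkg.github.com/target/data-validator"/>
        </chain>
    </resolvers>
</ivysettings>
EOF
)

# Example usage before calling spark-submit:
#   write_ivy_settings "$(pwd)/dv-ivy.xml"
#   touch empty.file
```

Generating the file at submit time keeps the token out of version control, at the cost of writing it to disk in plain text, which is why the sketch restricts the file permissions.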