apache-spark pyspark azure-synapse python-wheel graphframes

GraphFrames for pyspark in Azure Synapse


I'm trying to run the basic graphframes Python sample on Azure Synapse. It works fine when I upload the correct .jar file from here and write the code in Scala. But the same .jar file doesn't get picked up when running the Python version of the code (it throws a ModuleNotFoundError). The Azure Synapse docs imply that Python packages should only be uploaded as .whl files. However, there doesn't seem to be a graphframes wheel file for any version other than 0.6.0, the one found on pip, which doesn't support Spark 3.x.

So the question is, how can I get graphframes working on synapse?

Alternatively, how can I create a .whl file from the matching .jar?


Solution

  • So, apparently, there's no need for jars or wheels or fancy table sets. Add this at the beginning of the notebook:

    %%configure -f
    {    
        "conf": {
            "spark.jars.packages": "graphframes:graphframes:0.8.2-spark3.2-s_2.12"
        }
    }
    

    EDIT: Apparently, a better way to do this is to configure the above in your Spark pool; otherwise, every time this cell runs, the Spark session restarts (~40 seconds).
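
The package coordinate in the cell above encodes the graphframes release, the Spark major.minor version, and the Scala version, so it has to match your pool. As a rough sketch, the helper below builds that string and the matching `%%configure` body. The function names `graphframes_coordinate` and `configure_body` are made up for illustration, and the pattern is inferred from the coordinate in the answer. Not every combination of versions is necessarily published, so check that the artifact exists before relying on it.

```python
import json


def graphframes_coordinate(graphframes_version: str, spark_version: str, scala_version: str) -> str:
    """Build a Maven coordinate shaped like the one used in the answer above.

    Hypothetical helper: the pattern is inferred from
    "graphframes:graphframes:0.8.2-spark3.2-s_2.12".
    """
    # Only the major.minor part of the Spark version appears in the artifact name.
    spark_major_minor = ".".join(spark_version.split(".")[:2])
    return f"graphframes:graphframes:{graphframes_version}-spark{spark_major_minor}-s_{scala_version}"


def configure_body(coordinate: str) -> str:
    """Return the JSON body that goes under the %%configure -f magic."""
    return json.dumps({"conf": {"spark.jars.packages": coordinate}}, indent=4)


# Example values only; substitute the versions your Spark pool actually runs.
coordinate = graphframes_coordinate("0.8.2", "3.2.1", "2.12")
print("%%configure -f")
print(configure_body(coordinate))
```

Paste the printed text into the first cell of the notebook. Once the session has started with the package loaded, `from graphframes import GraphFrame` should no longer raise ModuleNotFoundError.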