Tags: apache-spark, pyspark, gcloud, google-cloud-dataproc, graphframes

Unable to import graphframes in pyspark shell on gcloud dataproc spark cluster


Created a Spark cluster through the gcloud console with the following options:

gcloud dataproc clusters create cluster-name --region us-east1 --num-masters 1 --num-workers 2 --master-machine-type n1-standard-2 --worker-machine-type n1-standard-1 --metadata spark-packages=graphframes:graphframes:0.2.0-spark2.1-s_2.11

On the Spark master node, I launched the pyspark shell as follows:

pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11

...

found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in spark-packages

[SUCCESSFUL ] graphframes#graphframes;0.2.0-spark2.0-s_2.11!graphframes.jar (578ms)

...

    graphframes#graphframes;0.2.0-spark2.0-s_2.11 from spark-packages in [default]
    org.scala-lang#scala-reflect;2.11.0 from central in [default]
    org.slf4j#slf4j-api;1.7.7 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   5   |   5   |   5   |   0   ||   5   |   5   |
    ---------------------------------------------------------------------

...

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.

>>> from graphframes import *

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes

How do I load graphframes on a gcloud Dataproc Spark cluster?


Solution

  • This seems to be a known issue: you have to jump through some hoops to get graphframes working in pyspark (a sketch of the usual workaround follows below): https://github.com/graphframes/graphframes/issues/238, https://github.com/graphframes/graphframes/issues/172
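The workaround discussed in those issues is that --packages only puts the graphframes jar on the JVM classpath; the Python bindings bundled inside the jar never land on the Python path, hence the ImportError. A minimal sketch of that workaround is below. It assumes the jar has already been cached by --packages under ~/.ivy2/jars; the exact cache path and jar file name are assumptions and may differ on your cluster.

    # Assumption: a prior pyspark --packages run cached the jar here;
    # adjust the path/file name to whatever ivy actually downloaded.
    cd ~
    cp ~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar .

    # The jar bundles the graphframes Python package; extract it and zip it
    # so it can be shipped to the Python side with --py-files.
    unzip graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar 'graphframes/*'
    zip -r graphframes.zip graphframes

    # Launch the shell with the jar (JVM side) and the zip (Python side).
    pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
            --py-files graphframes.zip

With the zip on the Python path, from graphframes import * should succeed in the shell. For spark-submit jobs the same --packages / --py-files pair applies.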