Created a Spark cluster with the gcloud CLI using the following options:
gcloud dataproc clusters create cluster-name --region us-east1 --num-masters 1 --num-workers 2 --master-machine-type n1-standard-2 --worker-machine-type n1-standard-1 --metadata spark-packages=graphframes:graphframes:0.2.0-spark2.1-s_2.11
On the Spark master node, launched the pyspark shell as follows:
pyspark --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
...
found graphframes#graphframes;0.2.0-spark2.0-s_2.11 in spark-packages
[SUCCESSFUL ] graphframes#graphframes;0.2.0-spark2.0-s_2.11!graphframes.jar (578ms)
...
graphframes#graphframes;0.2.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
...
Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> from graphframes import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named graphframes
How do I load graphframes on a gcloud Dataproc Spark cluster?
This seems to be a known issue where you have to jump through hoops to get it working in pyspark: https://github.com/graphframes/graphframes/issues/238, https://github.com/graphframes/graphframes/issues/172
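The hoop-jumping in those issues boils down to one mechanism: the graphframes jar that `--packages` downloads also bundles the Python package, and since a jar is just a zip archive, Python can import straight from it once the jar's path is on `sys.path`. A minimal, self-contained sketch of that mechanism (the jar built here is a stand-in; on the cluster you would insert the path to the real graphframes jar, e.g. under the ivy cache on the master node, which is an assumption about where `--packages` stored it):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in "jar" containing a Python package, to show
# that Python imports directly from a zip archive placed on sys.path.
# (The real graphframes jar bundles its Python package the same way.)
tmpdir = tempfile.mkdtemp()
jar_path = os.path.join(tmpdir, "demo.jar")
with zipfile.ZipFile(jar_path, "w") as jar:
    jar.writestr("demopkg/__init__.py", "VERSION = '0.2.0'\n")

# Putting the archive itself on sys.path makes its packages importable.
sys.path.insert(0, jar_path)
from demopkg import VERSION
print(VERSION)  # -> 0.2.0
```

On the master node the equivalent step, run inside the pyspark shell before `from graphframes import *`, is `sys.path.insert(0, "<path-to-graphframes-jar>")` with the actual path to the downloaded jar; the linked GitHub issues discuss variations of this workaround.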