Tags: pyspark, aws-glue

How to enable PySpark in a Glue ETL job?


I have a very simple Glue ETL Job with the following code:

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
conf = sc.getConf()

print(conf.toDebugString())

The Job is created with a Redshift connection enabled. When executing the Job I get:

No module named pyspark.context

The public documentation all seems to mention or imply that pyspark is available, so why is my environment complaining that it doesn't have pyspark? What steps am I missing?

Best Regards, Lim


Solution

  • Python Shell jobs only support plain Python and libraries such as pandas and scikit-learn; they do not include PySpark. Create the job with job type = Spark and ETL language = Python instead, and the pyspark module will be available (see the sketch below).
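
As a rough sketch, the same choice can be made when creating the job through the boto3 Glue API; the job name, IAM role, script location, and connection name below are placeholders, and the equivalent in the console is simply selecting "Spark" as the job type:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Command Name "glueetl" creates a Spark ETL job; "pythonshell" would create
# a Python Shell job, which is the type that lacks PySpark.
response = glue.create_job(
    Name="my-pyspark-etl-job",                                 # placeholder job name
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",   # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",     # placeholder script path
        "PythonVersion": "3",
    },
    Connections={"Connections": ["my-redshift-connection"]},   # placeholder Redshift connection
    GlueVersion="3.0",
)
print(response["Name"])

Once the job runs as a Spark ETL job, the original from pyspark.context import SparkContext line works unchanged, because the Spark runtime (and the awsglue library) is part of that job type's environment.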