pyspark, amazon-emr, apache-sedona

How do I build a PySpark EMR app using Python to spin up the cluster and apply the steps?


I am trying to upgrade apache-sedona to version 1.7.1, but it keeps failing when the cluster is spun up using Python functions.

If I spin up the cluster manually, everything works fine with no errors, but if I use an external function it keeps failing. I have followed what the Sedona site says, but it does not run well and the step fails within 30 seconds.

I am spinning up an EMR-6.9.0 cluster (Hadoop 3.3.3 and Spark 3.3.0). I have also tried other EMR versions, but the result is the same. I thought matching the Spark version to the jars I was using would solve it, but that's not the case.

The jars I am using are:

a) postgresql-42.7.3.jar
b) geotools-wrapper-1.7.1-28.5.jar
c) sedona-spark-shaded-3.3_2.12-1.7.1.jar

I copy them to /usr/lib/spark/jars from an S3 bucket using sudo aws s3 cp ...

In the bootstrap I have also added:

export PATH="/hadoop/local/bin:$PATH"

export SPARK_HOME=/usr/lib/spark

export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

along with the sudo python3 -m pip install commands listed on the apache-sedona website.
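For context, this is roughly how I attach that bootstrap script when creating the cluster from Python; a minimal sketch assuming boto3, where the region, names, S3 path, and instance settings below are placeholders rather than my real values:

    import boto3

    emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

    cluster = emr.run_job_flow(
        Name='sedona-emr-cluster',                      # placeholder name
        ReleaseLabel='emr-6.9.0',
        Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}],
        Instances={
            'MasterInstanceType': 'm5.xlarge',          # placeholder sizes
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        BootstrapActions=[{
            'Name': 'install-sedona-deps',
            # the script with the aws s3 cp, exports and pip installs above
            'ScriptBootstrapAction': {'Path': 's3://my-bucket/bootstrap.sh'},
        }],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
    print(cluster['JobFlowId'])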

I am submitting the steps using a function which fetches the location of the step script from a config file and constructs the submission like this:

    step = {
        'Name': step_name,
        'ActionOnFailure': step_config['actiononfail'],
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
                '--conf', 'spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator',
                step_config['stepscriptlocation']
            ]
        }
    }
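A trimmed sketch of how a step dict in this shape is normally submitted through boto3's add_job_flow_steps (the cluster id below is a placeholder):

    import boto3

    emr = boto3.client('emr')

    # submit the step dict built above to the running cluster
    response = emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
        Steps=[step],
    )
    print(response['StepIds'])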

I spin up the cluster using this configuration:

    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.yarn.dist.jars": "/usr/lib/spark/jars/sedona-spark-shaded-3.3_2.12-1.7.1.jar,/usr/lib/spark/jars/geotools-wrapper-1.7.1-28.5.jar",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
          "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
        }
      }
    ]
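That classification list goes into the Configurations parameter of the run_job_flow call from the bootstrap sketch above; roughly like this (the filename is made up):

    import json

    # load the spark-defaults classification shown above
    with open('cluster_config.json') as f:  # hypothetical filename
        configurations = json.load(f)

    # then passed into the same run_job_flow call:
    #   emr.run_job_flow(..., Configurations=configurations, ...)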

and in the PySpark script I declare the context like this:

from sedona.spark import *

config = SedonaContext.builder() \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.3_2.12:1.7.1,'
            'org.datasyslab:geotools-wrapper:1.7.1-28.5,'
            'org.postgresql:postgresql:42.7.3') \
    .getOrCreate()

# Create Sedona context
sedona = SedonaContext.create(config)
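A minimal sanity check that the context came up and Sedona's SQL functions are registered would be something like this (ST_Point and ST_AsText ship with Sedona):

    # quick check that Sedona's SQL functions are available
    sedona.sql("SELECT ST_AsText(ST_Point(1.0, 2.0)) AS wkt").show()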

These are the other libraries I install in the bootstrap to use in the Spark app, in case one of them is conflicting, but I don't think so, as the manually spun-up cluster works:

sudo python3 -m pip install pandas
sudo python3 -m pip install shapely
sudo python3 -m pip install geopandas
sudo python3 -m pip install keplergl==0.3.2
sudo python3 -m pip install pydeck==0.8.0
sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.7.1
sudo python3 -m pip install sqlalchemy==1.4.45
sudo python3 -m pip install psycopg2-binary
sudo python3 -m pip install requests==2.25.1
sudo python3 -m pip install pyarrow
sudo python3 -m pip install numpy
sudo python3 -m pip install dask
sudo python3 -m pip install h3
sudo python3 -m pip install boto3
sudo python3 -m pip install slackclient
sudo python3 -m pip install geoalchemy2
sudo python3 -m pip install dask-geopandas
sudo python3 -m pip install word2number
sudo python3 -m pip install wordtodigits
sudo python3 -m pip install numwords_to_nums
sudo python3 -m pip install h3-pyspark
sudo python3 -m pip install fsspec
sudo python3 -m pip install fuzzywuzzy
sudo python3 -m pip install levenshtein
sudo python3 -m pip install --upgrade certifi

So I think it is related to the jars or to how I am passing them to the cluster.

I am really stuck here, so any help would be appreciated.


Solution

  • After several retries and errors, I think the problem was passing:

        [
          {
            "Classification": "spark-defaults",
            "Properties": {
              "spark.yarn.dist.jars": "/usr/lib/spark/jars/sedona-spark-shaded-3.3_2.12-1.7.1.jar,/usr/lib/spark/jars/geotools-wrapper-1.7.1-28.5.jar",
              "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
              "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
              "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
            }
          }
        ]
    

    Instead, I submitted it as a normal step and it worked with:

        {
            'Name': step_name,
            'ActionOnFailure': step_config['actiononfail'],
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '--deploy-mode', 'cluster', step_config['stepscriptlocation']]
            }
        }
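    My guess, and it is only a guess, is that the bootstrap already places the jars in /usr/lib/spark/jars on every node, so re-shipping them through spark.yarn.dist.jars and re-declaring the serializer and extensions (which the SedonaContext setup inside the script also configures) was redundant and apparently conflicting in cluster mode; dropping the classification and the extra --conf flags removed the duplication.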