I am trying to update apache-sedona to version 1.7.1, but it keeps failing when the cluster is spun up using Python functions.
If I spin up the cluster manually, everything works fine with no errors, but if I use the external function it keeps failing. I have followed what their site says, but it does not run well and the step fails within 30 seconds.
I am spinning up an EMR-6.9.0 cluster (Hadoop 3.3.3 and Spark 3.3.0). I have also tried other EMR versions, but the result is the same. I thought matching the Spark version to the jars I was using would solve it, but that's not the case.
The jars I am using are: a) postgresql-42.7.3.jar b) geotools-wrapper-1.7.1-28.5.jar c) sedona-spark-shaded-3.3_2.12-1.7.1.jar
I copy them to /usr/lib/spark/jars from an S3 bucket using sudo aws s3 cp ..
In the bootstrap I have also added:
export PATH="/hadoop/local/bin:$PATH"
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
along with the sudo python3 -m pip install commands listed on the apache-sedona website.
I am submitting the steps using a function which fetches the location of the step script from a config file and constructs the submit like this:
step = {
    'Name': step_name,
    'ActionOnFailure': step_config['actiononfail'],
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'spark-submit',
            '--deploy-mode', 'cluster',
            '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
            '--conf', 'spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator',
            step_config['stepscriptlocation']
        ]
    }
}
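For completeness, this is roughly how the step gets submitted from the function; a minimal sketch assuming a boto3 EMR client and a cluster_id variable holding the job flow id (both placeholder names on my side):

import boto3

emr = boto3.client('emr')

# cluster_id is the job flow id of the running cluster (placeholder name)
response = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[step]  # the step dict built above
)
print(response['StepIds'])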
The cluster I spin up using this config:
[
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.yarn.dist.jars": "/usr/lib/spark/jars/sedona-spark-shaded-3.3_2.12-1.7.1.jar,/usr/lib/spark/jars/geotools-wrapper-1.7.1-28.5.jar",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
            "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
        }
    }
]
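The function that spins up the cluster passes that list roughly like this; a minimal sketch assuming boto3, with placeholder names for the cluster, bootstrap script path, roles and instance types:

import boto3

emr = boto3.client('emr')

# configurations is the spark-defaults classification list shown above
response = emr.run_job_flow(
    Name='sedona-cluster',                                    # placeholder
    ReleaseLabel='emr-6.9.0',
    Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}],
    Configurations=configurations,
    BootstrapActions=[{
        'Name': 'install-deps',
        'ScriptBootstrapAction': {'Path': 's3://my-bucket/bootstrap.sh'}  # placeholder
    }],
    Instances={
        'MasterInstanceType': 'm5.xlarge',                    # placeholder
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
cluster_id = response['JobFlowId']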
And in the PySpark script I declare it like this:
from sedona.spark import *

config = SedonaContext.builder() \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.3_2.12:1.7.1,'
            'org.datasyslab:geotools-wrapper:1.7.1-28.5,'
            'org.postgresql:postgresql:42.7.3') \
    .getOrCreate()

# Create Sedona context
sedona = SedonaContext.create(config)
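Once the context is up, a quick sanity check I use (just a sketch, assuming the Sedona SQL extensions registered correctly) is to call one of the ST functions:

df = sedona.sql("SELECT ST_Point(1.0, 2.0) AS geom")
df.show()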
** These are the other libraries I'm installing in the bootstrap to use in the Spark app, in case something is conflicting, but I don't think so since the manually spun-up cluster works:
sudo python3 -m pip install pandas
sudo python3 -m pip install shapely
sudo python3 -m pip install geopandas
sudo python3 -m pip install keplergl==0.3.2
sudo python3 -m pip install pydeck==0.8.0
sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.7.1
sudo python3 -m pip install sqlalchemy==1.4.45
sudo python3 -m pip install psycopg2-binary
sudo python3 -m pip install requests==2.25.1
sudo python3 -m pip install pyarrow
sudo python3 -m pip install numpy
sudo python3 -m pip install dask
sudo python3 -m pip install h3
sudo python3 -m pip install boto3
sudo python3 -m pip install slackclient
sudo python3 -m pip install geoalchemy2
sudo python3 -m pip install dask-geopandas
sudo python3 -m pip install word2number
sudo python3 -m pip install wordtodigits
sudo python3 -m pip install numwords_to_nums
sudo python3 -m pip install h3-pyspark
sudo python3 -m pip install fsspec
sudo python3 -m pip install fuzzywuzzy
sudo python3 -m pip install levenshtein
sudo python3 -m pip install --upgrade certifi
So I think it is related to the jars or how I am passing them to the cluster.
I am really stuck here, so any help would be appreciated.
After several retries and errors, I think the problem was passing:
[
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.yarn.dist.jars": "/usr/lib/spark/jars/sedona-spark-shaded-3.3_2.12-1.7.1.jar,/usr/lib/spark/jars/geotools-wrapper-1.7.1-28.5.jar",
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
            "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
        }
    }
]
Instead, I submitted it as a normal step and it worked with:
{
    'Name': step_name,
    'ActionOnFailure': step_config['actiononfail'],
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['spark-submit', '--deploy-mode', 'cluster', step_config['stepscriptlocation']]
    }
}
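Since the spark-defaults classification is no longer passed, my assumption (a sketch, not something I have verified on every version) is that the Kryo settings can instead be set in the PySpark script itself through the SedonaContext builder:

config = SedonaContext.builder() \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.kryo.registrator', 'org.apache.sedona.core.serde.SedonaKryoRegistrator') \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.3_2.12:1.7.1,'
            'org.datasyslab:geotools-wrapper:1.7.1-28.5') \
    .getOrCreate()

sedona = SedonaContext.create(config)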