I have a tiny on-premises Spark 3.2.0 cluster, with one machine acting as master and the other two as workers. The cluster is deployed on bare metal and everything works fine when I run pyspark from the master machine.
The problem happens when I try to run anything from another machine. Here is my code:
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession, functions

# Connect to the standalone cluster from the client machine
spark = SparkSession.builder.appName("extrair_comex").config("spark.executor.memory", "1g").master("spark://srvsparkm-dev:7077").getOrCreate()

# Download the CSV with pandas and convert it to a Spark DataFrame
link = 'https://www.stats.govt.nz/assets/Uploads/International-trade/International-trade-September-2021-quarter/Download-data/overseas-trade-indexes-September-2021-quarter-provisional-csv.csv'
arquivo = pd.read_csv(link)
df_spark = spark.createDataFrame(arquivo.astype(str))

# Write the result to HDFS as parquet
df_spark.write.mode('overwrite').parquet('hdfs://srvsparkm-dev:9000/lnd/arquivo_extraido_comex.parquet')
Where "srvsparkm-dev" is an alias for the Spark master's IP.
Checking the logs for the "extrair_comex" job, I see this:
The Spark Executor Command:
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-cp" "/home/spark/spark/conf/:/home/spark/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=38571" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@srvairflowcelery-dev:38571" "--executor-id" "157" "--hostname" "srvsparksl1-dev" "--cores" "2" "--app-id" "app-20220204183041-0031" "--worker-url" "spark://Worker@srvsparksl1-dev:37383"
Where "srvairflowcelery-dev" is the machine where the pyspark script is running.
The error:
Caused by: java.io.IOException: Failed to connect to srvairflowcelery-dev/xx.xxx.xxx.xx:38571
Where xx.xxx.xxx.xx is srvairflowcelery-dev's IP.
It seems to me that the master is assigning the task back to the client machine, and that's why it fails. What can I do about this? Can't I submit jobs from another machine?
I solved the problem. The issue was that srvairflowcelery-dev runs inside Docker, so only some ports are published. On top of that, the Spark master tries to communicate with the driver (srvairflowcelery-dev) on a random port, so having ports closed is a problem.
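You can reproduce the failure outside of Spark with a plain socket test from one of the worker machines (a quick diagnostic sketch, not part of the original post; the hostname and port 38571 are taken from the executor command above and the port changes on every run):

import socket

driver_host = "srvairflowcelery-dev"  # machine running the pyspark script (the driver)
driver_port = 38571                   # random driver port shown in the executor command

try:
    # Same connection the executor attempts when it starts up
    with socket.create_connection((driver_host, driver_port), timeout=5):
        print("driver port is reachable")
except OSError as exc:
    # With the port not published by Docker, this fails just like the
    # executor's "Failed to connect" error
    print(f"cannot reach {driver_host}:{driver_port}: {exc}")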
What I did was, first, publish a fixed port range on the Airflow worker container in docker-compose:
airflow-worker:
  <<: *airflow-common
  command: celery worker
  hostname: ${HOSTNAME}
  ports:
    - 8793:8793
    # publish a fixed range for the Spark driver ports
    - "51800-51900:51800-51900"
Then I pinned the Spark ports to fixed values inside that range:
spark = SparkSession.builder.appName("extrair_comex_sb") \
.config("spark.executor.memory", "1g") \
.config("spark.driver.port", "51810") \
.config("spark.fileserver.port", "51811") \
.config("spark.broadcast.port", "51812") \
.config("spark.replClassServer.port", "51813") \
.config("spark.blockManager.port", "51814") \
.config("spark.executor.port", "51815") \
.master("spark://srvsparkm-dev:7077") \
.getOrCreate()
That fixed the problem.
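As a sanity check (not part of the original fix; it assumes the session created above is still running), the driver's ports can be read back from the SparkContext to confirm they landed inside the range published by the container:

# Read back the ports the driver is actually using; they should match
# the pinned values inside the 51800-51900 range exposed by Docker
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.port"))        # expected: 51810
print(conf.get("spark.blockManager.port"))  # expected: 51814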