I am trying to build a containerized mini-batch data processing pipeline using PySpark and Docker, with the processed data stored in Cassandra. I am using a docker-compose file to pull the images for Spark and Cassandra. My PySpark file runs without errors, but I get errors when running the Cassandra steps, such as creating the keyspace and tables. To investigate, I tried running cqlsh in the container and got the following error:
```
Connection error: ('Unable to connect to any servers', {
    '127.0.0.1:9042': ConnectionRefusedError(111, "Tried connecting to
    [('127.0.0.1', 9042)]. Last error: Connection refused")})
```
Docker commands I ran:

```
docker compose up -d
docker ps
docker exec -it container-id cqlsh   # the error appears after this command
```
I have tried pulling several different Cassandra images, all of which produce the same error, and I have checked several sources on how to schedule this with Airflow in the container, to no avail.
I used the following docker-compose file:
```
version: '3'
networks:
  app-tier:
    driver: bridge
services:
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
    volumes:
      - ".:/opt/spark"
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    networks:
      - app-tier
  cassandra:
    image: 'bitnami/cassandra:latest'
    #image: docker.io/bitnami/cassandra:4.1
    #image: cassandra:latest
    ports:
      - '7000:7000'
      - '127.0.0.1:9042:9042'
    volumes:
      #- 'cassandra_data:/bitnami'
      - ".:/opt/cassandra"
    environment:
      - CASSANDRA_SEEDS=cassandra
      - CASSANDRA_PASSWORD_SEEDER=yes
      - CASSANDRA_PASSWORD=cassandra
    networks:
      - app-tier
```
Docker containers run in their own network, so when you connect to a container you need to specify which network to connect to. In your case, you've named the network app-tier, so specify it in your command with --network app-tier.
Additionally, you need to specify the name of the container you're connecting to, which you can find in the docker ps output.
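For example, you could run a throw-away cqlsh container attached to that network and point it at the cassandra service by name. Note that Compose prefixes network names with the project (directory) name, so the network below (`project_app-tier`) is an assumption on my part; check `docker network ls` for the exact name. The `cassandra`/`cassandra` credentials match what you set in your compose file.

```shell
# Hypothetical network name -- check `docker network ls` for the real one;
# Compose prefixes it with your project/directory name.
# The hostname `cassandra` resolves to your cassandra service on that network.
docker run --rm -it --network project_app-tier cassandra:latest \
  cqlsh -u cassandra -p cassandra cassandra 9042
```

Also keep in mind that Cassandra can take a minute or two to finish starting, so wait for `docker ps` to show the container as healthy before connecting.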
If you're interested, the Quickstart Guide on the official Apache Cassandra website has detailed steps for running Cassandra in Docker.
Finally, I'd highly recommend you spend some time learning Docker first so you understand the basics. Otherwise, you will be wasting a lot of time running into other simple issues unrelated to Cassandra or Spark. Cheers!