jupyter, dask, dask-distributed, dask-dataframe

Can't dd.read_sql on Jupyter, kernel crashes


I'm coming here because I don't understand my problem. I created a Dockerfile + Compose file that creates one Dask scheduler and two workers:

docker-compose.yaml:

version: '3.8'

services:
  dask-scheduler:
    build:
      context: .
      dockerfile: dask.Dockerfile
    command: ["dask", "scheduler", "--host", "0.0.0.0"]
    ports:
      - "50101:8786"
      - "50100:8787"
    networks:
      - default

  dask-worker:
    build:
      context: .
      dockerfile: dask.Dockerfile
    command: ["dask", "worker", "dask-scheduler:8786", "--memory-limit", "4G"]
    deploy:
      mode: replicated
      replicas: 2
    networks:
      - default

dask.Dockerfile:

FROM python:3.11.0-bullseye

RUN apt update -y && \
    apt upgrade -y

RUN apt-get install -y \
    rustc \
    libpq-dev

RUN pip install --upgrade pip

RUN pip install setuptools_rust

RUN pip install \
    dask[complete] \
    bokeh \
    lz4

EXPOSE 8786
EXPOSE 8787

When I connect a client from a notebook, I have no problem. I can even run a test with: client.submit(np.random.random, 2903192, pure=False).key
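For reference, the connection code looks roughly like this (a minimal sketch; the address assumes the notebook runs on the Docker host and uses the published 50101 -> 8786 port from the compose file):

import numpy as np
from dask.distributed import Client

# Scheduler reached through the published port (host 50101 -> container 8786).
client = Client("tcp://localhost:50101")

# The sanity-check task mentioned above: submit an impure task and print its key.
future = client.submit(np.random.random, 2903192, pure=False)
print(future.key)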

But when I try to call read_sql_table, the kernel crashes.

On the scheduler, I only get this:

dask-scheduler-1  | 2024-01-22 10:11:09,823 - distributed.scheduler - INFO - Receive client connection: Client-9063b1d4-b90e-11ee-9f28-a652689ec955
dask-scheduler-1  | 2024-01-22 10:11:09,824 - distributed.core - INFO - Starting established connection to tcp://192.168.65.1:56693
dask-scheduler-1  | 2024-01-22 10:11:12,921 - distributed.core - INFO - Connection to tcp://192.168.65.1:56693 has been closed.
dask-scheduler-1  | 2024-01-22 10:11:12,921 - distributed.scheduler - INFO - Remove client Client-9063b1d4-b90e-11ee-9f28-a652689ec955
dask-scheduler-1  | 2024-01-22 10:11:12,922 - distributed.scheduler - INFO - Close client connection: Client-9063b1d4-b90e-11ee-9f28-a652689ec955

Nothing is sent to any worker.

Here's the read_sql_table call:

import dask.dataframe as dd

df = dd.read_sql_table(
    table_name="table",
    index_col="stock_qty",
    con="postgresql+psycopg2://username:password@IP:PORT/RAW",
)
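One thing worth noting (an assumption on my part, not something the logs confirm): read_sql_table needs SQLAlchemy and the database driver importable in the notebook's environment, and eventually on the workers, while the Dockerfile above only installs dask[complete], bokeh, and lz4. A quick check from the notebook:

# Hypothetical sanity check: both imports must succeed in the kernel's
# environment for the postgresql+psycopg2 connection string to work.
import sqlalchemy
import psycopg2
print(sqlalchemy.__version__, psycopg2.__version__)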

Do you know what could be the problem?


Solution

  • I resolved the problem myself.

    I created a new Anaconda environment and the problem went away on its own, presumably because the old environment's packages no longer matched the cluster's.
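    If anyone else hits this, a kernel crash with no worker activity often points at a client/cluster version mismatch; a minimal way to check it (a sketch, assuming the same port mapping as above):

    from dask.distributed import Client

    client = Client("tcp://localhost:50101")  # scheduler address assumed from the compose file

    # Compares dask/distributed/python versions across client, scheduler,
    # and workers; check=True raises an error if they don't match.
    client.get_versions(check=True)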