I want to connect to postgres db using Dask where the database is behind an SSH tunnel. While there are methods to create SSH tunnels in Python, I haven't found a straightforward way to integrate this with Dask's connection mechanisms.
One potential solution is to locally port forward the database connection and then connect to localhost:port, where the tunnel is established. However, I'm uncertain about how this approach will function within a Dask cluster. Since the SSH tunnel is created on the node where the tunnel code is executed, it might not be accessible on Dask workers.
I'm currently using the
dd.read_sql_query(query, connection_string)
method for reading SQL queries in Dask. I'm considering whether I need to create the SSH tunnel on each worker node using
client.run(create_ssh_tunnel)
However, I'm unsure about how this will interact with auto-scaling. Specifically, during periods of high load when Dask workers autoscale, will they first create the tunnel on the worker nodes?
during periods of high load when Dask workers autoscale, will they first create the tunnel on the worker nodes
No, client.run
happens once when you call it, on the workers that are connected at the time. If you want each worker to do something at launch, you want a Nanny (if using) or Worker plugin, see https://distributed.dask.org/en/stable/plugins.html . You would define a setup() method calling your SSH creation function. You should ensure that you handle possible error cases (clean up on teardown, and cope with the tunnel already being open).
Minimal example:
class WorkerSSH(WorkerPlugin):
def setup(self, worker):
create_ssh_tunnel()
plugin = WorkerSSH(logging)
client.register_plugin(plugin)