Machine A has the ability to access a SQL database and Machine B has the ability to access Google Drive. How do I make sure that a task is run on the correct machine if UploadToDrive depends on DownloadSQLData somewhere down the line?
Currently Machine A runs DoSomethingElseWithData and Machine B runs UploadToDrive a few minutes later. This works until the day Machine A is not running, at which point Machine B will attempt DownloadSQLData as an upstream dependency and fail.
import luigi


class DownloadSQLData(luigi.Task):
    # ...
    def run(self):
        # Only Machine A can do this
        # ...


class TransformData(luigi.Task):
    # ...
    def requires(self):
        return DownloadSQLData(date=self.date)


class UploadToDrive(luigi.Task):
    # ...
    def requires(self):
        return TransformData(date=self.date)

    def run(self):
        # Only Machine B can do this
        # ...


class DoSomethingElseWithData(luigi.Task):
    # ...
    def requires(self):
        return TransformData(date=self.date)
The SQL database from this example is, in reality, not a SQL database but an old system within our company. It does not fail gracefully when unauthorised users try to access it, and we'd like to avoid any attempts from Machine B to do so.
Luigi itself does not do this kind of scheduling: it does not assign particular tasks to particular machines, and it does not trigger tasks at a particular time. That being said, there are many ways to achieve what you want.
Solution 1: Introduce a Machine C that has access to Machines A and B. Using one of a number of SSH tools (https://wiki.python.org/moin/SecureShell), Machine C could run tasks to retrieve the data from A, transform it on C, and then transfer it to B before uploading.
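As a rough illustration of Solution 1, here is a minimal sketch that drives the whole flow from Machine C using only the standard library and the ssh/scp command-line clients. The host names, file paths and script names (export_sql_data.py, transform_data.py, upload_to_drive.py) are all hypothetical placeholders, and it assumes Machine C has key-based SSH access to both machines.

```python
import shlex
import subprocess

# Hypothetical host names; replace with the real addresses of A and B.
MACHINE_A = "machine-a"
MACHINE_B = "machine-b"


def ssh_argv(host, remote_argv):
    """Build the argv that runs remote_argv on host through ssh."""
    # ssh hands the command to a remote shell, so quote it as one string.
    return ["ssh", host, shlex.join(remote_argv)]


def build_steps(date):
    """Return the ordered commands Machine C runs for one date."""
    raw = f"/tmp/raw_{date}.csv"          # hypothetical file paths
    clean = f"/tmp/clean_{date}.csv"
    return [
        # 1. Only Machine A touches the SQL system.
        ssh_argv(MACHINE_A, ["python", "export_sql_data.py", "--date", date, "--out", raw]),
        # 2. Pull the export from A onto C.
        ["scp", f"{MACHINE_A}:{raw}", raw],
        # 3. Transform locally on C.
        ["python", "transform_data.py", "--in", raw, "--out", clean],
        # 4. Push the result from C to B.
        ["scp", clean, f"{MACHINE_B}:{clean}"],
        # 5. Only Machine B touches Google Drive.
        ssh_argv(MACHINE_B, ["python", "upload_to_drive.py", "--file", clean]),
    ]


def run_pipeline(date, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failure."""
    for argv in build_steps(date):
        runner(argv, check=True)
```

Calling run_pipeline("2024-01-01") on Machine C would execute the five steps in order. Because Machine B never runs the export step itself, it cannot end up contacting the old system, even when Machine A is down: the first ssh call simply fails and the pipeline stops.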
Solution 2: This solution is most likely too much work and/or infeasible. Set up Machines A, B and C under a cluster scheduler (something like Slurm, https://www.schedmd.com/) with C as the head scheduler, and declare A and B as particular types of resources (possibly SQL and GDrive). Then, from C, schedule Slurm tasks as Luigi jobs (https://github.com/pharmbio/sciluigi can help with this). Each Slurm task should specify the resources it needs. And that's it!
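To make Solution 2 a little more concrete, the sketch below builds sbatch submissions by hand rather than through sciluigi. It assumes the nodes have been labelled with the Slurm features SQL and GDrive in the cluster configuration, that the tasks live in a hypothetical module called pipeline_tasks, and that task outputs sit on storage visible to every node. Node features with --constraint are only one way to express this; generic resources are another.

```python
import shlex
import subprocess


def luigi_argv(task_name, date, module="pipeline_tasks"):
    """Command line for one Luigi task (the module name is hypothetical)."""
    return ["luigi", "--module", module, task_name, "--date", date]


def sbatch_argv(task_argv, feature, after_job_id=None):
    """Build an sbatch call that pins task_argv to nodes with a feature."""
    argv = ["sbatch", "--parsable", f"--constraint={feature}"]
    if after_job_id is not None:
        # Start only once the upstream job has finished successfully.
        argv.append(f"--dependency=afterok:{after_job_id}")
    argv.append(f"--wrap={shlex.join(task_argv)}")
    return argv


def submit(argv, runner=subprocess.run):
    """Submit a job and return its id; --parsable prints 'jobid[;cluster]'."""
    result = runner(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]


def submit_pipeline(date, runner=subprocess.run):
    """Submit the download on the SQL node, then the upload on the GDrive node."""
    download_id = submit(
        sbatch_argv(luigi_argv("DownloadSQLData", date), "SQL"), runner
    )
    upload_id = submit(
        sbatch_argv(luigi_argv("UploadToDrive", date), "GDrive", download_id),
        runner,
    )
    return download_id, upload_id
```

The afterok dependency matters here: the upload job is only released after the download job has succeeded, so by the time Luigi on the GDrive node checks the dependencies of UploadToDrive, the output of DownloadSQLData should already exist and will not be re-run there. TransformData has no machine restriction in the question, so in this sketch it would simply run on the GDrive node as part of the upload job.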