I'm using mill
to build a pipeline that
Should the tasks associated with steps 2 and 3 be producing something analogous to PathRef
? If so, what? They aren't producing a file on the disk but nevertheless should not be repeated unless the inputs change. Similarly, tasks associated with step 3 should run if tasks in step 2 are run again.
I see in the documentation for targets that you can return a case class and that re-evaluation depends on the .hashCode
of the target's return value. But I'm not sure what to do with that information.
And a related question: Does mill
hash the code in each task? It seems to be doing the right thing if I change the code for one task but not others.
A (cached) task in mill is re-run, when the build file (build.sc
or it's dependencies/includes) or inputs/dependencies of that task change. Whenever you construct a PathRef
, a checksum of the path content is calculated and used as hashCode. This makes it possible to detect changes and only act if anything has changed.
Of course there are exceptions, e.g. input tasks (created with T.input
or T.sources
) and commands (created with T.command
) will always run.
It is in general a good idea to return something from a task. A simple String
or Int
will do, e.g. to show it in the shell with mill show myTask
or post-process it later. Although I think a task running something in an external database should be implemented as a command or input task (which might check, when running, if it really needs something to do), you can also implement it as cached task. But please keep in mind, that the state will only be correct if no other process/user is changing the database in between.
That aside, You could return a current database schema/data version or a last change date in step 2. Make sure, it changes whenever the database was modified. Each change of that return value would trigger step 3 and other dependent tasks. (Of course, step 3 needs to depend on step 2.) As a bonus, you could return the same (previous/old) value in case you did not change the database, which would then avoid any later work in dependent tasks.