mill

Should mill tasks with non-file products produce something analogous to PathRef?


I'm using mill to build a pipeline that

  1. cleans up a bunch of CSV files (producing new files)
  2. loads them into a database
  3. does more work in the database (create views, etc)
  4. runs queries to extract some files.

Should the tasks associated with steps 2 and 3 be producing something analogous to PathRef? If so, what? They aren't producing a file on the disk but nevertheless should not be repeated unless the inputs change. Similarly, tasks associated with step 3 should run if tasks in step 2 are run again.

I see in the documentation for targets that you can return a case class and that re-evaluation depends on the .hashCode of the target's return value. But I'm not sure what to do with that information.

And a related question: Does mill hash the code in each task? It seems to be doing the right thing if I change the code for one task but not others.


Solution

  • A cached task in mill is re-run when the build file (build.sc or its dependencies/includes) or the inputs/dependencies of that task change. Whenever you construct a PathRef, a checksum of the path's content is calculated and used as its hashCode. This makes it possible to detect changes and act only if something has changed.

    Of course there are exceptions, e.g. input tasks (created with T.input or T.sources) and commands (created with T.command) will always run.

    It is generally a good idea to return something from a task. A simple String or Int will do, e.g. to show it in the shell with mill show myTask or to post-process it later. Although I think a task that runs something in an external database should be implemented as a command or input task (which can check, when it runs, whether there is actually anything to do), you can also implement it as a cached task. But please keep in mind that the cached state will only be correct if no other process or user changes the database in between.

    That aside, you could return the current database schema/data version or a last-change date from step 2. Make sure it changes whenever the database is modified. Each change of that return value would trigger step 3 and other dependent tasks. (Of course, step 3 needs to depend on step 2.) As a bonus, you could return the same (previous) value if you did not change the database, which would then avoid any later work in dependent tasks.
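
To make the idea concrete, here is a rough sketch of steps 1 to 3 as a build.sc, assuming the Mill 0.x syntax (T { ... }, T.sources). The task names (rawCsv, cleanCsv, loadDb, createViews), the raw/ directory and the version-string format are made up for illustration, and the actual database calls are left as comments because they depend on your setup.

```scala
// build.sc -- a sketch only, assuming Mill 0.x syntax; all names are hypothetical
import mill._

// Source directory holding the raw CSV files (the "raw" folder is an assumption)
def rawCsv = T.sources { millSourcePath / "raw" }

// Step 1: produces files, so it returns a PathRef
def cleanCsv = T {
  val dirs = rawCsv()
  val out = T.dest / "clean"
  os.makeDir.all(out)
  for {
    dir <- dirs
    if os.exists(dir.path)
    file <- os.list(dir.path)
    if file.ext == "csv"
  } {
    // Placeholder "cleaning": trim each line and drop empty ones
    val cleaned = os.read.lines(file).map(_.trim).filter(_.nonEmpty)
    os.write(out / file.last, cleaned.mkString("\n"))
  }
  PathRef(out)
}

// Step 2: no file product; return a value that identifies the loaded data
def loadDb = T {
  val csv = cleanCsv()
  // ... load the files under csv.path into the database here ...
  // Using the PathRef checksum as a data version is one possible choice;
  // a schema version or last-change timestamp read from the database
  // would work as well.
  s"data-${csv.sig}"
}

// Step 3: depends on step 2, so it re-runs whenever loadDb's value changes
def createViews = T {
  val dataVersion = loadDb()
  // ... create views etc. in the database here ...
  s"views-for-$dataVersion"
}
```

With something like this, mill show createViews would print the returned string, and createViews is only re-evaluated when loadDb returns a different value (or when the build file changes). Step 4 could follow the pattern of cleanCsv: depend on createViews, write the extracted files under T.dest and return a PathRef.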