Tags: haskell, caching, nix, shake, shake-build-system

Haskell Shake build: how can I set up a shared cache folder using shakeShare and/or shakeCloud?


I understand this is a new feature being developed for GHC's Hadrian build system, so the workflow might be advanced, oddly specific, or still evolving. I have read a few write-ups on it so far.

It sounds like it should work for my use case: a domain-specific language for bioinformatics that would benefit greatly from caching comparisons between large genomes. I would be happy working from a basic minimal example or description of where to look. But I've included more details about my program too in case they make it easier...

The shortcut interpreter builds lots of artifacts with names derived from hashes of their inputs (somewhat Nix-like), and theoretically they should be portable across machines or even operating systems. A small program run might generate files + symlinks like this:

~/.shortcut
├── cache
│   ├── each
│   │   ├── 59e0192b1f
│   │   │   ├── 2633d268bf.ndb -> ../../../exprs/makeblastdb_nucl/2633d268bf_0.ndb
│   │   │   └── 2633d268bf.ndb.args -> ../../../cache/lines/3428ab5186.txt
│   │   ├── 623e07ac5b
│   │   │   ├── b9361606af.str.list -> ../../../exprs/extract_queries/b9361606af_0.str.list
│   │   │   └── b9361606af.str.list.args -> ../../../cache/lines/12ae82a598.txt
│   │   └── f477cfe47b
│   │       ├── 35e97350d8.bht -> ../../../exprs/blastn/ce1c174684_420e4f7fdf_35e97350d8_0.bht
│   │       └── 35e97350d8.bht.args -> ../../../cache/lines/9491bbe6a0.txt
│   ├── lines
│   │   ├── 0094e500eb.txt
│   │   ├── 12ae82a598.txt
│   │   ├── 246ddae0d8.txt
│   │   ├── 3428ab5186.txt
│   │   ├── 46767f8ae8.txt
│   │   ├── 5d54256d91.txt
│   │   ├── 61a97fd32d.txt
│   │   ├── 6de4b9ad67.txt
│   │   ├── 778251fd80.txt
│   │   ├── 81f7f42c42.txt
│   │   ├── 91ce94df26.txt
│   │   ├── 9491bbe6a0.txt
│   │   ├── b575c745e6.txt -> ../../cache/lines/6de4b9ad67.txt
│   │   ├── f094bac04c.txt
│   │   └── fcfb7a47a6.txt
│   ├── load
│   │   ├── 1e7afd22cf.gbk -> /home/jefdaj/shortcut/data/Mycoplasma_bovis_HB0801-P115.gbk
│   │   └── 28ce925871.gbk -> /home/jefdaj/shortcut/data/Mycoplasma_genitalium_M2321.gbk
│   ├── makeblastdb
│   │   └── 6de4b9ad67
│   │       ├── 6de4b9ad67.ndb.err
│   │       ├── 6de4b9ad67.ndb.nhr
│   │       ├── 6de4b9ad67.ndb.nin
│   │       ├── 6de4b9ad67.ndb.nsq
│   │       └── 6de4b9ad67.ndb.out
│   └── seqio
├── exprs
│   ├── any
│   │   └── e5677b1051_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   ├── blastn
│   │   ├── ce1c174684_420e4f7fdf_35e97350d8_0.bht -> ../../exprs/blastn/ce1c174684_420e4f7fdf_35e97350d8_0.bht.out
│   │   ├── ce1c174684_420e4f7fdf_35e97350d8_0.bht.out
│   │   └── ce1c174684_420e4f7fdf_35e97350d8_0.bht.out.err
│   ├── blastn_db
│   │   ├── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht -> ../../exprs/blastn_db/46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out
│   │   ├── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out
│   │   └── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out.err
│   ├── blastn_each
│   │   └── 46e62edec1_420e4f7fdf_2943ae4ea3_0.bht.list -> ../../cache/lines/12ae82a598.txt
│   ├── extract_queries
│   │   ├── 53376e198d_0.str.list -> ../../cache/lines/246ddae0d8.txt
│   │   ├── 53376e198d_0.str.list.tmp
│   │   ├── 53376e198d_0.str.list.tmp.err
│   │   ├── b9361606af_0.str.list -> ../../cache/lines/246ddae0d8.txt
│   │   ├── b9361606af_0.str.list.tmp
│   │   └── b9361606af_0.str.list.tmp.err
│   ├── extract_queries_each
│   │   └── d724d35317_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   ├── gbk_to_fna
│   │   ├── 262cf7e4e4_a355cc10e8_0.fna
│   │   └── 262cf7e4e4_cdab12f059_0.fna
│   ├── list
│   │   ├── 3bbbf950a3_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   │   ├── 65127d0127_0.fna.list -> ../../cache/lines/6de4b9ad67.txt
│   │   └── df91bd4d94_0.str.list.list.list -> ../../cache/lines/46767f8ae8.txt
│   ├── load_fna
│   ├── load_gbk
│   │   ├── 15e3d91521_0.gbk -> ../../cache/load/28ce925871.gbk
│   │   └── 74e27ec9a5_0.gbk -> ../../cache/load/1e7afd22cf.gbk
│   ├── makeblastdb_nucl
│   │   ├── 2633d268bf_0.ndb -> ../../cache/lines/81f7f42c42.txt
│   │   └── 954abc5fe7_0.ndb -> ../../cache/lines/81f7f42c42.txt
│   ├── makeblastdb_nucl_each
│   │   └── 209fe6406d_0.ndb.list -> ../../cache/lines/5d54256d91.txt
│   ├── num
│   │   └── a53b190835_0.num -> ../../cache/lines/778251fd80.txt
│   ├── singletons
│   │   └── 954abc5fe7_0.fna.list.list -> ../../cache/lines/0094e500eb.txt
│   └── str
│       ├── 90811d06ee_0.str -> ../../cache/lines/f094bac04c.txt
│       ├── b4da62b027_0.str -> ../../cache/lines/fcfb7a47a6.txt
│       └── b81c880be5_0.str -> ../../cache/lines/61a97fd32d.txt
├── profile.html
└── vars
    ├── mapped.str.list.list -> ../exprs/extract_queries_each/d724d35317_0.str.list.list
    ├── mbov.fna -> ../exprs/gbk_to_fna/262cf7e4e4_a355cc10e8_0.fna
    ├── mgen.fna -> ../exprs/gbk_to_fna/262cf7e4e4_cdab12f059_0.fna
    ├── result -> ../exprs/any/e5677b1051_0.str.list.list
    └── single.str.list -> ../exprs/extract_queries/53376e198d_0.str.list

27 directories, 64 files

I tried adding shakeShare = Just "sharedir" to my Shake options. When I run the build, delete all artifacts, and re-run, it fails to find a cached file:

error! Error when running Shake build system:
  at want, called at ./ShortCut/Core/Eval.hs:253:7 in main:ShortCut.Core.Eval
* Depends on: eval
  at need, called at ./ShortCut/Core/Eval.hs:256:25 in main:ShortCut.Core.Eval
* Depends on: /root/.shortcut/vars/result
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/all/b2ba759dce_0.str.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/any/0a89231af6_0.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/list/df91bd4d94_0.str.list.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/vars/mapped.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/extract_queries_each/730901fdda_0.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/tblastn_each/46e62edec1_183f436b85_a5a22079c6_0.bht.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/num/a53b190835_0.num
* Raised the exception:
/home/jefdaj/shortcut/sharedir/.shake.cache/2faae061b9976bed/0x134125AC: getPermissions:getFileStatus: does not exist (No such file or directory)

That was expected, but now how do I go about fixing it? The hashes should be stable because all symlinks are relative to the top level tmpdir (~/.shortcut here). And Shake should know about each of them since I make sure to call trackWrite.

Do I need to explicitly add all files to a cache using newCache, newCacheIO, and/or an oracle as they're built, or am I missing something simpler?

Ideally I would have a shared directory to cache everything done on the demo server, and also give users the option to connect their own instances to it like a Nix binary cache.
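For reference, the shakeShare change mentioned above amounts to roughly the following sketch. Only shakeShare and the "sharedir" value come from my setup; the target name, the rule body, and the names opts and buildWithSharedCache are placeholders for illustration.

```haskell
import Development.Shake

-- Default options plus a shared cache directory ("sharedir" as tried above).
opts :: ShakeOptions
opts = shakeOptions { shakeShare = Just "sharedir" }

-- Placeholder build: a single file rule, standing in for the real rules.
buildWithSharedCache :: IO ()
buildWithSharedCache = shake opts $ do
  want ["out.txt"]                 -- placeholder target
  "out.txt" %> \out ->
    writeFile' out "placeholder"   -- placeholder rule body
```

Calling buildWithSharedCache from main runs the build with these options.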


Solution

  • I misunderstood what Shake caches (newCache) are for: I assumed they cache build artifacts indexed by all their inputs, but they actually just cache the reading and processing of files used to make decisions while producing those artifacts, as in this example from the docs:

    digits <- newCache $ \file -> do
        src <- readFile' file
        return $ length $ filter isDigit src
    "*.digits" %> \x -> do
        v1 <- digits (dropExtension x)
        v2 <- digits (dropExtension x)
        writeFile' x $ show (v1,v2)
    

    As it says, "This function is useful when creating files that store intermediate values, to avoid the overhead of repeatedly reading from disk, particularly if the file requires expensive parsing." I think the cache here is equivalent to writing an intermediate .digits file and needing it twice.

    I implemented what I was looking for separately by writing a need' function that checks if a file is in the cache (local or remote), fetches it if possible, and then calls Development.Shake.need. Depending on the build system, it might also be possible to do this kind of "caching" with something like rsync.

    The Shake cache still looks very useful for the part of my code that needs to read a list of all sequence IDs into memory once per program run and refer to them later from various rules.
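The need' wrapper described above could look something like the following. This is a minimal sketch, not my actual code: it assumes the cache is just another directory with the same relative layout as the tmpdir, and the names fetchFromCache, cacheDir, tmpDir, and rel are made up for illustration. A remote cache would replace the plain file copy with a download.

```haskell
import Development.Shake (Action, liftIO, need)
import System.Directory (copyFile, createDirectoryIfMissing, doesFileExist)
import System.FilePath (takeDirectory, (</>))

-- | Copy one file from the cache directory into the local tmpdir, but only
--   if it is missing locally and present in the cache.
--   Returns True when a copy was made.
--   'rel' is a path relative to both roots (hypothetical layout).
fetchFromCache :: FilePath -> FilePath -> FilePath -> IO Bool
fetchFromCache cacheDir tmpDir rel = do
  let local  = tmpDir   </> rel
      cached = cacheDir </> rel
  haveLocal  <- doesFileExist local
  haveCached <- doesFileExist cached
  if not haveLocal && haveCached
    then do
      createDirectoryIfMissing True (takeDirectory local)
      copyFile cached local
      return True
    else return False

-- | Try the cache for every path first, then hand over to Shake's own 'need'.
need' :: FilePath -> FilePath -> [FilePath] -> Action ()
need' cacheDir tmpDir rels = do
  liftIO $ mapM_ (fetchFromCache cacheDir tmpDir) rels
  need [tmpDir </> rel | rel <- rels]
```

The sketch only covers the fetch step. copyFile copies file contents, so relative symlinks like the ones in the tree above would need separate handling, and each rule still has to tolerate finding its output already in place.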