makefile gnu-make

Run a rule but allow intervention


I'm writing a makefile.

I have a data artifact that I process with gpt. Because of token limits, I have to break the data into segments (rule segments), have each segment processed by AI (rule ai), and then assemble the results (rule assemble).

segments : 
    break --size 20M ./data.txt seg-%d.txt

ai : segments
    for f in seg-*; do
        # outputfile is ai-output-%d
        gpt $f
    done

assemble : ai
    concat --output final.txt -- ai-output-*

This makefile works when I call make assemble.

However, gpt is not very stable. Sometimes I have to call it again to get a better output. Then, if I call make assemble, make will re-run rule ai, and overwrite my preferred output.

I would like to use this makefile in two cases:

  1. When the folder is in a clean state, I call make assemble, and final.txt is created.
  2. When the folder is not clean, I manually rerun chatgpt and get a different ai output. Then I call make assemble, and final.txt is created from my revised output.

How can I change my makefile to allow these two cases?


Solution

  • There are several issues with your Makefile:

    1. By default the recipes (the commands of your rules) are executed by the sh shell, and for sh, break is an already defined command that breaks the enclosing loop. Unless you instruct make to use a different shell (by assigning the SHELL special variable), your first rule will likely fail.

    2. By default each line of a recipe is executed by a different invocation of the shell. Unless you write its recipe on a single line, your second rule will likely fail.

    3. Before passing a recipe to the shell, make expands it. So, if you want to use shell variables, as in your second rule, you must protect them from this first make expansion: use $$f, not $f. But as you will see below, you don't need a shell for loop to generate the ai-output-% files from the seg-%.txt files; make pattern rules were invented for exactly this kind of situation.

    4. make needs to know which files are generated by the rules in order to decide if they are up to date or not (by comparing the last modification times of target files and prerequisite files). If you hide this essential information from make, it cannot do its job properly, and it may rebuild what is up to date or not rebuild what is out of date.

    5. make runs in two phases. It builds the tree of dependencies during the first phase, and builds what needs to be built during the second phase. If the tree of dependencies changes during the second phase because files are created or deleted, make cannot update its plans; it is too late.

    To solve all these issues with GNU make [1], you could try something like:

    .PHONY: segments ai assemble clean
    
    SEGS := $(wildcard seg-*.txt)
    AIS  := $(patsubst seg-%.txt,ai-output-%,$(SEGS))
    
    segments: .segs.done
    .segs.done: ./data.txt
        rm -f seg-*.txt
        my_break --size 20M $< seg-%d.txt
        touch $@
    
    ai: $(AIS)
    
    $(AIS): ai-output-%: seg-%.txt
        gpt $<
    
    assemble: .segs.done
        $(MAKE) final.txt
    
    final.txt: $(AIS)
        concat --output $@ -- $^
    
    clean:
        rm -f seg-*.txt .segs.done ai-output-* final.txt
    

    Explanations, in the same order as the above list of issues:

    1. We use my_break instead of break.

    2. We don't use multi-line recipes. If we did, we would write each one on a single line (joining the commands with ;, &&, ||, |, etc. as needed). If the line became too long, we could split it with line continuations (a \ at the end of each line). Example:

      ai: segments
          for f in seg-*; do gpt $$f; done
      

      Or:

      ai: segments
          for f in seg-*; do \
              gpt $$f; \
          done
      

      But we don't need all this because we use a static pattern rule to tell make how to build the ai-output-N file from the seg-N.txt file:

      $(AIS): ai-output-%: seg-%.txt
          gpt $<
      

      A side benefit is that make can launch parallel jobs to build several ai-output-N files at a time (try make -j12 if you have 12 cores).

    3. We don't use shell variables in recipes. If we did, we would write $$ instead of $ (at least), as in the above example.

    4. We use phony targets to offer targets that are synonyms of a group of similar targets (e.g., make ai to build all ai-output-N), but we explicitly tell make what files are produced by each non-phony rule.

      There is only one exception: the my_break rule, for which we don't know which files are produced. To solve this problem we use a common trick: generate (or update) a dummy empty file (.segs.done) with a final touch to "remember" the last time we ran my_break. This way, by comparing the last modification times of data.txt and .segs.done, make knows whether my_break should be run again.

      Note that the recipe first deletes all existing seg-N.txt to avoid keeping around out of date files.

      Note also that, during the first phase of make, we compute the list of existing seg-N.txt files with wildcard and store it in the make variable SEGS. We also compute the list of corresponding ai-output-N files with patsubst and store it in the make variable AIS. We use these two variables in the rest of the makefile so that final.txt is built only from ai-output-N files that correspond to existing segments, and never from extra, out-of-date ai-output-N files.

    5. This one is probably the most difficult. To solve it we must first update the seg-N.txt files, in case data.txt changed, and then restart make from scratch so that it discovers the new situation during its first phase. We thus call make from make (recursive make [2]). When building assemble, make first updates the seg-N.txt files if needed (because .segs.done is a prerequisite) and then calls itself to finish the job, starting from a state where all seg-N.txt files are present and up to date.
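
    To see what make knows at the end of its first phase, a small debugging makefile can print the two lists while it is being parsed. This is only a sketch: the show target is a hypothetical name introduced for the demonstration, and $(info ...) is a GNU make function. Recipe lines must start with a tab character.

    ```makefile
    # Debugging sketch: both lists are computed once, while the makefile is parsed.
    SEGS := $(wildcard seg-*.txt)
    AIS  := $(patsubst seg-%.txt,ai-output-%,$(SEGS))

    # $(info ...) prints during the first phase, before any recipe runs.
    $(info SEGS = $(SEGS))
    $(info AIS  = $(AIS))

    # "show" is a hypothetical target used only for this demonstration.
    .PHONY: show
    show:
    	@echo "recipes run in the second phase; SEGS and AIS no longer change"
    ```

    In a clean folder, make show prints both lists empty. Once some seg-N.txt files exist, the same command lists them together with the matching ai-output-N names. Because the lists are fixed before any recipe runs, segments created by a recipe are only seen by the next make invocation, which is why the recursive call is needed.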

    If, between two runs of make assemble, you run gpt again to improve some ai-output-N, make notices it. It does not rebuild any seg-N.txt or ai-output-N because they are still newer than their prerequisites, but it rebuilds final.txt because it is now older than at least one of its prerequisites.

    There are some extra features like automatic variables ($<, $@, $^) or the use of $(MAKE) to call make. If needed you will find explanations in the GNU make manual.
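
    As a quick reminder of what these automatic variables stand for, here is a minimal sketch. The file names a.txt, b.txt and out.txt are hypothetical and only serve the demonstration; a.txt and b.txt must already exist.

    ```makefile
    # $@ is the target, $< the first prerequisite, $^ the list of all prerequisites.
    # a.txt, b.txt and out.txt are hypothetical names used for illustration.
    out.txt: a.txt b.txt
    	@echo "target:              $@"
    	@echo "first prerequisite:  $<"
    	@echo "all prerequisites:   $^"
    	cat $^ > $@
    ```

    Running make out.txt reports out.txt as the target, a.txt as the first prerequisite and a.txt b.txt as the full list, then concatenates the two inputs into out.txt.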

    As noted in comments you may want to add the .DELETE_ON_ERROR special target somewhere to automatically delete targets when the recipe that builds them fails. Don't rely too much on it, however, because the my_break rule does not explicitly list the real seg-N.txt targets. So, if my_break fails, only .segs.done will be deleted.
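
    A possible way to combine .DELETE_ON_ERROR with a manual cleanup of the segments is sketched below. The || { ...; exit 1; } fallback in the my_break recipe is an assumption added here, not part of the makefile above; it removes partial seg-N.txt files when my_break fails and still reports the failure to make.

    ```makefile
    # Sketch: ask make to delete a target that its failing recipe has modified.
    .DELETE_ON_ERROR:

    SEGS := $(wildcard seg-*.txt)
    AIS  := $(patsubst seg-%.txt,ai-output-%,$(SEGS))

    .segs.done: ./data.txt
    	rm -f seg-*.txt
    	# Assumption: clean up partial segments by hand, since make does not know them.
    	my_break --size 20M $< seg-%d.txt || { rm -f seg-*.txt; exit 1; }
    	touch $@

    # If gpt fails after writing part of ai-output-N, make removes that file.
    $(AIS): ai-output-%: seg-%.txt
    	gpt $<
    ```

    With this in place, a failed gpt run should not leave a truncated ai-output-N that later looks up to date.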


    [1] If your make is not GNU make there are probably a few things to adapt.

    [2] You may read here or elsewhere that "recursive make is considered harmful". Don't let these statements prevent you from using recursive make when it is really needed. Like many other features, recursive make can be harmful when misused, but it is essential in some cases, like yours.