
/bin/sh^M: bad interpreter: No such file or directory error caused by different Git config files overriding each other?


A build script I wrote is failing on a CI/CD pipeline (which runs on Linux) because the build.sh script was somehow converted to, or saved in, CRLF format (based on what I gather online), leading to this error:

/bin/sh^M: bad interpreter: No such file or directory

The script itself is very basic:

#!/bin/sh
mvn clean install

I want to confirm that Git is the cause, based on what I see when running git config. Below are the remedial actions I took:

  1. Saving the file with LF line endings in my IDE (IntelliJ shows LF as the selected line ending with build.sh open).

  2. Configuring Git not to touch line endings or convert them to CRLF (I got a warning about this before): I ran git config --global core.autocrlf false and git config --global core.eol lf, then recloned the repository.

Here are my Git configs (local, global, and system-wide):

Local config:

        core.bare=false
        core.logallrefupdates=true
        core.symlinks=false
        core.ignorecase=true
        core.autocrlf=false

Global config:

http.sslverify=false
core.autocrlf=false
core.eol=lf

Running git config --list --show-origin:

file:"C:\\ProgramData/Git/config"       core.symlinks=false
file:"C:\\ProgramData/Git/config"       core.autocrlf=true
file:"C:\\ProgramData/Git/config"       core.fscache=true
file:C:/Users/testUser/.gitconfig    http.sslverify=false
file:C:/Users/testUser/.gitconfig    core.autocrlf=false
file:C:/Users/testUser/.gitconfig    core.eol=lf
file:.git/config        core.logallrefupdates=true
file:.git/config        core.symlinks=false
file:.git/config        core.ignorecase=true
file:.git/config        core.autocrlf=false
file:.git/config        core.eol=lf

I've removed lines that have no relevance to this issue. As you can see in the overall config output, there is a discrepancy between the configs: the system-level file sets core.autocrlf=true, while the global and local files set it to false. Could this be causing the issue of my shell script not running properly in other environments?
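For reference, a minimal sketch of how Git resolves a key that is defined in more than one file: the files are read from broadest to narrowest scope and the last value read wins. Writing the real system-level file needs administrator rights, so this sketch uses a throwaway "global" file as a stand-in for the broader scope. The sandbox directory, the g helper, and the fake home are hypothetical names used only for illustration. Run it as a script rather than pasting it into an interactive shell.

```shell
set -e
sandbox=$(mktemp -d)
fake_home="$sandbox/home"            # hypothetical stand-in for the user's home
mkdir -p "$fake_home" "$sandbox/repo"
cd "$sandbox/repo"

# Run git with a throwaway HOME so the real ~/.gitconfig is never touched.
g() { HOME="$fake_home" XDG_CONFIG_HOME="$fake_home/.config" git "$@"; }

g init -q .
g config --global core.autocrlf true    # broader scope says "true"
g config --local  core.autocrlf false   # repository scope says "false"

# --get prints the value Git actually uses: the last one read wins.
effective=$(g config --get core.autocrlf)
echo "effective core.autocrlf: $effective"

# --show-origin lists every definition together with the file it came from.
g config --show-origin --get-all core.autocrlf
```

With the output shown in the question, the same rule would make the repository-level core.autocrlf=false the effective value on that machine.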


Solution

  • Here are a few simple rules, although some of them are opinions:

    In the .gitattributes file, list the .sh files in question, or *.sh, along with the directives text eol=lf. List any other files that need special consideration too, while you're at it: *.jpg can have a binary directive, if you have JPG images in the repository; *.bat can be marked text eol=crlf; and so on.
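    As a rough sketch of the rules just described, the following creates a .gitattributes file in a scratch repository and asks Git which attributes it would apply to a few paths. The file names build.sh, run.bat, and logo.jpg are placeholders and do not need to exist for git check-attr to answer.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# One line per pattern: shell scripts LF, batch files CRLF, images untouched.
cat > .gitattributes <<'EOF'
*.sh  text eol=lf
*.bat text eol=crlf
*.jpg binary
EOF

# check-attr only matches path names against the patterns.
sh_eol=$(git check-attr eol -- build.sh)
bat_eol=$(git check-attr eol -- run.bat)
jpg_text=$(git check-attr text -- logo.jpg)
printf '%s\n' "$sh_eol" "$bat_eol" "$jpg_text"
```

    The binary macro unsets the text attribute, which is why Git reports text as unset for the JPG path.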

    This won't fix your existing problem; to do that, clone the repository, check out the bad commit at the tip of the current branch, modify the .sh file(s) to replace the existing CRLF line endings with LF-only line endings, and add and commit these files. (You can do this in the same commit in which you create the .gitattributes file.) If you have a reasonably modern Git, creating the .gitattributes file and then running git add --renormalize build.sh is supposed to do all of that (except the "create a new commit" step of course) in one fell swoop (or swell foop, if you're fond of Spoonerisms).
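    The repair described above might look like the sketch below, assuming a Git recent enough to have git add --renormalize. It reproduces the problem in a scratch repository first, so nothing here touches a real project; the identity and commit messages are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false            # keep Git from converting during setup
git config user.name  "Example User"      # placeholder identity for the demo commits
git config user.email "user@example.com"

# Reproduce the problem: commit build.sh with CRLF line endings.
printf '#!/bin/sh\r\nmvn clean install\r\n' > build.sh
git add build.sh
git commit -q -m "Add build.sh (CRLF by accident)"
before=$(git ls-files --eol build.sh)

# Repair: declare the rule, then re-run the conversion on the tracked file.
printf '*.sh text eol=lf\n' > .gitattributes
git add .gitattributes
git add --renormalize build.sh
git commit -q -m "Normalize build.sh to LF"
after=$(git ls-files --eol build.sh)

echo "before: $before"
echo "after:  $after"
```

    In the git ls-files --eol output, the i/ column describes the index copy and the w/ column the working tree copy. After the repair the index copy should be LF-only, while the working tree copy on this machine still has its old CRLF endings until it is checked out again.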

    What's going on here?

    Line-ending-fiddling in Git is an endless source of confusion. Part of the problem stems from the fact that people attempt to observe what's happening by inspecting the files in their working tree. That's akin to trying to figure out why the icemaker in your freezer isn't working by taking the trays out and putting them under extremely hot and bright lights, so that the plastic trays melt. If you do this, you are:

    That is, the problem is elsewhere, and by the time you get around to looking for it, it's long gone.

    To understand what's going on, and hence how and why anything that fixes the problem actually fixes the problem, you must first learn the Three Places Of Git where files can be found:

    That covers two places: so where is this "third place" I talk about? This is what Git calls, variously, the index, or the staging area, or—rarely these days—the cache. Git's index holds a third "copy" of every file. I put the word "copy" in quotes here because what's in the index is actually a sort of reference, using the de-duplication trick.

    Initially, when you first use git checkout or git switch to extract a particular commit from a repository you've just cloned, what Git does is:

    Note that before this step, Git's index was empty: it had no files in it at all. Now Git's index has every file from the current commit. These take no space: having come out of a commit, they are all already in the repository, so the index "copies" are de-duplicated and use no space to hold the data.[3]

    So: what's the point of this index / staging-area / cache? Well, one point is that it makes Git go fast. Another is that it lets you partially stage files (though I won't cover what that means here). But in fact, it's not strictly necessary: other version control systems get away without having one. It's just that Git not only has it, Git forces you to use it. So you need to know about it, if only to know that it places itself between you and your files—in your working tree—and the commits in the repository.

    By omitting a few details that eventually matter, but not yet, we can describe the index pretty well as your proposed next commit. That is, the index holds each file that will go into the next commit. These files are in Git's own format—compressed and de-duplicated—but, unlike the files inside a commit, you can replace them. You can't modify them (they're in the read-only format, and pre-de-duplicated), but you can run git add.

    The git add command reads the working tree copy of some file. This working tree copy is the version you see and work with. If you've changed it, git add reads the changed version.[4] The add command compresses this data down into Git's special internal format and checks to see if it's a duplicate. If it is a duplicate, Git throws out its compression result and re-uses the existing data, updating the index with the re-used file. If it's not a duplicate, Git saves the compressed and de-duplicated (but first time now) file data and updates the index with that.

    Either way, what's in the index now is the updated file. So the index now holds your proposed next commit. It held your proposed next commit before the git add too, but now your proposed next commit is updated. This tells us what the index is for from our point of view: The index holds your proposed next commit. You do not commit what is in your working tree. Instead, you commit what is in Git's index. This is why you need to know about the index: it's how Git makes new commits.
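    A small sketch of the index acting as the proposed next commit: editing the working tree copy leaves the index entry alone, and only git add replaces it. The file name notes.txt is a placeholder, and git ls-files -s is used here simply to read the blob ID the index currently holds.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false

printf 'version 1\n' > notes.txt
git add notes.txt
staged_v1=$(git ls-files -s notes.txt | cut -d' ' -f2)   # blob ID held in the index

# Edit the working-tree copy without running git add.
printf 'version 2\n' > notes.txt
still_v1=$(git ls-files -s notes.txt | cut -d' ' -f2)    # index entry is unchanged
worktree_v2=$(git hash-object notes.txt)                 # ID git add would store

# Only now does the proposed next commit pick up the edit.
git add notes.txt
staged_v2=$(git ls-files -s notes.txt | cut -d' ' -f2)

echo "index before edit: $staged_v1"
echo "index after edit:  $still_v1"
echo "index after add:   $staged_v2"
```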


    [1] The commits themselves are only permanent until you or Git remove them, but in a lot of cases that's "never". It's actually kind of hard to get rid of a Git commit, for many reasons. A file's data as stored in a commit, de-duplicated, remains in the repository until every commit that holds that file is removed, though.

    [2] The actual file storage format inside computers is itself amazingly complicated and varied. Some systems do case-preserving but case-folding in file names, for instance, so that README.md and ReadMe.md are "the same file", while others say that these are two different files. Git holds the latter opinion, and when the commit archive holds both a README.md and a ReadMe.md, and you extract that commit to your working tree, one of those files goes missing from your working tree, since it's physically incapable of holding both (because they have the "same name" as far as your computer is concerned). Because Git's archived files are in a special Git-only format, this is not a problem for Git itself. But it can be a huge headache for you.

    [3] The other properties stored in the index (such as the cache aspect, which helps Git go fast) do take a bit of space. The average tends to be somewhere close to 100 bytes per file, so unless you have a million files (which then needs ~100 MB of index), this is utterly trivial in modern systems where a chip the size of your fingernail provides 256 GB of storage.

    [4] If you haven't changed it, git add tries to skip reading it, to make Git go fast. This will cause us problems in a moment. So sometimes you may find it useful to trick Git into thinking you've changed it. You can do this by rewriting the file in place, or using the touch command if you have that, for instance. The --renormalize flag to git add is supposed to fix this as well, but I have seen people say it doesn't.


    How this relates to line endings

    Let's review quickly now:

    Note that there are steps here where Git does something very easy for Git: copying a commit into the index doesn't change any of the files at all, as they're still in the special read-only, Git-only format. Making a new commit doesn't change any of the files at all: they just get packaged up into a (read-only) commit, from the (replaceable but still read-only) "copies" in the index. But there are two steps where Git does something much harder: copying a file out of the index into the working tree, and copying a file from the working tree into the index.

    So it's copies in and out of the index where Git does CRLF line ending modification. Moreover, the "index -> working tree" step—which happens during git checkout, for instance—can only add CRs. It can't remove them. The "working tree -> index" step—which happens during git add, for instance—can only remove CRs, not add them.
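    Both directions can be watched in a scratch repository. This sketch assumes a *.bat text eol=crlf attribute, so that the behavior is the same on Linux and Windows; run.bat is a placeholder file name.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false
printf '*.bat text eol=crlf\n' > .gitattributes

# Working tree -> index: git add strips the CRs.
printf 'echo hello\r\n' > run.bat
git add .gitattributes run.bat
after_add=$(git ls-files --eol run.bat)

# Index -> working tree: delete the file and let Git re-create it;
# the CRs are added back on the way out.
rm run.bat
git checkout -- run.bat
cr_count=$(tr -cd '\r' < run.bat | wc -c | tr -d ' ')

echo "$after_add"
echo "carriage returns in re-created file: $cr_count"
```

    The i/lf entry in the output shows that the index copy never held the CR, even though both the original and the re-created working tree copies do.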

    This in turn means that, if you choose to start doing line ending transformation, the committed files inside the repository will eventually end up with LF-only line endings, over time. If some committed files have CRLF line endings now, they will, in those commits, have those endings forever, because no existing commit can be changed.

    Optimizations that get in the way

    Now we get to some of the optimizations:

    Suppose you check out some commit, say, deadbeef. It has 5923 files in it. Those files get "copied" into the index, which is really fast because these aren't real copies. But were there files in the index before? Say you had commit dadc0ffee out just before you switched to deadbeef. That commit had put 5752 files in the index, and then all you did was look at the working tree copies.

    Obviously these files aren't all the same, but what if 5519 of the files were the same, leaving only 233 files to change and 171 new files to create? For whatever reason, there are no files in dadc0ffee that aren't in deadbeef; there are only new files. Or maybe one file does go away, and Git will have to remove that one from the working tree and create 172 files. But either way, Git only needs to mess with 404 or 405 files in the working tree, not more than 5500. That's going to run about ten times faster.

    So, Git does that. If Git can, it doesn't touch files. It assumes that if file path/to/file.ext in the index in commit dadc0ffee has the same raw hash ID as file path/to/file.ext in the index in commit deadbeef, it does not have to do anything to the working tree copy.
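    The "same raw hash ID" test depends only on the stored bytes, which can be seen with git hash-object. In this sketch the script text from the question is hashed twice with LF endings and once with CRLF endings; reading from standard input without a path means no line ending conversion is applied.

```shell
set -e
# Identical bytes always produce the identical blob ID.
lf_a=$(printf '#!/bin/sh\nmvn clean install\n' | git hash-object --stdin)
lf_b=$(printf '#!/bin/sh\nmvn clean install\n' | git hash-object --stdin)

# The same text with CRLF endings is different data, hence a different ID.
crlf=$(printf '#!/bin/sh\r\nmvn clean install\r\n' | git hash-object --stdin)

echo "LF:   $lf_a"
echo "LF:   $lf_b"
echo "CRLF: $crlf"
```

    Two commits that store the same LF bytes for a path therefore share one ID for it, and Git sees no reason to rewrite the working tree copy, whatever line endings that copy currently has.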

    This assumption breaks down in the presence of CRLF line ending trickiness. If Git is supposed to do LF to CRLF modifications on the way out, but didn't for dadc0ffee, Git may skip doing it for deadbeef too.

    What this means is that whenever you change the CRLF line endings settings, you can end up with "wrong" line endings in your working tree. You can get around this by removing the working tree copy and then checking out the file again (with git restore or git reset --hard, for instance, though remember that git reset --hard loses uncommitted work!).
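    A sketch of that workaround, using a scratch repository in which the committed build.sh is LF-only but the working tree copy has been left with CRLF endings. The stale copy is simulated by overwriting the file; the identity and commit message are placeholders. On newer Git versions, git restore build.sh can stand in for the git checkout line.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false
git config user.name  "Example User"      # placeholder identity for the demo commit
git config user.email "user@example.com"

# Commit an LF-only build.sh together with the attribute that asks for LF.
printf '#!/bin/sh\nmvn clean install\n' > build.sh
printf '*.sh text eol=lf\n' > .gitattributes
git add .gitattributes build.sh
git commit -q -m "Add build.sh with LF endings"

# Simulate a stale working-tree copy that still has CRLF endings.
printf '#!/bin/sh\r\nmvn clean install\r\n' > build.sh
stale=$(git ls-files --eol build.sh)

# Remove the copy and extract it again so Git redoes the index -> working tree step.
rm build.sh
git checkout -- build.sh
fresh=$(git ls-files --eol build.sh)

echo "stale: $stale"
echo "fresh: $fresh"
```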

    Meanwhile, if you run git add on some file, and Git thinks that the cached index copy is up to date—because you haven't edited the working tree copy, for instance—Git will silently do nothing at all. But if the working tree copy has CRLF line endings, and the index (and hence future commit) copy shouldn't, this is wrong. Using git add --renormalize is supposed to get around it, or you can "touch" the file so that Git sees a newer working-tree time stamp and will redo the copy. Or, you can even run git rm --cached on the file, and then git add really does have to copy it, because there's no longer a copy of that file in the index at all.
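    The git rm --cached route can be sketched as follows, again in a scratch repository that starts with a CRLF build.sh already committed. The identity and commit message are placeholders. Git may print a warning that CRLF will be replaced by LF; that is expected here.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config core.autocrlf false
git config user.name  "Example User"      # placeholder identity for the demo commit
git config user.email "user@example.com"

printf '#!/bin/sh\r\nmvn clean install\r\n' > build.sh
git add build.sh
git commit -q -m "Add build.sh (CRLF by accident)"

# Add the rule after the fact; the index entry for build.sh is now out of date.
printf '*.sh text eol=lf\n' > .gitattributes

# Drop only the index entry (the working-tree file stays in place),
# so the following git add has to copy the file into the index again.
git rm --cached -q build.sh
git add .gitattributes build.sh
readded=$(git ls-files --eol build.sh)

echo "$readded"
```

    The index copy should now be reported as i/lf, ready to be committed, while the working tree copy keeps its CRLF endings until it is extracted again.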

    Summary: the reason for the "simple rules" above

    Using a .gitattributes file entry gives Git the most chance to get things right: Git can tell if the .gitattributes file entry affects some particular file. That gives Git the opportunity to do better cache checking, for instance. (Git currently doesn't use this opportunity properly, I think, but at least it offers the possibility.)

    When you do use .gitattributes entries, they tell Git multiple things:

    This lets you say that *.bat files need to be CRLF-ended in the working tree, even on a Linux system, and *.sh files need to be LF-ended in the working tree, even on a Windows system.

    You get as much control as Git is willing to give you:

    The one thing you lose is the easy and global effect of core.eol and core.autocrlf: these affect how files from existing commits are handled as well, and tell Git to guess whether each file is text. As long as Git guesses right, that tends to work sort-of-OK. It's when Git guesses wrong that things go really bad. But because these settings affect every file extraction (index-to-work-tree) and every git add (work-tree-to-index) that actually happens, and it's hard to know which ones happen, it's very hard to see what's going on.