A build script I wrote is failing in a CI/CD pipeline (which runs on Linux) because the build.sh script somehow got converted to, or saved in, CRLF format (based on what I gathered online), leading to this error:
/bin/sh^M: bad interpreter: No such file or directory
The script itself is very basic:
#!/bin/sh
mvn clean install
I want to confirm whether Git is the cause, given what I see when running git config. To remedy the problem, I ran git config --global core.autocrlf false and git config --global core.eol lf, and then recloned the repository.
Here are my Git configs (local, global, and system-wide).
Local config:
core.bare=false
core.logallrefupdates=true
core.symlinks=false
core.ignorecase=true
core.autocrlf=false
Global config:
http.sslverify=false
core.autocrlf=false
core.eol=lf
Running git config --list --show-origin gives:
file:"C:\\ProgramData/Git/config" core.symlinks=false
file:"C:\\ProgramData/Git/config" core.autocrlf=true
file:"C:\\ProgramData/Git/config" core.fscache=true
file:C:/Users/testUser/.gitconfig http.sslverify=false
file:C:/Users/testUser/.gitconfig core.autocrlf=false
file:C:/Users/testUser/.gitconfig core.eol=lf
file:.git/config core.logallrefupdates=true
file:.git/config core.symlinks=false
file:.git/config core.ignorecase=true
file:.git/config core.autocrlf=false
file:.git/config core.eol=lf
I've removed lines that have no relevance to this issue. As you can see, the overall config output shows a discrepancy between the config levels. Could this be why my shell script does not run properly in other environments?
Here are a few simple rules, although some of them are opinions:
core.eol is not needed; don't bother with it.
core.autocrlf should always be false.
If you edit *.sh files on a Windows system and thereby insert CRLF line endings into them, use .gitattributes to correct this.
In the .gitattributes file, list the .sh files in question, or *.sh, along with the directives text eol=lf. List any other files that need special consideration too, while you're at it: *.jpg can have a binary directive, if you have JPG images in the repository; *.bat can be marked text eol=crlf; and so on.
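As a rough sketch of such a file, assuming a repository that holds shell scripts, batch files, and JPG images (the patterns are illustrative and should be adjusted to what your repository actually contains):

```text
# .gitattributes, placed at the top level of the repository
*.sh   text eol=lf
*.bat  text eol=crlf
*.jpg  binary
```

Each line pairs a path pattern with the attributes described above; files that match no pattern fall back to whatever the configuration settings dictate.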
This won't fix your existing problem; to do that, clone the repository, check out the bad commit at the tip of the current branch, modify the .sh file(s) to replace the existing CRLF line endings with LF-only line endings, and add and commit these files. (You can do this in the same commit in which you create the .gitattributes file.) If you have a reasonably modern Git, creating the .gitattributes file and then running git add --renormalize build.sh is supposed to do all of that (except the "create a new commit" step of course) in one fell swoop (or swell foop, if you're fond of Spoonerisms).
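The manual "replace CRLF with LF" step can be sketched as a small shell helper. The function name crlf_to_lf is made up for this example, and the sketch assumes the file is plain text with no legitimate lone CR bytes in it. The demo runs on a throwaway copy of the script from the question rather than on a real repository file.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a file in place with every CR byte removed.
# Assumption: the file is a text file that contains no intentional CRs.
crlf_to_lf() {
    tmp=$(mktemp) || return 1
    # tr deletes every carriage return; writing back with cat (instead of mv)
    # keeps the original file's permissions, such as the executable bit.
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a temporary file with the same content as the question's build.sh.
demo=$(mktemp)
printf '#!/bin/sh\r\nmvn clean install\r\n' > "$demo"
crlf_to_lf "$demo"

# Count the CR bytes that remain after the conversion.
cr_count=$(tr -cd '\r' < "$demo" | wc -c | tr -d ' ')
echo "CR bytes left: $cr_count"
```

After converting the real file this way, git add and git commit record the LF-only version.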
Line-ending-fiddling in Git is an endless source of confusion. Part of the problem stems from the fact that people attempt to observe what's happening by inspecting the files in their working tree. That's akin to trying to figure out why the icemaker in your freezer isn't working by taking the trays out and putting them under extremely hot and bright lights, so that the plastic trays melt. If you do this, you are:
That is, the problem is elsewhere, and by the time you get around to looking for it, it's long gone.
To understand what's going on, and hence how and why anything that fixes the problem actually fixes the problem, you must first learn the Three Places Of Git where files can be found:
Files are stored, permanently[1] and immutably, inside commits, in a special, read-only, Git-only, compressed and de-duplicated form. Each commit acts as an archive (kind of like a tar or zip archive) of every file as of the state that file had at the time you committed it.
Because of the special properties of these files, they literally cannot be used by your computer, except by Git itself. They must therefore be extracted, like un-archiving an archive with tar -x or unzip.
Files are stored in a usable form, as everyday files, in your working tree. This is where the extracted (unzipped, or whatever) files wind up. These files are not actually in Git at all. They are there for you to use as inputs and/or outputs, and your working tree is just an ordinary set of folders (or directories, whichever term you prefer) and files, stored in the way that is ordinary for your particular computer.[2]
That covers two places: so where is this "third place" I talk about? This is what Git calls, variously, the index, or the staging area, or—rarely these days—the cache. Git's index holds a third "copy" of every file. I put the word "copy" in quotes here because what's in the index is actually a sort of reference, using the de-duplication trick.
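Initially, when you first use git checkout or git switch to extract a particular commit from a repository you've just cloned, what Git does is copy every file from that commit into Git's index, and then expand each of those files into your working tree.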
Note that before this step, Git's index was empty: it had no files in it at all. Now Git's index has every file from the current commit. These take no space: they're de-duplicated and, having come out of a commit, they're all already in the repository, so they are duplicates and use no space to hold the data.[3]
So: what's the point of this index / staging-area / cache? Well, one point is that it makes Git go fast. Another is that it lets you partially stage files (though I won't cover what that means here). But in fact, it's not strictly necessary: other version control systems get away without having one. It's just that Git not only has it, Git forces you to use it. So you need to know about it, if only to know that it places itself between you and your files—in your working tree—and the commits in the repository.
By omitting a few details that eventually matter, but not yet, we can describe the index pretty well as your proposed next commit. That is, the index holds each file that will go into the next commit. These files are in Git's own format (compressed and de-duplicated) but, unlike the files inside a commit, you can replace them. You can't modify them in place (they're in the read-only format, and pre-de-duplicated), but you can replace them by running git add.
The git add command reads the working tree copy of some file. This working tree copy is the version you see and work with. If you've changed it, git add reads the changed version.[4] The add command compresses this data down into Git's special internal format and checks to see if it's a duplicate. If it is a duplicate, Git throws out its compression result and re-uses the existing data, updating the index with the re-used file. If it's not a duplicate, Git saves the compressed and de-duplicated (but first time now) file data and updates the index with that.
Either way, what's in the index now is the updated file. So the index now holds your proposed next commit. It held your proposed next commit before the git add too, but now your proposed next commit is updated. This tells us what the index is for from our point of view: The index holds your proposed next commit. You do not commit what is in your working tree. Instead, you commit what is in Git's index. This is why you need to know about the index: it's how Git makes new commits.
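A quick way to see that the commit comes from the index and not from the working tree is the following sketch. It builds a throwaway repository in a temporary directory; the file name note.txt and the demo identity passed to git commit are made up for the example.

```shell
#!/bin/sh
# Throwaway repository; path, file name, and identity are demo assumptions.
repo=$(mktemp -d)
git init -q "$repo"

printf 'staged\n' > "$repo/note.txt"
git -C "$repo" add note.txt                 # copy "staged" into the index

printf 'edited later\n' > "$repo/note.txt"  # change the working tree only

# The commit packages up what is in the index, not the working tree.
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'first commit'

committed=$(git -C "$repo" show HEAD:note.txt)
working=$(cat "$repo/note.txt")
echo "committed:    $committed"
echo "working tree: $working"
```

The committed content is the version that was staged with git add; the later edit stays in the working tree until it is added in turn.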
[1] The commits themselves are only permanent until you or Git remove them, but in a lot of cases that's "never". It's actually kind of hard to get rid of a Git commit, for many reasons. A file's data as stored in a commit, de-duplicated, remains in the repository until every commit that holds that file is removed, though.
[2] The actual file storage format inside computers is itself amazingly complicated and varied. Some systems do case-preserving but case-folding in file names, for instance, so that README.md and ReadMe.md are "the same file", while others say that these are two different files. Git holds the latter opinion, and when the commit archive holds both a README.md and a ReadMe.md, and you extract that commit to your working tree, one of those files goes missing from your working tree, since it's physically incapable of holding both (because they have the "same name" as far as your computer is concerned). Because Git's archived files are in a special Git-only format, this is not a problem for Git itself. But it can be a huge headache for you.
[3] The other properties stored in the index (such as the cache aspect, which helps Git go fast) do take a bit of space. The average tends to be somewhere close to 100 bytes per file, so unless you have a million files (which then needs ~100 MB of index), this is utterly trivial in modern systems where a chip the size of your fingernail provides 256 GB of storage.
[4] If you haven't changed it, git add tries to skip reading it, to make Git go fast. This will cause us problems in a moment. So sometimes you may find it useful to trick Git into thinking you've changed it. You can do this by rewriting the file in place, or using the touch command if you have that, for instance. The --renormalize flag to git add is supposed to fix this as well, but I have seen people say it doesn't.
Let's review quickly now:
Every commit contains files-as-a-snapshot, in a frozen (read-only), compressed, de-duplicated format. Nothing, not even Git itself, can ever change any part of any commit.
Git makes new commits from whatever is in Git's index. Git fills in the index from a commit when you check out the commit, and builds the new commit from whatever is in its index at the time you run git commit.
Your working tree lets you see what came out of a commit: the files come out of the commit, go into Git's index, and then get copied and expanded to become ordinary files in your working tree. Your working tree lets you control what goes into a new commit: you run git add and the data get compressed, de-duplicated, and generally Git-ified and put into the index, ready to be committed.
Note that there are steps here where Git does something very easy for Git: copying a commit into the index doesn't change any of the files at all, as they're still in the special read-only, Git-only format. Making a new commit doesn't change any of the files at all: they just get packaged up into a (read-only) commit, from the (replaceable but still read-only) "copies" in the index. But there are two steps where Git does something much harder:
As a file gets copied out of the index to your working tree, it gets expanded and transformed. Git has to change from compressed bytes to uncompressed bytes. This is an ideal time to change LF-only to CRLF and this is when Git will do that, if Git does it at all.
As a file gets copied from the working tree to be compressed and Git-ified and checked if it's a duplicate, Git has to change from uncompressed bytes to compressed ones. This is an ideal time to change CRLF to LF-only and this is when Git will do that, if Git does it at all.
So it's copies in and out of the index where Git does CRLF line ending modification. Moreover, the "index -> working tree" step (which happens during git checkout, for instance) can only add CRs. It can't remove them. The "working tree -> index" step (which happens during git add, for instance) can only remove CRs, not add them.
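The "working tree -> index" direction can be observed with git ls-files --eol, which reports the line endings of the index copy (the i/ column) and the working tree copy (the w/ column) separately. This is only a sketch in a throwaway repository; the file content mirrors the question's build.sh, and the local core.autocrlf setting is there so that a machine-wide setting does not interfere with the demo.

```shell
#!/bin/sh
# Throwaway repository used only for this demonstration.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config core.autocrlf false

# Tell Git that *.sh files are text and must be LF-ended.
printf '*.sh text eol=lf\n' > "$repo/.gitattributes"

# Working tree copy with CRLF line endings, as a Windows editor might save it.
printf '#!/bin/sh\r\nmvn clean install\r\n' > "$repo/build.sh"

# git add strips the CRs on the way into the index
# (stderr is silenced only to hide Git's "CRLF will be replaced" warning).
git -C "$repo" add .gitattributes build.sh 2>/dev/null

# Compare the index copy (i/...) with the working tree copy (w/...).
eol_info=$(git -C "$repo" ls-files --eol build.sh)
echo "$eol_info"
```

The i/ column should report lf while the w/ column still reports crlf: the file on disk is untouched, but the proposed next commit already has LF-only line endings.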
This in turn means that, if you choose to start doing line ending transformation, the committed files inside the repository will eventually end up with LF-only line endings, over time. If some committed files have CRLF line endings now, they will, in those commits, have those endings forever, because no existing commit can be changed.
Now we get to some of the optimizations:
When checking out a commit, Git tries hard not to touch the working tree if possible. This is slow! Let's not do it if we don't have to.
When using git add, Git tries hard not to touch the index if possible. It's too slow!
Suppose you check out some commit, say, deadbeef. It has 5923 files in it. Those files get "copied" into the index, which is really fast because these aren't real copies. But were there files in the index before? Say you had commit dadc0ffee out just before you switched to deadbeef. That commit had put 5752 files in the index, and then all you did was look at the working tree copies.
Obviously these files aren't all the same, but what if 5519 of the files were the same, leaving only 233 files to change and 171 new files to create? For whatever reason, there are no files in dadc0ffee that aren't in deadbeef; there are only new files. Or maybe one file does go away and Git will have to remove that one from the working tree and create 172 files. But either way, Git only needs to mess with 404 or 405 files in the working tree, not more than 5500. That's going to run about ten times faster.
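The file counts in this example can be checked with a few lines of shell arithmetic. The variable names are invented for the sketch; the numbers are the ones used above, for the case where no file is removed.

```shell
#!/bin/sh
old_total=5752   # files in dadc0ffee (already in the index and working tree)
new_total=5923   # files in deadbeef (the commit being checked out)
unchanged=5519   # files that are identical in both commits

changed=$((old_total - unchanged))   # existing files Git must rewrite
created=$((new_total - old_total))   # new files Git must create
touched=$((changed + created))       # total working tree files to touch
skipped=$((new_total - touched))     # files Git can leave alone

echo "changed=$changed created=$created touched=$touched skipped=$skipped"
```

This prints changed=233 created=171 touched=404 skipped=5519, which matches the figures in the text.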
So, Git does that. If Git can, it doesn't touch files. It assumes that if file path/to/file.ext in the index in commit dadc0ffee has the same raw hash ID as file path/to/file.ext in the index in commit deadbeef, it does not have to do anything to the working tree copy.
This assumption breaks down in the presence of CRLF line ending trickiness. If Git is supposed to do LF to CRLF modifications on the way out, but didn't for dadc0ffee, Git may skip doing it for deadbeef too.
What this means is that whenever you change the CRLF line endings settings, you can end up with "wrong" line endings in your working tree. You can get around this by removing the working tree copy and then checking out the file again (with git restore or git reset --hard, for instance, though remember that git reset --hard loses uncommitted work!).
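Here is a sketch of that "remove and re-extract" workaround in a throwaway repository. The file run.bat and its content are made up, and the example uses git checkout -- run.bat, the older spelling of git restore run.bat, on the assumption that it is available in more Git versions.

```shell
#!/bin/sh
# Throwaway repository with one LF-ended file committed.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config core.autocrlf false
printf 'echo hello\n' > "$repo/run.bat"
git -C "$repo" add run.bat
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'add run.bat'

# Change the line-ending rules after the file was already extracted.
printf '*.bat text eol=crlf\n' > "$repo/.gitattributes"

# Remove the working tree copy so Git has to extract it again,
# this time applying the new rule on the way out of the index.
rm "$repo/run.bat"
git -C "$repo" checkout -- run.bat

# Count the CR bytes in the freshly extracted file.
cr_after=$(tr -cd '\r' < "$repo/run.bat" | wc -c | tr -d ' ')
echo "CR bytes after re-checkout: $cr_after"
```

The one-line file should come back with a single CR, that is, with a CRLF ending, even though the committed copy is still LF-only.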
Meanwhile, if you run git add on some file, and Git thinks that the cached index copy is up to date (because you haven't edited the working tree copy, for instance), Git will silently do nothing at all. But if the working tree copy has CRLF line endings, and the index (and hence future commit) copy shouldn't, this is wrong. Using git add --renormalize is supposed to get around it, or you can "touch" the file so that Git sees a newer working-tree time stamp and will redo the copy. Or, you can even run git rm --cached on the file, and then git add really does have to copy it, because there's no longer a copy of that file in the index at all.
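The following sketch reproduces the situation from the question in a throwaway repository (a build.sh committed with CRLF endings) and then uses git add --renormalize to refresh the index copy. It assumes a Git recent enough to have --renormalize and ls-files --eol; the demo identity is made up.

```shell
#!/bin/sh
# Throwaway repository with build.sh committed using CRLF line endings.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config core.autocrlf false
printf '#!/bin/sh\r\nmvn clean install\r\n' > "$repo/build.sh"
git -C "$repo" add build.sh
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'build.sh with CRLF'
before=$(git -C "$repo" ls-files --eol build.sh)

# Add the attribute, then force Git to redo the working-tree-to-index copy
# (stderr is silenced only to hide Git's line-ending warning).
printf '*.sh text eol=lf\n' > "$repo/.gitattributes"
git -C "$repo" add .gitattributes
git -C "$repo" add --renormalize build.sh 2>/dev/null
after=$(git -C "$repo" ls-files --eol build.sh)

echo "before: $before"
echo "after:  $after"
```

The i/ column should change from crlf to lf, which means the next commit will hold an LF-only build.sh. If --renormalize does not behave this way in your Git, running git rm --cached build.sh followed by git add build.sh is the fallback described above.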
Using a .gitattributes file entry gives Git the most chance to get things right: Git can tell if the .gitattributes file entry affects some particular file. That gives Git the opportunity to do better cache checking, for instance. (Git currently doesn't use this opportunity properly, I think, but at least it offers the possibility.)
When you do use .gitattributes entries, they tell Git multiple things:
This lets you say that *.bat files need to be CRLF-ended in the working tree, even on a Linux system, and *.sh files need to be LF-ended in the working tree, even on a Windows system.
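To check which line-ending rule Git will apply to a given path, git check-attr can be asked directly. This sketch uses a throwaway repository and two made-up file names; the files do not need to exist, since attributes are matched against path names.

```shell
#!/bin/sh
# Throwaway repository holding only a .gitattributes file.
repo=$(mktemp -d)
git init -q "$repo"
printf '*.sh  text eol=lf\n*.bat text eol=crlf\n' > "$repo/.gitattributes"

# Ask Git which eol attribute applies to each (hypothetical) path.
sh_eol=$(git -C "$repo" check-attr eol -- build.sh)
bat_eol=$(git -C "$repo" check-attr eol -- run.bat)

echo "$sh_eol"
echo "$bat_eol"
```

Each output line has the form path: attribute: value, so the shell script should report lf and the batch file crlf regardless of the operating system the command runs on.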
You get as much control as Git is willing to give you:
The one thing you lose is the easy and global effect of core.eol and core.autocrlf: these affect existing commits, and tell Git to guess whether each file is text. As long as Git guesses right, that tends to work sort-of-OK. It's when Git guesses wrong that things go really bad. But because these settings affect every file extraction (index-to-work-tree) and every git add (work-tree-to-index) that actually happens, and it's hard to know which ones happen, it's very hard to see what's going on.