git utf-8 encode windows-1251

What is going on? Git sees changes where there are none


Git sees changes immediately after a clean clone.

I just cloned the project from the server, and one of my files is already marked as changed.

nick@DESKTOP-NUMBER MINGW64 /d
$ git clone http://nick@host/nick/test.git
Cloning into 'test'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 27 (delta 8), reused 0 (delta 0)
Unpacking objects: 100% (27/27), done.
error: failed to encode 'Var.not' from UTF-8 to Windows-1251

nick@DESKTOP-NUMBER MINGW64 /d
$ cd test/

nick@DESKTOP-NUMBER MINGW64 /d/test (master)
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   Var.not

no changes added to commit (use "git add" and/or "git commit -a")

error: failed to encode 'Var.not' from UTF-8 to Windows-1251

The file can be opened in VSC as UTF-8, but then it shows some unreadable symbols; normally it is Windows-1251. But what is the problem? The neighbouring file "Var.yes" has the same text and the same code page, yet it shows no problem and no pseudo-changes.

How can I fix it?


Solution

  • This bit of information that you added in a comment is crucial:

    I just use .gitattributes with line: *.* text working-tree-encoding=Windows-1251

    The working-tree-encoding directive has quite a few side effects. See the gitattributes documentation for more details, but I'll quote one more bit from that page in a moment.

    This error message from your question above:

    error: failed to encode 'Var.not' from UTF-8 to Windows-1251
    

    suggests that the contents of this file are not actually stored as UTF-8 data.

    One of the pitfalls listed in the gitattributes documentation is:

    For example, Microsoft Visual Studio resources files (*.rc) or PowerShell script files (*.ps1) are sometimes encoded in UTF-16.

    Perhaps this is the case for your Var.not file.
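
    One way to test that guess is to feed the stored bytes through the same two steps (decode as UTF-8, encode as Windows-1251) and see which one fails. The sketch below is only an approximation: Git uses iconv rather than Python's codecs, and diagnose is a hypothetical helper name, not a Git command.

```python
def diagnose(data: bytes, target: str = "cp1251") -> str:
    """Report which step of a UTF-8 -> target conversion would fail, if any."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"not valid UTF-8 (byte offset {err.start})"
    try:
        text.encode(target)
    except UnicodeEncodeError as err:
        return f"valid UTF-8, but {text[err.start]!r} has no {target} equivalent"
    return "converts cleanly"


# The same word stored two ways: as UTF-8 (what Git expects in the blob)
# and as raw Windows-1251 (which does not even decode as UTF-8).
good = "Привет".encode("utf-8")
bad = "Привет".encode("cp1251")
print(diagnose(good))
print(diagnose(bad))
```

    Running it on the bytes of the committed version of Var.not (not the work-tree copy) would show whether the blob is really UTF-8.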

    In any case:

    Am I wrong and working-tree-encoding edit and resave my files somehow?

    Yes, that is what working-tree-encoding does. To be totally precise, we need to talk about how Git stores files internally, and then extracts them to your work-tree so that you can use them, or copies them from your work-tree to internal format.

    Git internals: blob objects, or how files are frozen forever

    Git isn't really about files, but rather about commits. Each commit, once made, is (mostly) permanent and (completely) read-only / unchangeable. A commit holds files, though—or more precisely, has references to files—so by storing commits, Git effectively stores files.

    The form of the file, in storage, is important, though.

    Normally, Git just promises that a file is a bag of bytes. Whatever bytes you store in the file, Git will get them back for you. That's the case for raw data files—for files where you, in .gitattributes, say -text. It's the case for all files if you don't ask Git to muck with them, i.e., you don't mark them as text and set options like CRLF line endings or working-tree-encoding. But if you do ... well, first, let's get on with how the bag-of-bytes files work.

    Every commit stores a copy of every file—but with deduplication! Suppose you have a thousand commits, and each commit has a thousand files. This means you have Git storing one million versions of various files. But most of those versions of files are the same. That is, way back in your very first commit, you might have created a file you named README.md. You put some text in the file and put the file into your first commit.

    After that, you made another 99 commits using the same README.md. Then you changed it a bit and made the remaining 900 commits with the second version of README.md.

    The files in commits, like the commits themselves, are frozen for all time. So there's no need to make 1000 separate versions of README.md. We just need two versions: the first one, and the second one. The first 100 commits all share the first README.md. The last 900 commits all share the second one.

    In order to do this fast, and with space-savings, what Git does with a bag-of-bytes file is to compress it (with zlib deflate) and store that in what Git calls a blob object. This blob object gets a unique hash ID, just like each commit gets a unique hash ID. The hash ID of the first README.md is based on the data bytes in it. The hash ID of the second README.md is based on the data bytes in that second README.md. So there are only two blob objects, shared across all 1000 commits, with each commit referring to whichever object has the right frozen, compressed README.md contents.
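
    To make the deduplication concrete, here is a minimal sketch of how a blob's hash ID follows from its bytes. It assumes a repository using SHA-1 (the default); the file contents and the helper names blob_id and freeze_dry are made up for illustration.

```python
import hashlib
import zlib


def blob_id(data: bytes) -> str:
    """Name a blob the way a SHA-1 repository does: hash of header + bytes."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def freeze_dry(data: bytes) -> bytes:
    """Compress header + content with zlib, roughly as a loose object is stored."""
    return zlib.compress(b"blob %d\0" % len(data) + data)


# Hypothetical history: 100 commits with the first README.md, 900 with the second.
first, second = b"hello\n", b"hello, world\n"
history = [first] * 100 + [second] * 900

# Keyed by hash ID, identical contents collapse into a single stored object.
store = {blob_id(content): freeze_dry(content) for content in history}
print(len(history), "file versions,", len(store), "blob objects")
print(blob_id(first))
```

    The ID printed for the first file should match what git hash-object reports for a file containing just "hello" and a newline.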

    The upshot of all of this is that the file storage for each commit consists of these frozen, compressed blob objects. I like to call files in this form "freeze-dried": they're like freeze-dried coffee, to which you must add water. Rehydrating the freeze-dried files gets you the original contents—the original bag of bytes—back.

    Hence, to check out a commit, Git has to rehydrate all of its freeze-dried files. The commit holds the freeze-dried (and unmodifiable!) copies. The work-tree holds the regular-format files. We'll come back to this in a bit.

    Git internals: the index, A.K.A. staging area

    When you make a new commit, Git has to package up all of your files as new-or-re-used frozen blob objects. Other version control systems have done this by, for instance, re-freezing every file. This is pretty slow! Git, instead, does something clever.

    When you first check out some existing commit, Git doesn't just rehydrate its files. Git also stores references to the existing freeze-dried copies. This list of which files, as freeze-dried copies, make up the current commit lives in what Git calls, variously, the index, the staging area, or (rarely these days) the cache.

    In other words, the index lists all the blob hash IDs that went into extracting this commit into the work-tree.

    When you modify things in the work-tree, nothing happens to the index. You must run git add <file> on each of your modified files. This git add step copies the file from the work-tree. It re-compresses the bytes into the internal freeze-dried form. If necessary, this creates a new blob object on the spot. Now Git has the hash ID of a frozen-format, ready-to-commit file in the index.

    In other words, at all times, the index contains the next commit, ready to go. If you want an updated file to be updated in the next commit, you must run git add on it. This copies the file into the index, by way of looking up or creating an internal blob object, and once again, the index contains the next commit, ready to go.

    This is also why you have to keep running git add. Updating a work-tree file does not affect the index, and git commit makes new commits from whatever is in the index. If it's not in the index, it's not in the new commit. Whatever is in the index, that's what's in the new commit.
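
    A toy model may help here. In the sketch below the index is just a mapping from path to blob hash ID, and git_add and git_commit are hypothetical stand-ins that only mimic the behaviour described above; real Git stores far more per entry.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


objects = {}   # hash ID -> frozen content (compression omitted for brevity)
index = {}     # path -> hash ID: the proposed next commit


def git_add(path: str, worktree: dict) -> None:
    """Freeze the work-tree copy and record its hash ID in the index."""
    data = worktree[path]
    oid = blob_id(data)
    objects.setdefault(oid, data)   # reuse the blob if it already exists
    index[path] = oid


def git_commit() -> dict:
    """A commit snapshots whatever the index holds, not the work-tree."""
    return dict(index)


worktree = {"README.md": b"first version\n"}
git_add("README.md", worktree)
commit1 = git_commit()

worktree["README.md"] = b"second version\n"   # edited, but not added
commit2 = git_commit()                        # still records the first version

git_add("README.md", worktree)
commit3 = git_commit()                        # now records the second version
print(commit1 == commit2, commit2 == commit3)
```

    The second commit is identical to the first because the edit never reached the index.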

    Note that git status works by:

    1. Comparing the HEAD commit to the index. Whatever files are different, Git says staged for commit. When the two files are the same—when they are the same blob object—Git says nothing.

    2. Comparing the index to the work-tree. Whenever the work-tree file is different, Git says not staged for commit. When the two files would be the same (after appropriate rehydrating or freeze-drying), Git says nothing. (Note that there are two ways to compare: either freeze-and-compare-frozen, or rehydrate-and-compare-rehydrated. As far as I can tell, Git currently does the first of these, hashing the freeze-dried form of the work-tree file and comparing hash IDs, but the documentation makes no promises, so it could change without warning.)

    So the index, or staging area, is really what gets committed. Your work-tree only exists so that you can work with your files. Those files are never actually committed: what's committed is the freeze-dried stuff in the index.
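
    The two comparisons can be sketched as follows. This is a simplified model, not Git's implementation: it freezes the work-tree bytes and compares hash IDs purely for simplicity, and it ignores deleted files, untracked files and any conversion filters. The function name status and the file contents are assumptions.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def status(head: dict, index: dict, worktree: dict) -> dict:
    """Return the two lists a status report is built from.

    head and index map path -> blob hash ID; worktree maps path -> raw bytes.
    """
    # Comparison 1: HEAD commit versus index.
    staged = sorted(p for p in index if head.get(p) != index[p])
    # Comparison 2: index versus work-tree.
    unstaged = sorted(
        p for p in index if p in worktree and blob_id(worktree[p]) != index[p]
    )
    return {"staged": staged, "not staged": unstaged}


committed = blob_id(b"same text\n")
head = {"Var.yes": committed, "Var.not": committed}
index = dict(head)
# Var.not's work-tree bytes differ from what the index records.
worktree = {"Var.yes": b"same text\n", "Var.not": b"different bytes\n"}
print(status(head, index, worktree))
```

    In this model a file shows up as modified right after checkout whenever the bytes written to the work-tree do not freeze back to the blob they came from.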

    .gitattributes affects the freeze-drying and rehydrating process

    Note how, every time a file comes out of Git, it has to be rehydrated. Note how, every time a file goes into the index / staging-area, it has to be freeze-dried. These processes always fuss with the bag-of-bytes files, by compressing them with zlib deflate, or re-producing them with zlib inflate, as appropriate. The zlib deflate/inflate is a data-preserving operation: it never changes any of the bytes, in the end, after a round-trip (deflate + inflate).
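
    The round-trip claim is easy to check with Python's zlib module, which implements the same deflate format. The sample bytes are arbitrary and deliberately include every possible byte value.

```python
import zlib

# Some Windows-1251 text with CRLF, followed by all 256 byte values.
original = "Привет, мир\r\n".encode("cp1251") + bytes(range(256))

frozen = zlib.compress(original)       # deflate
rehydrated = zlib.decompress(frozen)   # inflate

# Compression alone never alters content, whatever the bytes mean.
print(rehydrated == original)
```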

    But because Git is already processing every byte of every file, this is the ideal place to change the bytes, too. For instance, suppose we want freeze-dried files to use line-feed endings always, but work-tree files on Windows to use CRLF line endings. We can tell Git to turn LF-only endings into CRLF whenever it rehydrates a text file, and to turn CRLF back into LF-only whenever it freeze-dries one.

    Because Git commits from the index (freeze-dried), not from the work-tree (rehydrated), this gets us just what we want. To do that, all we do is write:

    *.txt  text eol=crlf
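
    As a rough model of what that line asks for, the two directions look like this. The functions are a simplification with hypothetical names; Git's own conversion has extra safety checks that are not modelled here.

```python
def to_worktree(blob: bytes) -> bytes:
    """Rehydrate: turn LF-only endings into CRLF."""
    # Normalise first so an existing CRLF does not become CR CR LF.
    return blob.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def to_git(data: bytes) -> bytes:
    """Freeze-dry: turn CRLF endings back into LF-only."""
    return data.replace(b"\r\n", b"\n")


blob = b"line one\nline two\n"
checked_out = to_worktree(blob)
print(checked_out)
print(to_git(checked_out) == blob)
```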
    

    But we can have this do more than just LF/CRLF translations. In fact, using what Git calls clean and smudge filters, we can insert our own arbitrary operations. (That's how Git-LFS works.) Or, as in this particular case, we can set working-tree-encoding.

    Working-tree encoding affects the freeze-drying and rehydrating

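    The working-tree-encoding setting tells Git to re-encode each affected file from UTF-8 into the named encoding when it rehydrates the file into the work-tree, and to re-encode it back into UTF-8 when it freeze-dries the file into the index.

    For this to work, the blob objects must actually be UTF-8. Moreover, this operation (UTF-8 to whatever, whatever to UTF-8) needs to be consistent: if it's not, every commit could have some random re-encoding into UTF-8. This is the same round-trip idea as with deflate/inflate. But not all encodings make good guarantees here.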
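
    Here is a minimal sketch of both directions for Windows-1251, using Python's codecs as a stand-in for the iconv library Git relies on. The function names and the sample word are assumptions. It shows a stable round trip for a genuine UTF-8 blob and a failure for a blob that was committed as raw Windows-1251 bytes.

```python
WORKTREE_ENCODING = "cp1251"   # Python's name for Windows-1251


def to_worktree(blob: bytes) -> bytes:
    """Checkout direction: UTF-8 blob -> work-tree encoding."""
    return blob.decode("utf-8").encode(WORKTREE_ENCODING)


def to_git(data: bytes) -> bytes:
    """Add direction: work-tree encoding -> UTF-8 blob."""
    return data.decode(WORKTREE_ENCODING).encode("utf-8")


blob = "Переменная".encode("utf-8")
print(to_git(to_worktree(blob)) == blob)   # the round trip is stable

# A blob holding raw Windows-1251 bytes breaks the checkout direction,
# because those bytes are not valid UTF-8 to begin with.
mislabeled = "Переменная".encode("cp1251")
try:
    to_worktree(mislabeled)
except UnicodeDecodeError as err:
    print("cannot convert from UTF-8:", err.reason)
```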

    For (much) more about the pitfalls, more than the gitattributes documentation mentions, see Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and then, e.g., this article on Unicode combining characters and normalization, which shows that two strings that look the same ("Zoë") may be spelled with different byte sequences (the letter E followed by a combining umlaut, or a single lowercase-E-with-umlaut Unicode character).
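
    The "Zoë" case can be reproduced with Python's standard unicodedata module. This is only an illustration of the two spellings; it says nothing about which one a given editor or converter will produce.

```python
import unicodedata

composed = "Zo\u00eb"      # e-with-diaeresis as a single code point
decomposed = "Zoe\u0308"   # plain e followed by a combining diaeresis

# The strings render the same but are different code point sequences...
print(composed == decomposed)
# ...and therefore different UTF-8 byte sequences.
print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
# Normalising to NFC maps the decomposed spelling onto the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```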

    In your case, the most likely problem is that the input file is not UTF-8 to start with (but it could be a re-encoding error of some sort).