I have an existing repository with files in mixed encodings - some files are in UTF-8 and some are in ANSI (e.g. Windows-1252). Mostly everything works fine, except that I am getting tired of seeing "invalid characters" when performing a diff on the ANSI files, and I am particularly annoyed that I can't use my GUI tool to stage or unstage hunks with these characters. I am looking for a way to convince Git that a certain file uses a non-UTF-8 encoding so that Git would perform the conversion first and then do its magic against that.
As far as I can tell, there are two ways of achieving the result:
First, define a textconv diff driver in the Git config:
    [diff "win1252"]
        textconv = "iconv -f windows-1252 -t utf-8"
Then, in .gitattributes, mark the file as binary and request this filter to be used to convert it to text: *.txt diff=win1252
This approach seems to work fine in an isolated git diff, but I have encountered several issues which I don't know how to solve:
With core.autocrlf = true, this approach will not perform CRLF conversion on the output of the conversion command, so my diff will show end-of-line differences in changed lines. I can create a script which would run iconv to perform the encoding conversion, then pass the output to dos2unix which would perform EOL conversion, but it seems rather heavy-handed.
git add -p shows garbage (even worse than the "unknown characters") and SourceTree stops the staging with an error message that it can't find the original text.
While I might be able to learn to live with #1 and #2, #3 is a blocking problem because I mostly need these conversions done to facilitate staging of hunks with the "unknown characters" in them. My current workflow, where I use git add -p without any conversions, might display "unknown characters", but at least it works.
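For reference, the wrapper script I have in mind for the end-of-line issue would look roughly like the sketch below. The function name is made up, and I use tr instead of dos2unix only so the sketch does not depend on an extra tool. It would not help with the git add -p problem.

```sh
#!/bin/sh
# Hypothetical textconv wrapper (the function name is made up).
# Git calls a textconv command with one argument: the path of a file
# holding the blob contents to convert.
win1252_to_text() {
    # Decode Windows-1252 to UTF-8, then drop carriage returns so that
    # CRLF and LF line endings compare equal in the diff.
    # Caveat: tr removes every CR, not only those at the end of a line.
    iconv -f windows-1252 -t utf-8 "$1" | tr -d '\r'
}

# Convert only when a file argument was actually given.
if [ "$#" -gt 0 ]; then
    win1252_to_text "$1"
fi
```

Saved as an executable file on the PATH, it would replace the iconv command in the textconv setting.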
A change of GUI is impractical: all other GUIs I tried have much more serious problems than this.
Second, in .gitattributes, mark the file as being a text file with a custom encoding: *.txt text working-tree-encoding=windows-1252
As far as I can tell, this approach covers all the complaints listed above and works fine both on the command line and in the GUI. Unfortunately, there is a major caveat: it only works for files which were created after this attribute was set. For files created before I added this attribute, Git will display a change (from "unknown characters" to windows-1252) for every file which contains these unencoded characters. Also, after cloning the repository, it will complain that it "failed to encode 'a.txt' from UTF-8 to windows-1252". It seems the file actually got cloned correctly (byte-for-byte match against the original), but it still shows differences. Basically, I would have to commit every file with "unknown characters" to re-encode it to UTF-8 in the repository, which would wreak havoc on my history and pretty much make Blame unusable.
It seems that a realistic approach might be to use something like git filter-branch, but for the whole repository (is there anything like that?) to convert all existing files to UTF-8 and add the attribute to the very first commit, but I am worried about doing something this massive. Also, I expect that I would lose the prior commit IDs, which would be unfortunate (I stamp my executables with commit IDs to easily locate the version from which they were built).
Is there any way of overcoming the drawbacks of the methods described, or is there another method which would not be vulnerable to them?
You're on the right track by using working-tree-encoding, but there's one more step you need.
In the same commit where you create the .gitattributes file, run git add --renormalize . to take all the working tree files and filter them according to the specified encoding. Then you'll want to commit all of the changed files and the .gitattributes file in the same commit, and thereafter they'll be stored in the repo as UTF-8 but be Windows-1252 in your working tree.
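As a rough sketch, the sequence could be wrapped up like this. The function name, its arguments, and the commit message are made up for illustration, and it assumes you run it from the top of the working tree with a Git version that supports both working-tree-encoding and --renormalize.

```sh
# Hypothetical helper: the name and arguments are illustrative only.
# $1 = .gitattributes pattern (e.g. '*.txt')
# $2 = working tree encoding  (e.g. windows-1252)
renormalize_encoding() {
    pattern=$1
    encoding=$2
    # Declare the working tree encoding for the matching files.
    printf '%s text working-tree-encoding=%s\n' "$pattern" "$encoding" >> .gitattributes
    git add .gitattributes
    # Re-run the clean conversion on all tracked files, so the blobs
    # staged in the index are UTF-8.
    git add --renormalize .
    # Commit the attribute and the re-encoded files together.
    git commit -m "Store $pattern as UTF-8 (working tree: $encoding)"
}
```

For the files in the question this would be called as renormalize_encoding '*.txt' windows-1252. It is worth trying on a throwaway clone first and checking the result with git show before pushing.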
This does have the downside that git blame will have to jump back beyond that commit, but you can specify --ignore-rev or --ignore-revs-file (or the config option blame.ignoreRevsFile) to ignore that revision, and everything will work.
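A minimal sketch of the config-based variant follows. The helper name is made up, and .git-blame-ignore-revs is just a commonly used file name, not something Git requires.

```sh
# Hypothetical helper: records a revision that git blame should skip.
# $1 = any revision name, e.g. HEAD or the ID of the re-encoding commit.
ignore_in_blame() {
    # Resolve the argument to a full commit ID; stop if it is not a commit.
    rev=$(git rev-parse --verify "$1^{commit}") || return 1
    # The file name is a convention chosen here, not mandated by Git.
    printf '%s\n' "$rev" >> .git-blame-ignore-revs
    # Make git blame read the list automatically in this repository.
    git config blame.ignoreRevsFile .git-blame-ignore-revs
}
```

Run ignore_in_blame HEAD right after making the re-encoding commit. For a one-off, git blame --ignore-rev with the commit ID does the same without touching the config.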