
In Git, how to diff Microsoft Word documents?


I've been following this guide here on how to diff Microsoft Word documents, but I ran into this error:

Usage:  /usr/bin/docx2txt.pl [infile.docx|-|-h] [outfile.txt|-]
        /usr/bin/docx2txt.pl < infile.docx
        /usr/bin/docx2txt.pl < infile.docx > outfile.txt

        In second usage, output is dumped on STDOUT.

        Use '-h' as the first argument to get this usage information.

        Use '-' as the infile name to read the docx file from STDIN.

        Use '-' as the outfile name to dump the text on STDOUT.
        Output is saved in infile.txt if second argument is omitted.

Note:   infile.docx can also be a directory name holding the unzipped content
        of concerned .docx file.

fatal: unable to read files to diff

To explain how I got that error: I created a .gitattributes file in the repository I want to diff. It looks like this:

*.docx diff=word
*.docx difftool=word

I've installed docx2txt. I'm on Linux. I've created a file called docx2txt which contains this:

#!/bin/bash
docx2txt.pl $1 -

I ran chmod a+x docx2txt and put docx2txt in /usr/bin/.

I did:

$ git config diff.word.textconv docx2txt

Then I tried to diff two Microsoft Word documents. That's when I got the error I mentioned above.

What am I missing? How do I resolve this error?

PS: I don't know if my shell can find docx2txt because when I do this:

$ docx2txt

my terminal freezes, processing something, but doesn't output anything, and when I do these commands this happens:

$ man docx2txt
No manual entry for docx2txt
$ docx2txt --help
Can't read docx file <--help>!

UPDATE on progress: I changed docx2txt to

#!/bin/bash
docx2txt.pl "$1" -

as pmod suggested, and now git diff <commit> works from the command line! Yay!

However, when I try

$ git difftool <commit>

Git launches kdiff3, and I get this pop-up error:

Some input characters could not be converted to valid unicode.
You might be using the wrong codec. (e.g. UTF-8 for non UTF-8 files).
Don't save the result if unsure. Continue at your own risk.
Affected input files are in A, B.

...and all of the characters in the files are mumbo jumbo. The command line displays the diff text correctly, but kdiff3 does not display the text from the diff correctly for some reason.

How do I display the text for the diff correctly in kdiff3 or another GUI tool? Should I change kdiff3 to another tool?

Extra: My shell doesn't seem to be able to find docx2txt, judging by the output of these commands:

$ which doctxt
which: no doctxt in (/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl)

$ which docx2txt
/usr/bin/docx2txt

Solution

  • According to its usage message, docx2txt.pl takes either no arguments (reading from STDIN) or up to two: an input followed by an optional output, each of which is a filename or "-". Your wrapper uses the two-argument form, so it looks correct except when the filename passed as the first argument contains a space. In that case the unquoted $1 is split at the space, each part is passed as a separate argument, and the tool prints its usage information because it receives more than two arguments.

    Try using quotes to avoid filename splitting:

    #!/bin/bash
    docx2txt.pl "$1" -
    

    PS: I don't know if my shell can find docx2txt

    You can check this with

    $ which docx2txt
    

    If it prints a path, then the tool (a binary or an executable script) can be found via the PATH environment variable.

    because when I do this:

    $ docx2txt

    my terminal freezes, processing something, but doesn't output anything

    Without arguments, $1 expands to nothing, so your script runs "docx2txt.pl -", which according to the tool's usage reads the input file from STDIN, i.e. whatever you type. It therefore looks as if it is hanging and processing something, but it is only waiting for your input.
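
To make both effects visible without docx2txt.pl installed, here is a minimal sketch. show_args, unquoted_wrapper and quoted_wrapper are hypothetical names invented for this illustration; show_args merely stands in for docx2txt.pl and prints the arguments it receives, so you can see what the two wrapper variants actually pass on.

```shell
#!/bin/bash
# show_args is a hypothetical stand-in for docx2txt.pl (not part of the tool):
# it prints how many arguments it received, then each one in angle brackets.
show_args() {
    printf '%d:' "$#"
    printf ' <%s>' "$@"
    printf '\n'
}

# The two wrapper bodies discussed above, with show_args replacing docx2txt.pl.
unquoted_wrapper() { show_args $1 -; }    # original: $1 undergoes word splitting
quoted_wrapper()   { show_args "$1" -; }  # fixed: $1 is passed as one argument

unquoted_wrapper "my report.docx"   # prints: 3: <my> <report.docx> <->
quoted_wrapper "my report.docx"     # prints: 2: <my report.docx> <->
unquoted_wrapper                    # prints: 1: <->
```

The first call shows the three arguments that trigger the usage message, the second shows the quoted form keeping the filename intact, and the third shows why running the original wrapper with no arguments leaves the tool waiting on STDIN.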