In Git, how to diff Microsoft Word documents?

I've been following this guide here on how to diff Microsoft Word documents, but I ran into this error:

Usage:  /usr/bin/docx2txt.pl [infile.docx|-|-h] [outfile.txt|-]
        /usr/bin/docx2txt.pl < infile.docx
        /usr/bin/docx2txt.pl < infile.docx > outfile.txt

        In second usage, output is dumped on STDOUT.

        Use '-h' as the first argument to get this usage information.

        Use '-' as the infile name to read the docx file from STDIN.

        Use '-' as the outfile name to dump the text on STDOUT.
        Output is saved in infile.txt if second argument is omitted.

Note:   infile.docx can also be a directory name holding the unzipped content
        of concerned .docx file.

fatal: unable to read files to diff

To explain how I came to that error: I created a .gitattributes in the repository I want to diff from. .gitattributes looks like this:

*.docx diff=word
*.docx difftool=word

I've installed docx2txt. I'm on Linux. I've created a file called docx2txt which contains this:

#!/bin/bash
docx2txt.pl $1 -

I ran chmod a+x docx2txt and put docx2txt in /usr/bin/.

I did:

$ git config diff.word.textconv docx2txt

Then I tried to diff two Microsoft Word documents. That's when I got the error I mentioned above.

What am I missing? How do I resolve this error?

PS: I don't know if my shell can find docx2txt because when I do this:

$ docx2txt

my terminal freezes, processing something, but doesn't output anything, and when I do these commands this happens:

$ man docx2txt
No manual entry for docx2txt
$ docx2txt --help
Can't read docx file <--help>!

UPDATE on progress: I changed docx2txt to

#!/bin/bash
docx2txt.pl "$1" -

as pmod suggested, and now git diff <commit> works from the command line! Yay!

However, when I try

$ git difftool <commit>

Git launches kdiff3, and I get this pop-up error:

Some input characters could not be converted to valid unicode.
You might be using the wrong codec. (e.g. UTF-8 for non UTF-8 files).
Don't save the result if unsure. Continue at your own risk.
Affected input files are in A, B.

...and all of the characters in the files are garbled. The command line displays the diff text correctly, but kdiff3 does not.

How do I display the text for the diff correctly in kdiff3 or another GUI tool? Should I change kdiff3 to another tool?

Extra: My shell doesn't seem to be able to find docx2txt, because of these commands:

$ which doctxt
which: no doctxt in (/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl)

$ which docx2txt
/usr/bin/docx2txt
Relict answered 1/12, 2015 at 14:58 Comment(0)

According to its usage message, docx2txt.pl accepts at most two arguments (or none, when it reads from STDIN), and each argument is either a filename or "-". So your wrapper script looks correct except when the filename passed as the first argument contains a space. In that case the unquoted $1 is split into several words that are passed as separate arguments, and the tool prints its usage information because it receives more than two arguments.

Try using quotes to avoid filename splitting:

#!/bin/bash
docx2txt.pl "$1" -
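
To see the splitting in isolation, here is a minimal sketch. The count_args helper and the file name "my report.docx" are made up for illustration; nothing here calls docx2txt.pl itself.

```shell
#!/bin/bash
# Hypothetical helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

# Hypothetical file name containing a space.
file="my report.docx"

# Unquoted: the shell splits the value at the space, so the helper
# receives "my", "report.docx" and "-".
unquoted=$(count_args $file -)

# Quoted: the value stays a single argument, followed by "-".
quoted=$(count_args "$file" -)

echo "unquoted: $unquoted arguments"   # prints "unquoted: 3 arguments"
echo "quoted:   $quoted arguments"     # prints "quoted:   2 arguments"
```

Three arguments is one more than the usage message allows, which would explain why the tool falls back to printing its usage text.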

PS: I don't know if my shell can find docx2txt

You can check this with

$ which docx2txt

If it prints a path, then the tool (a binary or an executable script) can be found via the PATH environment variable.
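
The same check can be scripted with the shell builtin command -v, which prints the resolved path and returns a non-zero status when nothing is found. This is only a sketch; the report_tool function name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reports whether a command can be resolved via PATH.
report_tool() {
    local path
    if path=$(command -v "$1"); then
        echo "$1 found at $path"
    else
        echo "$1 not found on PATH"
    fi
}

report_tool docx2txt   # the wrapper script from the question
report_tool doctxt     # a misspelled name, expected to be missing
```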

because when I do this:

$ docx2txt

my terminal freezes, processing something, but doesn't output anything

Without arguments, your script runs docx2txt.pl -, which according to the tool's usage reads the input file from STDIN, i.e. whatever you type. So it looks as if it is hanging and processing something, but it is actually just waiting for your input.
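
If you would rather get an immediate message than a silent wait, the wrapper could check its argument count first. This is a sketch of one possible guard, not part of the original script; the function name docx2txt_wrapper, the usage text, and the exit status 2 are my own choices.

```shell
#!/bin/bash
# Sketch of the wrapper logic with an argument check added.
docx2txt_wrapper() {
    if [ "$#" -ne 1 ]; then
        # Refuse to fall through to reading STDIN by accident.
        echo "usage: docx2txt <file.docx>" >&2
        return 2
    fi
    # Same call as in the fixed wrapper: convert the file to text on STDOUT.
    docx2txt.pl "$1" -
}
```

Inside the real wrapper script you would put the body of the function directly in the file and use exit 2 instead of return 2.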

Handicap answered 1/12, 2015 at 23:19 Comment(3)
I changed docx2txt as you recommended, and git diff works now! Thank you so much for this tip; it really helped me out. However, git difftool <commit> throws the error I describe above in UPDATE, about the input characters not being converted to valid Unicode. I don't understand it. Any idea how to fix it? - Relict
@Relict thank you for accepting. Please use "$ which docx2txt"; it was a typo in my answer. As for the new question/update, please create a new question, because SE works on a question->answer basis and it will be easier for others to find the solution. - Handicap
@Relict OK, I posted an answer to your other question; at least it has kdiff3 and "readable character" in the title, so that's closer to the matter. - Handicap

You can use Pandoc to convert to Markdown:

pandoc -f docx -t markdown -o outfile.md infile.docx

and then use Meld, which has a great GUI, to compare the documents.

See also: https://askubuntu.com/questions/515900/how-to-compare-two-files
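
As a rough sketch, the two steps could be combined in a small script. It assumes pandoc and meld are both on PATH; the function name compare_docx and the temporary file names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert two .docx files to Markdown and open the
# results side by side in Meld.
compare_docx() {
    local dir
    dir=$(mktemp -d) || return 1

    # Without -o, pandoc writes the converted text to standard output.
    pandoc -f docx -t markdown "$1" > "$dir/old.md" &&
    pandoc -f docx -t markdown "$2" > "$dir/new.md" &&
    meld "$dir/old.md" "$dir/new.md"
    local status=$?

    # The Markdown copies are left in place; delete the directory when done.
    echo "Markdown copies are in $dir" >&2
    return "$status"
}
```

Usage would be something like compare_docx old.docx new.docx. Because pandoc writes to standard output when -o is omitted, the same conversion should also work as a Git textconv filter, e.g. git config diff.word.textconv "pandoc -f docx -t markdown", though I have not tested that against the setup in the question.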

Oxalate answered 17/1, 2017 at 13:56 Comment(0)
