Can I make git recognize a UTF-16 file as text?
S

10

171

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.

Schutzstaffel answered 22/4, 2009 at 15:51 Comment(0)
W
91

I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.

Waldos answered 19/8, 2009 at 15:55 Comment(6)
Not a perfect solution (would rather have a scrolling unified diff), BUT, it is the lesser evil given the choices and my unwillingness to find something new to install. "vimdiff", it is! (yea, vim ... and git)Embus
Does this also work to stage and commit only chunks of UTF16 files?Surd
I use Beyond Compare as a diff and merge tool. From .gitconfig: [difftool "bc3"] path = c:/Program Files (x86)/Beyond Compare 3/bcomp.exe [mergetool "bc3"] path = c:/Program Files (x86)/Beyond Compare 3/bcomp.exeBar
@Tom Wilson Sorry unable to format code block by indenting 4 spaces!?Bar
I have basic knowledge for git and not sure how it handles file changes. Is this always as binary files or for text (ASCII) there is special processing / detection of changes?Malo
I use gvimdiff rather than vimdiff. Vim without GUI looks terrible on windows.Popeyed
E
83

There is a very simple solution that works out of the box on Unices.

For example, with Apple's .strings files just:

  1. Create a .gitattributes file in the root of your repository with:

     *.strings diff=localizablestrings
    
  2. Add the following to your ~/.gitconfig file:

     [diff "localizablestrings"]
     textconv = "iconv -f utf-16 -t utf-8"
    

Source: Diff .strings files in Git (and older post from 2010).
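
To try the two steps above without touching ~/.gitconfig, here is a minimal sketch in a throwaway repository. It assumes git and iconv are on the PATH. The file name Localizable.strings, its contents and the commit identity are made up for illustration, and the driver is set repo-locally with git config rather than by editing ~/.gitconfig.

```shell
#!/bin/sh
# Sketch only: scratch repository, hypothetical file name and contents.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Step 1: map *.strings files to a diff driver called "localizablestrings".
printf '*.strings diff=localizablestrings\n' > .gitattributes

# Step 2: define the driver (repo-local here instead of ~/.gitconfig).
git config diff.localizablestrings.textconv "iconv -f utf-16 -t utf-8"

# Commit a UTF-16 file, then change it in the working tree.
printf '"greeting" = "hello";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings
git add .gitattributes Localizable.strings
git commit -q -m "Add UTF-16 strings file"
printf '"greeting" = "goodbye";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings

# git diff runs both sides through textconv and shows a readable text diff.
diff_output=$(git diff)
printf '%s\n' "$diff_output"
```

Keep in mind that a textconv diff is for reading only; it is not a patch you can apply to the UTF-16 file.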

Eyewitness answered 9/1, 2014 at 12:42 Comment(7)
I did this but git refuses to run after this. The error I get is "bad config file line 4 in /Users/myusername/.gitconfig". I used "git config --global --edit" to open my gitconfig file. Interestingly if I remove the added lines all works fine. Any clues ?Trutko
I am going to guess the smart quotes if you copy/pasted. I edited the answer to fix that.Mainz
This works like a charm, it should be the accepted answer for the sake of simplicity and for a better integration. I don't see how "use another tool" can be the answer to "Can I make git recognize a UTF-16 file as text?"Hyetology
@Hyetology Strictly, iconv is "another tool" in just the same way as Vim or Beyond Compare is (not part of the git suite).Sheena
@AgiHammerthief sure after reading again I agree, dunno what I was thinking about. FWIW vimdiff and iconv are both already present on macOS so you don't need to bother wondering where to get them, and they do the jobHyetology
Thanks for this answer, it works perfectly. The thing I'm wondering is, is it somehow possible to have this change affect other developers of this repo? Yes, .gitattributes is committed, but the lines added to ~/.gitconfig are not. Could these lines be added to .gitattributes as well, or is there a better way to do this?Gazebo
One other virtue here is that related commands also work: git log -p file.unicode takes advantage of this.Storied
A
46

Have you tried setting your .gitattributes to treat it as a text file?

e.g.:

*.vmc diff

More details at http://www.git-scm.com/docs/gitattributes.html.

Andvari answered 22/4, 2009 at 16:42 Comment(2)
This works, but for correctness please be aware that this sets two attributes: set and diff...Motta
This solution is the only acceptable for me. As per @OK comment, the "set" is irrelevant here, just *.vmc diff , *.sql diff etc.. is needed to set the 'diff' attribute for the path specified. (I can't edit the answer). 2 caveats however : diffs are shown with a space between each character, and not possible to "stage hunk" or "discard hunk" for those problematic files.Diseuse
C
35

By default, it looks like git won't work well with UTF-16; for such a file you have to make sure that no CRLF processing is done on it, but you want diff and merge to work as a normal text file (this is ignoring whether or not your terminal/editor can handle UTF-16).

But looking at the .gitattributes manpage, here is the custom attribute that is binary:

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16
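
If you want to check that the macro expands the way you expect, git check-attr can report the attributes for a path. A small sketch, assuming a scratch repository and a made-up file name machine.vmc:

```shell
#!/bin/sh
# Sketch only: scratch repository, hypothetical file name.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Top-level .gitattributes: define the utf16 macro, then apply it to *.vmc.
cat > .gitattributes <<'EOF'
[attr]utf16 diff merge -crlf
*.vmc utf16
EOF

: > machine.vmc

# Ask git which of the three attributes end up set or unset for the file.
attrs=$(git check-attr diff merge crlf -- machine.vmc)
printf '%s\n' "$attrs"
```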

Also note that you should still be able to diff a file, even if git thinks it's binary with:

git diff --text

Edit

This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

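#!/bin/bash
# git passes: path old-file old-hex old-mode new-file new-hex new-mode
diff <(iconv -f utf-16 -t utf-8 "$2") <(iconv -f utf-16 -t utf-8 "$5")
exit 0  # diff exits 1 when the files differ, which git would treat as a failure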
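
One way to run such a script only for certain extensions is to register it as a named diff driver and select it from .gitattributes. The sketch below is one possible wiring, not the only one. It assumes git, iconv and diff are on the PATH; the wrapper name utf16-diff.sh, the driver name utf16 and the file machine.vmc are all hypothetical. The wrapper is plain POSIX sh with temporary files instead of bash process substitution, and it reads the old and new files from the second and fifth arguments, which is where git puts them when it calls an external diff command.

```shell
#!/bin/sh
# Sketch only: scratch directory, hypothetical script, driver and file names.
set -e
work=$(mktemp -d)
mkdir "$work/repo"

# Wrapper script. Git calls an external diff command with seven arguments:
# path old-file old-hex old-mode new-file new-hex new-mode.
cat > "$work/utf16-diff.sh" <<'EOF'
#!/bin/sh
old=$(mktemp)
new=$(mktemp)
iconv -f utf-16 -t utf-8 "$2" > "$old"
iconv -f utf-16 -t utf-8 "$5" > "$new"
diff -u "$old" "$new"
rm -f "$old" "$new"
exit 0   # diff exits 1 on differences; git expects 0 from the driver
EOF
chmod +x "$work/utf16-diff.sh"

cd "$work/repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Only *.vmc files are routed to the "utf16" driver.
printf '*.vmc diff=utf16\n' > .gitattributes
git config diff.utf16.command "$work/utf16-diff.sh"

printf 'memory = 512\n' | iconv -f utf-8 -t utf-16 > machine.vmc
git add .gitattributes machine.vmc
git commit -q -m "Add UTF-16 file"
printf 'memory = 1024\n' | iconv -f utf-8 -t utf-16 > machine.vmc

# git diff uses external drivers by default (--no-ext-diff turns that off).
ext_output=$(git diff)
printf '%s\n' "$ext_output"
```

If the path to the wrapper contains spaces, it would need quoting inside the config value.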

Note that converting to UTF-8 might work for merging as well, you just have to make sure it's done in both directions.

As for the output to the terminal when looking at a diff of a UTF-16 file:

Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).

Calve answered 22/4, 2009 at 16:40 Comment(4)
Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.Schutzstaffel
GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).Calve
@jared-oberhaus - is there a way to trigger this script only for certain types of files (i.e. given certain extension)?Apologete
Consider adding a remark to your answer that these changes don't do anything automatically. The git add --renormalize . command should be executed on an existing repository.Bromeosin
F
20

Git has recently begun to understand encodings such as UTF-16. See the gitattributes docs and search for working-tree-encoding.

[Make sure your man page matches since this is quite new!]

If (say) the file is UTF-16 without BOM on a Windows machine, then add to your .gitattributes file

*.vmc text working-tree-encoding=UTF-16LE eol=CRLF

If UTF-16 (with BOM) on *nix make it:

*.vmc text working-tree-encoding=UTF-16 eol=LF

(Replace *.vmc with *.whatever for whatever type files you need to handle)

See: Support working-tree-encoding "UTF-16LE-BOM".
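
To see what the attribute actually does, here is a small sketch in a scratch repository. It assumes Git 2.18 or newer (built with iconv support) and the iconv command; the file machine.vmc and its one line of content are made up. It uses UTF-16LE without a BOM and leaves out eol=... to keep the example short.

```shell
#!/bin/sh
# Sketch only: scratch repository, hypothetical file name and contents.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Tell git that *.vmc files are UTF-16LE (no BOM) in the working tree.
printf '*.vmc text working-tree-encoding=UTF-16LE\n' > .gitattributes

# Create a UTF-16LE file and commit it.
printf 'memory = 512\n' | iconv -f UTF-8 -t UTF-16LE > machine.vmc
git add .gitattributes machine.vmc
git commit -q -m "Add UTF-16LE file"

# The committed blob is UTF-8; the working tree file stays UTF-16LE.
stored=$(git cat-file -p HEAD:machine.vmc)
printf '%s\n' "$stored"
```

Because the repository copy is plain UTF-8, git diff, git log -p and web front ends see ordinary text for files committed this way.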


Added later

Following @Hackslash, one may find that this is insufficient

 *.vmc text working-tree... 

To get nice text-diffs you need

 *.vmc diff working-tree...

Putting both works as well

 *.vmc text diff working-tree... 

But it's arguably

  • Redundant — eol=... implies text
  • Verbose — a large project could easily have dozens of different text file types

The Problem

Git has a macro-attribute binary which means -text -diff. The opposite +text +diff is not available built-in but git gives the tools (I think!) for synthesizing it

The solution

Git allows one to define new macro attributes.

I'd propose that at the top of the .gitattributes file you have

 [attr]textfile text diff

Then for all paths that need to be text and diff do

 path textfile working-tree-encoding= eol=...

Note that in most cases we would want the default encoding (utf-8) and the default eol (native), so both settings may be dropped.

Most lines should look like

*.c textfile
*.py textfile
Etc

Why not just use diff?

Practical: In most cases we want native eol. Which means no eol=... . So text won't get implied and needs to be put explicitly.

Conceptual: Text Vs binary is the fundamental distinction. eol, encoding, diff etc are just some aspects of it.

Disclaimer

Due to the bizarre times we are living in I don't have a machine with a current working git. So I'm unable at the moment to check the latest addition. If someone finds something wrong, I'll emend/remove.

Forbear answered 14/2, 2019 at 5:2 Comment(11)
To get my UTF-16LE-BOM file to work I had to use *.vmc diff working-tree-encoding=UTF-16LE-BOM eol=CRLFDensitometer
@Densitometer : Thanks for the heads-up. I guess you're saying with text alone you didn't get nice text diffs? Can you please check that with both text and diff everything works fine? In which case I'll make a different recommendationForbear
Correct, text alone results in binary compare. I can do diff or text diff and it works. I needed to add -BOM simply because my file had a BOM, YMMV.Densitometer
@Densitometer I've incorporated your finding. It would be great if you could check it out!Forbear
Thanks @Rusi, makes sense to me.Densitometer
This should be the correct answer. Only thing I would change is the -BOM. Git's own documentation says to use UTF-16 if you have byte order marks. The only time they mention using BOM is if you are using UTF-16LE with byte order marks. You can find more info here: git-scm.com/docs/gitattributesDocumentation
Also had to use UTF-16LE-BOM. Try to open the file with Notepad++, it will tell you under the Encoding menu what type of UTF needs to be used. Also, don't forget to update your git, this is really a new feature that is not available on (e.g.) standard git from apt for Ubuntu 20.04.Arvizu
This is by far the best answer.Oatis
@Oatis Yeah... Unfortunately SE does not believe in flagging old hi vote answers that are obsoleted by the software changingForbear
@Oatis JFTR git's working-tree-encoding subsumes use of iconv see patchwork.kernel.org/project/git/patch/…Forbear
That makes sense, iconv is the standard conversion tool afterall! Thanks for the link.Oatis
J
8

Solution is to filter through cmd.exe /c "type %1". cmd's type builtin will do the conversion, and so you can use that with the textconv ability of git diff to enable text diffing of UTF-16 files (should work with UTF-8 as well, although untested).

Quoting from gitattributes man page:


Performing text diffs of binary files

Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).

The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.

For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):

[diff "jpg"]
        textconv = exif

This is a solution for mingw32; Cygwin fans may have to alter the approach. The issue is with passing the filename to convert to cmd.exe: it will be using forward slashes, and cmd assumes backslash directory separators.

Step 1:

Create the single argument script that will do the conversion to stdout. c:\path\to\some\script.sh:

#!/bin/bash
# Turn each forward slash in the path into backslashes for cmd.exe.
SED='s/\//\\\\\\\\/g'
FILE=$(echo "$1" | sed -e "$SED")
cmd.exe /c "type $FILE"

Step 2:

Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config or see man git-config), put this:

[diff "cmdtype"]
textconv = c:/path/to/some/script.sh

Step 3:

Point out files to apply this workaround to by utilizing .gitattributes files (see man gitattributes(5)):

*.vmc diff=cmdtype

then use git diff on your files.

Jezabel answered 9/7, 2009 at 3:48 Comment(3)
Almost as Tony Kuneck's but without "c:/path/to/some/script.sh" entropy.ch/blog/Developer/2010/04/15/…Daggna
I have some problem with the script as shown above with Git for Windows but I found the following is fine and also can deal with spaces in the path: cmd //c type "${1//\//\\}" .Schlenger
This will work without the need to create a script file: textconv = powershell -NoProfile -Command \"& {Get-Content \\$args[0]}\"Sheffield
M
4

I have written a small git-diff driver, to-utf8, which should make it easy to diff any non-ASCII/UTF-8 encoded files. You can install it using the instructions here: https://github.com/chaitanyagupta/gitutils#to-utf8 (the to-utf8 script is available in the same repo).

Note that this script requires both file and iconv commands to be available on the system.

Millicent answered 2/4, 2013 at 8:37 Comment(0)
Z
3

Had this problem on Windows recently, and the dos2unix and unix2dos bins that ship with git for windows did the trick. By default they're located in C:\Program Files\Git\usr\bin\. Observe this will only work if your file doesn't need to be UTF-16. For example, someone accidentally encoded a python file as UTF-16 when it didn't need to be (in my case).

PS C:\Users\xxx> dos2unix my_file.py
dos2unix: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 Unix format...

and

PS C:\Users\xxx> unix2dos my_file.py
unix2dos: converting UTF-16LE file my_file.py to ANSI_X3.4-1968 DOS format...
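
If dos2unix isn't at hand, iconv can do the same kind of one-off re-encoding. A minimal sketch, assuming the file is UTF-16LE without a BOM (use -f UTF-16 instead if it starts with a BOM); the file name and its content are made up:

```shell
#!/bin/sh
# Sketch only: scratch directory, hypothetical file name and contents.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for a Python file that was saved as UTF-16LE by mistake.
printf 'print("hello")\n' | iconv -f UTF-8 -t UTF-16LE > my_file.py

# Re-encode to UTF-8 via a temporary file, then replace the original.
iconv -f UTF-16LE -t UTF-8 my_file.py > my_file.py.utf8
mv my_file.py.utf8 my_file.py

converted=$(cat my_file.py)
printf '%s\n' "$converted"
```

Unlike dos2unix, this leaves the line endings as they are.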
Zannini answered 24/7, 2018 at 15:46 Comment(0)
C
2

As described in other answers, git diff doesn't handle UTF-16 files as text, and this makes them unviewable in Atlassian SourceTree, for example. If the file name or suffix is known, the fix below will make those files viewable and comparable normally under SourceTree.

If the file suffix of the UTF-16 files is known (*.uni for example) then all files with that suffix can be associated with UTF-16 to UTF-8 converter with the following two changes:

  1. Create or modify the .gitattributes file in the root directory of the repository with the following line:

     *.uni diff=utf16
    
  2. Then modify the .gitconfig file in the user's home directory (C:\Users\yourusername\.gitconfig) with the following section:

    [diff "utf16"]
        textconv = "iconv -f utf-16 -t utf-8"
    

These two changes should take effect immediately without reloading the repository into SourceTree. It applies the text conversion to all *.uni files which makes them viewable and comparable like other text files. If other files need this conversion you can add additional lines to the .gitattributes file. (If the designated file(s) are NOT UTF-16 you will get unreadable results for that file.)

Note that this answer is a simplified rewrite of Tony Kuneck's answer.

Carmelacarmelia answered 29/3, 2021 at 14:51 Comment(0)
C
2

The git documentation on gitattributes gives a brief and nice explanation on the encoding topic -

Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …​) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary and consequently built-in Git text processing tools (e.g. git diff) as well as most Git web front ends do not visualize the contents of these files by default.

However, the working-tree-encoding attribute allows you to tell Git which files should be re-encoded (to UTF-8) before being stored in the repository. They are later "returned" to their original encoding when "copied" to the working directory.

Disclaimer - (Perhaps) everything here has been said in the other answers, and some even gave a lot more details on how to fix your issue. However, the quote I included made me realize how simple the answer to "Can Git handle encoding other than UTF-8?" is after browsing for it for hours...

Charmain answered 10/8, 2022 at 13:23 Comment(2)
Note that the working-tree-encoding feature was new in Git 2.18 (released April 2018); older versions of Git lack this very useful trick. (The original question and most answers predate it!)Maymaya
I guess what you are observing is that git chooses terminologies (actually models, ontologies) that are largely focussed around git's processes not the user's. So the very word/term working-tree-encoding is, I am sure, not something which any user is going to think of. A more user-oriented, layperson term would be file-type or at best file-encoding.Forbear
