Can't get git to play nice with iconv and utf-16

I'm trying to get git to recognize UTF-16 as text to allow me to diff and patch as text natively, but I'm having trouble getting the textconv parameter to work.

I can manually call

iconv -f utf-16 -t utf-8 some-utf-16-file.rc

and everything is fine. But if I configure my .gitconfig as follows

[diff "utf16"]
    textconv = "iconv -f utf-16le -t utf-8"

and my .gitattributes:

# Custom for MFC
*.rc text eol=crlf diff=utf16

and then run git diff, the following is displayed:

iconv: C:/Users/Mahmoud/AppData/Local/Temp/IjLBZ8_OemKey.rc:104:1: incomplete character or shift sequence

With procmon I was able to track it down as creating this process:

sh -c "iconv.exe -f utf-16le -t utf-8 \"$@\"" "iconv.exe -f utf-16le -t utf-8" C:/Users/Mahmoud/AppData/Local/Temp/JLOkVa_OemKey.rc

...which I can actually run fine (on the actual file, though).

Any ideas?

(Please note that I'm aware of the various solutions for getting git to work with UTF-16. I'm specifically trying to address this question of why iconv by itself works but it will not work when called by git. Also, this error was originally encountered while trying one of the linked solutions from the "duplicate" question. Thank you all kindly.)

Jostle answered 4/6, 2016 at 16:16 Comment(7)
Try this: Can I make git recognize a UTF-16 file as text?, or this: #3915572Dermatology
@Dermatology My question is actually specifically about getting git and iconv to work nice, not about getting git to work with UTF-16; but thanks!Jostle
Not sure about this - could it have to do with the iconv.exe being binary?Cupel
@Briana I don't think so, sysmon shows it is executed OK.Jostle
Remember - from the DOS/Windows "command line", there's a whole BUNCH of different actors involved: including Cygwin and Windows. Please read the links I cited: "GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters)." And please read the multiple different workarounds.Dermatology
Failing all else, you can always just do everything under Linux :)Dermatology
Could it be that git rewrites end-of-line characters (which messes utf-16 up) before giving that file to iconv?Manducate
M
4

Use only diff, it should work:

*.rc diff=utf16

The text and eol attributes cause git to convert line endings before passing the data to iconv, after which the data is no longer valid UTF-16, as noted in the comments.
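
To illustrate the failure mode, here is a minimal Python sketch. It is not git's actual code: it assumes the line-ending conversion is a plain byte-level LF to CRLF substitution, and the helper name `naive_lf_to_crlf` is made up for this example.

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Byte-level LF -> CRLF substitution that knows nothing about UTF-16."""
    return data.replace(b"\n", b"\r\n")


original = "a\nb".encode("utf-16-le")   # 61 00 0a 00 62 00 (6 bytes)
mangled = naive_lf_to_crlf(original)    # 61 00 0d 0a 00 62 00 (7 bytes)

try:
    mangled.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    # The inserted 0x0d shifts every following byte by one position,
    # so the two-byte code units no longer line up.
    decoded_ok = False

print(len(original), len(mangled), decoded_ok)
```

The single extra byte leaves an odd-length stream, which is the same kind of problem iconv reports as an "incomplete character or shift sequence".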

Maxillary answered 7/6, 2016 at 9:35 Comment(1)
Thank you for both explaining the cause of the problem and providing a solution to continue using the attempted approach (but correctly). Too bad this can't work with interactive/patch git add, but there's no way that it could since there is no guaranteed one-to-one mapping between a filtered view and the source materials.Jostle
V
3

Git 2.21 (Feb. 2019) adds a new encoding UTF-16LE-BOM: invented to force encoding to UTF-16 with BOM in little endian byte order, which cannot be directly generated by using iconv.

See commit aab2a1a (30 Jan 2019) by Torsten Bögershausen (tboegi).
(Merged by Junio C Hamano -- gitster -- in commit 0fa3cc7, 07 Feb 2019)

Support working-tree-encoding "UTF-16LE-BOM"

Users who want UTF-16 files in the working tree set the .gitattributes like this:

test.txt working-tree-encoding=UTF-16

The unicode standard itself defines 3 allowed ways how to encode UTF-16. The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t
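
The same three byte sequences can be checked without iconv. The following is a rough Python sketch, not part of the commit; `decode_utf16` is a hypothetical helper that uses the BOM when present and otherwise assumes big endian, matching the iconv output shown above. The fallback is written out explicitly because Python's own "utf-16" codec uses the platform's native byte order when there is no BOM.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using the BOM if present, else big endian."""
    if data.startswith(codecs.BOM_UTF16_LE):      # ff fe
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):      # fe ff
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")               # no BOM: assume big endian


variants = {
    "a": b"\x00g\x00i\x00t",            # no BOM, big endian
    "b": b"\xff\xfeg\x00i\x00t\x00",    # BOM, little endian
    "c": b"\xfe\xff\x00g\x00i\x00t",    # BOM, big endian
}
results = {name: decode_utf16(raw) for name, raw in variants.items()}
print(results)
```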

Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16", in the version (c) above.
This is what iconv generates, more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
0000000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
0000000    g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
0000000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree, but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants, but in practice, we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version with a BOM. (big endian is probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianness by using "UTF-16LE" as encoding, and that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle the encoding, so Git picks it up, handles the BOM and uses libiconv to convert the rest of the stream. (UTF-16BE-BOM is added for consistency.)
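
The idea of handling the BOM separately and converting the rest as plain UTF-16LE can be sketched in Python as follows. This is only an illustration under my own assumptions, not Git's implementation: both function names are hypothetical, and requiring the BOM when reading is a simplification chosen for this example.

```python
import codecs


def encode_utf16le_bom(text: str) -> bytes:
    """Write the little-endian BOM by hand, then encode the rest as UTF-16LE."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decode_utf16le_bom(data: bytes) -> str:
    """Require and strip the little-endian BOM, then decode the rest as UTF-16LE."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("missing UTF-16LE BOM (ff fe)")
    return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")


encoded = encode_utf16le_bom("git")
print(encoded.hex(" "))             # ff fe 67 00 69 00 74 00
print(decode_utf16le_bom(encoded))  # git
```

The encoded bytes correspond to version (b) above: a little-endian BOM followed by little-endian code units.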

Vershen answered 3/3, 2019 at 20:50 Comment(0)
K
1

Git has recently begun to understand encodings, i.e. in effect iconv is now to some extent built in. See the gitattributes docs and search for working-tree-encoding.

[Make sure your man page matches since this is quite new!]

If, say, the file is UTF-16 without a BOM on a Windows machine, then add this to your gitattributes file:

some-utf-16-file.rc text working-tree-encoding=UTF-16LE eol=CRLF

If it is UTF-16 little endian (with BOM) on *nix, make it:

some-utf-16-file.rc text working-tree-encoding=UTF-16 eol=LF
Kenspeckle answered 14/2, 2019 at 4:54 Comment(1)
Thanks for this update! I’m guessing this will work with interactive patch add, so it is very welcome!Jostle
