git, msysgit, accents, utf-8, the definitive answers

I've read in some places that there are problems with git (or just msysgit?) and character encoding - I believe it's only a problem in file names.

What I'd like is some 'definitive' (or at least authoritative) information about:

  1. What exactly are the 'problems'? (The symptoms)
  2. What are the causes? (Briefly)
  3. In what scenarios is this a show stopper?
  4. Is there any resolution in sight, or failing that any workarounds?

I hope this question isn't too vague, I think it would be good to have all of this information in one place to be able to point people to it...

Aron answered 2/5, 2011 at 8:24 Comment(1)
UTF-8 is coming for msysgit. See my updated answer. - Extirpate

Update Oct. 2023: with Git 2.43 (Q4 2023), the display width table for Unicode characters has been updated for Unicode 15.1.

See commit 872976c (25 Sep 2023) by Beat Bolli (bbolli).
(Merged by Junio C Hamano -- gitster -- in commit 64b2419, 04 Oct 2023)

unicode: update the width tables to Unicode 15.1

Signed-off-by: Beat Bolli

Unicode 15.1 has been announced on 2023-09-12, so update the character width tables to the new version.


Update Apr. 2023: with Git 2.41 (Q2 2023), the Unicode character width table (used for output alignment) has been updated.

See commit b10cbda (30 Mar 2023) by Beat Bolli (bbolli).
(Merged by Junio C Hamano -- gitster -- in commit 5ae4bd1, 31 Mar 2023)

unicode: update the width tables to Unicode 15

Signed-off-by: Beat Bolli

Unicode version 15 was released in September 2022, and we have so far neglected to update our width tables.
Do this now.


Update Oct. 2021: With Git 2.34 (Q4 2021), the Unicode character width table (used for output alignment) has been updated.

See commit 187fc8b (17 Sep 2021) by Carlo Marcelo Arenas Belón (carenas).
(Merged by Junio C Hamano -- gitster -- in commit 3d875f9, 28 Sep 2021)

unicode: update the width tables to Unicode 14

Update Feb. 2017 (Git 2.12): The character width table has been updated to match Unicode 9.0.
The update_unicode.sh script has been moved into contrib/update-unicode; see its README.

Update August 2014 (git 2.1): commit a67c821 (Torsten Bögershausen (tboegi)) adds support for Unicode 7.0.

Update April 2014 (git 1.9.2): commit d813ab9 (Torsten Bögershausen (tboegi)) adds support for Unicode 6.3:

Unicode 6.3 defines more code points as combining or accents.
For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIARESIS (aka umlaut, double-dot-above).
We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them.

Affected codepoints are:

U+0358..U+035C
U+0487
U+05A2, U+05BA, U+05C5, U+05C7
U+0604, U+0616..U+061A, U+0659..U+065F

Earlier unicode standards had defined these as "reserved".

Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed.
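
The zero-width idea in that commit message can be illustrated with a minimal Python sketch. This is not Git's git_wcwidth(), which relies on width tables generated from the Unicode data files; the helper name display_width is hypothetical, and it only treats characters with a non-zero combining class as zero-width, ignoring wide and control characters.

```python
import unicodedata

def display_width(text: str) -> int:
    """Rough column count: combining marks count as 0, everything else as 1.

    Simplification (assumption of this sketch): wide East Asian characters
    and control characters are not handled.
    """
    return sum(0 if unicodedata.combining(ch) else 1 for ch in text)

precomposed = "\u00f6"   # "ö" as a single code point
decomposed = "o\u0308"   # "o" followed by U+0308, the combining mark

# The two spellings differ in code point count...
print(len(precomposed), len(decomposed))
# ...but should occupy the same number of display columns.
print(display_width(precomposed), display_width(decomposed))
# NFC normalization folds the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Counting code points instead of columns would misalign any output that contains the decomposed form, which is the problem the width tables address.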


Update April 2012: Unicode support is released in version 1.7.10. See this page for notes and the settings you should apply.

Namely:

git config [--global] core.quotepath off
git config [--global] i18n.logoutputencoding utf8
git config [--global] i18n.commitencoding utf8
git config [--global] --unset svn.pathnameencoding
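
To see why core.quotepath matters: with the default setting, Git prints path bytes above 0x7F as octal escapes inside double quotes, so accented names look garbled in command output even when they are stored correctly. The following is a simplified Python sketch of that display rule, not Git's actual code; the function name quote_path is hypothetical, and real Git also escapes quotes, backslashes and control characters.

```python
def quote_path(path: str) -> str:
    """Approximate how a path is shown while core.quotepath is on.

    Assumption of this sketch: only non-ASCII bytes are escaped.
    """
    raw = path.encode("utf-8")
    if all(b < 0x80 for b in raw):
        return path  # plain ASCII names are printed unchanged
    # Each non-ASCII byte becomes a backslash followed by three octal digits.
    body = "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in raw)
    return '"' + body + '"'

print(quote_path("readme.txt"))
print(quote_path("r\u00e9sum\u00e9.txt"))
```

Setting core.quotepath to off makes Git print such names verbatim instead.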

The recodetree check command scans the entire history of a git repository and prints all non-ASCII file names. If the output is empty, no migration is necessary.


Update February 2012: patches for UTF-8 support are coming in the 'devel' branch of the msysgit repo on GitHub, including "Update less settings for UTF-8".

The Git for Windows Google+ page mentions:

Karsten Blees' UTF-8 patches for Git for Windows has now been merged to 'devel'.
This means the upcoming release will support Unicode filenames!


May 2011

I believe the msysgit issue 80 has the latest on that bug.
Also described in issue 376.

For example:

This is what happens:

  1. git on Windows operates on file names and treats them essentially as byte streams. In your case, the streams happen to be UTF8 encoded text.

  2. git on Windows asks the runtime to create a file, and passes it the byte stream.

  3. Since internally on Windows everything is Unicode, the runtime converts the byte stream to UTF16 using the currently set locale (aka "codepage").
    That is, it effectively interprets the byte stream as CP949 (Korean) encoded text.
    Apparently, some of the UTF8 byte sequences are invalid CP949 sequences, and the conversion fails ("Invalid argument"); or if the UTF8 sequences happen to be correct CP949 sequences, the result is (most likely) a different character.
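
The three steps above can be reproduced in a small Python sketch. It assumes Python's codecs are a reasonable stand-in for the conversion the Windows runtime performs (they may differ in edge cases), and the helper name reinterpret is hypothetical.

```python
from typing import Optional

def reinterpret(name: str, codepage: str) -> Optional[str]:
    """Encode a file name as UTF-8, then decode those bytes as if they
    were written in `codepage`. Returns None when the bytes are invalid
    in that codepage (the "Invalid argument" case)."""
    raw = name.encode("utf-8")
    try:
        return raw.decode(codepage)
    except UnicodeDecodeError:
        return None

name = "\u00e9.txt"  # "é.txt"
for cp in ("utf-8", "cp1252", "cp949"):
    # Only the utf-8 interpretation gives back the original name.
    print(cp, repr(reinterpret(name, cp)))
```

Either outcome is a problem: a failed conversion prevents the file from being created, while a "successful" one silently creates a file with a different name.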

The true fix should be in MinGW, though:

It occurs to me that one solution would be this: solve it at the GCC C run-time library level.
That is, for the mingw GCC run-time library on Windows, make it possible via build-time options to be in a mode where the command-line parameters (passed to main()) and file I/O functions use the underlying Windows Unicode API calls, and translate to/from UTF-8 encoding in C's standard function APIs that use byte-strings.
That would "just work" for git perhaps, and could be useful for other Linux-originated open source projects running the Windows environment.
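
The translation proposed in that quote amounts to converting between UTF-8 byte strings and the UTF-16 strings that the Windows wide-character APIs take. A minimal Python sketch of just the conversion step follows; the function names are hypothetical and no Windows API is actually called here.

```python
def to_windows_wide(name_utf8: bytes) -> bytes:
    """UTF-8 bytes (as Git stores names) -> UTF-16LE bytes, independent
    of whatever codepage is currently active."""
    return name_utf8.decode("utf-8").encode("utf-16-le")

def from_windows_wide(name_utf16: bytes) -> bytes:
    """UTF-16LE bytes from the OS -> UTF-8 bytes for the program."""
    return name_utf16.decode("utf-16-le").encode("utf-8")

stored = "\u00e9.txt".encode("utf-8")
wide = to_windows_wide(stored)
# The round trip is lossless, unlike a detour through a legacy codepage.
print(from_windows_wide(wide) == stored)
```

Because both encodings cover all of Unicode, the conversion never depends on the user's locale.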

ak2 comments that MinGW isn't the right place for this fix:

"MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes.
MinGW, being Minimalist, does not, and never will, attempt to provide a POSIX runtime environment for POSIX application deployment on MS-Windows.
If you want POSIX application deployment on this platform, please consider Cygwin instead."

There is some work in progress on a msysgit variant to support Unicode.

Extirpate answered 2/5, 2011 at 8:52 Comment(10)
So in the meantime, is it XOR(Windows, git) AND accents in file names? - Aron
Note besides: those issues have been solved in Cygwin 1.7 and hence also Cygwin git: it correctly translates between UTF-8 (or any other selected character set) and Windows' UTF-16 filename encoding. - Crossrefer
@ak2: true, but msysgit isn't based on Cygwin... @Benjol: path and filename shouldn't have any special chars for the moment. - Extirpate
@ak2, @VonC, so if I understand correctly, this is not a problem of git+windows, but specifically msysgit+windows? - Aron
@Benjol: yes, and more specifically, MinGW, the unix-layer on which msysgit is based. - Extirpate
@Extirpate No, it's not a MinGW issue. MinGW is just a GNU compiler for Windows, and that means it uses the standard Windows C runtime library (msvcrt.dll). It has no pretensions to be a POSIX compatibility layer like Cygwin. It's up to msysGit itself to do the necessary conversions. - Crossrefer
@ak2: interesting, although that isn't what I understood reading the issue 80. - Extirpate
@VonC: From the horse's mouth: "MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. MinGW, being Minimalist, does not, and never will, attempt to provide a POSIX runtime environment for POSIX application deployment on MS-Windows. If you want POSIX application deployment on this platform, please consider Cygwin instead." - Crossrefer
I still think it would be a great step forward for MinGW to use the Windows Unicode APIs, and translate to/from UTF-8. It would bypass the complexities and limitations of locale, and allow programs to be written simply for Unicode. - Opponent
Shouldn't you simply remove the old updates and leave them only in the post's history, for a cleaner read? - Mizzen
