What encoding are filenames in NTFS stored as?
I'm just getting started on some programming to handle filenames with non-English names on a WinXP system. I've done some recommended reading on Unicode and I think I get the basic idea, but some parts are still not very clear to me.

Specifically, what encoding (UTF-8, UTF-16LE/BE) are the file names (not the content, but the actual name of the file) stored in on NTFS? Is it possible to open any file using fopen(), which takes a char*, or do I have no choice but to use _wfopen(), which takes a wchar_t* and presumably expects a UTF-16 string?

I tried manually feeding a UTF-8-encoded string to fopen(), e.g.

unsigned char filename[] = {0xEA, 0xB0, 0x80, 0x2E, 0x74, 0x78, 0x74, 0x0}; // 가.txt

FILE* f = fopen((char*)filename, "wb+");

but this came out as 'ê°€.txt'.

I was under the impression (which may be wrong) that a UTF-8-encoded string would suffice for opening any filename under Windows, because I seem to vaguely remember some Windows application passing around (char*), not (wchar_t*), and having no problems.

Can anyone shed some light on this?

Mettlesome answered 12/1, 2010 at 17:33 Comment(2)
PHP's behavior has changed from PHP 7.1 on, see https://mcmap.net/q/162350/-how-do-i-use-filesystem-functions-in-php-using-utf-8-strings - Emirate
"I was under the impression (which may be wrong) that a UTF8-encoded string would suffice in opening any filename under Windows" - Windows DOES NOT support UTF-8 encoded filenames, only UTF-16 and ANSI (which gets converted to UTF-16 internally). UTF-8 filenames that contain only ASCII characters will work as ANSI strings, though. "I seem to vaguely remember some Windows application passing around (char*), not (wchar_t*), and having no problems" - char* does not imply UTF-8, but can be used for it. No standard Win32 or C/C++ file APIs accept UTF-8 as input, but 3rd party libraries may. - Rotary
S
42

NTFS stores filenames in UTF-16; fopen, however, interprets its char* argument in the ANSI code page (not UTF-8).

To use a UTF-16-encoded file name, you need the Unicode (wide-character) versions of the file-open calls. Define UNICODE and _UNICODE in your project so that CreateFile maps to CreateFileW, or call _wfopen, which takes wchar_t* strings regardless of those defines.
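
As a minimal sketch of the wide-character route, the following creates the file from the question (가.txt, U+AC00) through _wfopen. The helper name create_ga_file is hypothetical, and the non-Windows branch exists only so the sketch builds on other systems.

```c
#include <stdio.h>
#ifdef _WIN32
#include <wchar.h>   /* MSVC declares _wfopen in <stdio.h> and <wchar.h> */
#endif

/* Hypothetical helper: create the question's file, U+AC00 followed by ".txt". */
FILE *create_ga_file(void)
{
#ifdef _WIN32
    /* Wide-character call: the name reaches the file system as UTF-16,
       independent of the current ANSI code page. */
    return _wfopen(L"\uAC00.txt", L"wb+");
#else
    /* Fallback so the sketch also builds elsewhere: POSIX file names are
       plain byte strings, so the UTF-8 bytes are passed through as-is. */
    return fopen("\xEA\xB0\x80.txt", "wb+");
#endif
}
```

The returned FILE* is used exactly like one obtained from fopen; only the way the name is passed differs.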

Solomonsolon answered 12/1, 2010 at 17:38 Comment(4)
If changing the project to build with UNICODE defined is too large of a change, you can call _wfopen() or CreateFileW() in a non-Unicode build. - Luna
Given that Windows NT and NTFS are older than the UTF-16 standard, is it possible that the older UCS-2 is used instead? - Carine
NTFS allows any sequence of 16-bit values for name encoding except 0x0000. This means UTF-16 code units are supported, but the file system does not check whether a sequence is valid UTF-16. [source] - Clevis
@Carine Win32 Unicode functions use wchar_t strings. NT and NTFS may predate UTF-16, but wchar_t can be used for both UCS-2 and UTF-16 on Windows, and Microsoft migrated away from UCS-2 to use UTF-16 in Win2K onwards. - Rotary
H
15

By default, fopen() in MSVC on Windows does not take a UTF-8-encoded char*.

Unfortunately, UTF-8 was invented rather recently in the great scheme of things. Windows APIs are divided into Unicode and ANSI versions. Every Windows API that takes or deals with strings is actually available with a W or A suffix: W for "Wide" character/Unicode and A for ANSI. Macro magic hides all this from the developer, so you just call CreateFile with either a char* or a wchar_t*, depending on your build configuration, without knowing the difference.

The 'ANSI' encoding is actually not one specific encoding: it means that the encoding used for char strings is determined by the locale setting of the PC.
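
To illustrate, here is a small sketch of why the UTF-8 bytes from the question (EA B0 80) showed up as 'ê°€'. It assumes the asker's ANSI code page was Windows-1252, which is consistent with that output but is not stated in the question. The function name cp1252_to_codepoint is hypothetical.

```c
/* Windows-1252 bytes 0x80..0x9F, the only range that differs from Latin-1.
   The five bytes the code page leaves undefined (0x81, 0x8D, 0x8F, 0x90,
   0x9D) are mapped to the same-valued control code here, which is an
   assumption of this sketch. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Interpret ONE byte as a Windows-1252 character and return its Unicode
   code point. Every byte becomes its own character, so a three-byte UTF-8
   sequence turns into three unrelated characters. */
unsigned int cp1252_to_codepoint(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return byte;  /* ASCII and 0xA0..0xFF match Unicode directly */
}
```

Under this mapping 0xEA becomes U+00EA (ê), 0xB0 becomes U+00B0 (°) and 0x80 becomes U+20AC (€), which is the file name the question reports.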

Now, because C runtime functions such as fopen need to work by default without any special knowledge on the developer's part, on Windows they expect their strings in the system's local (ANSI) encoding. MSDN indicates that the Microsoft C runtime function setlocale can change the locale of the current thread, but specifically says that it will fail for any locale that needs more than 2 bytes per character, such as UTF-8.

So, on Windows there is no shortcut. You need to use _wfopen, or the native API CreateFileW (or create your project using the Unicode build settings and just call CreateFile), with wchar_t* strings.

Huldahuldah answered 12/1, 2010 at 18:20 Comment(2)
Actually, there is a shortcut: you can convert the UTF-8 string to Unicode, create an ASCII-only "short pathname" using GetShortPathNameW, and pass that to fopen. This is the only way to pass non-ASCII filenames to legacy libraries (or those written in portable C) that just use fopen to open files. - Adeleadelheid
"every windows api that takes or deals with strings is actually available with a W or A suffix - W for "Wide" character/Unicode and A for Ansi" - MOST functions, but not EVERY function. Functions that have existed for a long while, especially going back to the early days when Windows was ANSI-based, certainly do. But new functions introduced in recent years, and going forward, tend to only have Wide versions, and don't have the W suffix. Microsoft wants to phase out ANSI. - Rotary
A
8

As answered by others, the best way to handle UTF-8-encoded strings is to convert them to UTF-16 and use native Unicode APIs such as _wfopen or CreateFileW.
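
A minimal sketch of that conversion, assuming the caller holds the path as a NUL-terminated UTF-8 string. The wrapper name fopen_utf8 is hypothetical; MultiByteToWideChar and _wfopen are the real calls. The non-Windows branch is only there so the sketch builds on other systems.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/* Hypothetical wrapper: open a file whose name is given as UTF-8. */
FILE *fopen_utf8(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    wchar_t wmode[16];
    wchar_t *wpath;
    FILE *fp;
    int len;

    /* First call: ask how many wchar_t units are needed, NUL included. */
    len = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, NULL, 0);
    if (len == 0)
        return NULL;
    wpath = (wchar_t *)malloc((size_t)len * sizeof(wchar_t));
    if (wpath == NULL)
        return NULL;

    /* Second call: perform the UTF-8 to UTF-16 conversion. The mode
       string has to be converted too, since _wfopen takes it as wchar_t*. */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, len) == 0 ||
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) == 0) {
        free(wpath);
        return NULL;
    }
    fp = _wfopen(wpath, wmode);
    free(wpath);
    return fp;
#else
    /* Elsewhere, file names are byte strings and UTF-8 passes through. */
    return fopen(utf8_path, mode);
#endif
}
```

With such a wrapper, the question's byte array could be passed as fopen_utf8((const char *)filename, "wb+").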

However, this approach won't help when calling into libraries that use fopen() unconditionally because they do not support Unicode or because they are written in portable C. In that case it is still possible to make use of the legacy "short paths" to convert a UTF-8-encoded string into an ASCII form usable with fopen, but it requires some legwork:

  1. Convert the UTF-8 representation to UTF-16 using MultiByteToWideChar.

  2. Use GetShortPathNameW to obtain a "short path", which is ASCII-only. GetShortPathNameW returns it as a wide string with all-ASCII content, which you then convert to a narrow string with a trivial, lossless copy that casts each wchar_t to char.

  3. Pass the short path to fopen() or to the code that will eventually use fopen(). Be aware that error messages printed by that code, if any, will refer to the unsightly "short path" (e.g. KINTO~1 instead of kinto-un-筋斗雲).

While this is not exactly a recommended long-term strategy, as Windows short paths are a legacy feature that can be turned off per-volume, it is likely the only way to pass file names to code that uses fopen() and other file-related API calls (stat, access, ANSI versions of CreateFile and similar).
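
The three steps above can be sketched as follows. The helper names narrow_ascii and utf8_to_short_path are hypothetical, and the sketch assumes the file already exists and that the path fits in MAX_PATH characters. The narrowing helper rejects any non-ASCII character instead of assuming the short path is clean.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Copy an all-ASCII wide string into a narrow buffer.
   Returns 0 on success, -1 if a character is not ASCII or dst is too small. */
int narrow_ascii(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t i;

    if (dst_size == 0)
        return -1;
    for (i = 0; src[i] != L'\0'; i++) {
        /* Keep room for the terminating NUL and refuse non-ASCII input. */
        if (i + 1 >= dst_size || (unsigned long)src[i] > 0x7F)
            return -1;
        dst[i] = (char)src[i];
    }
    dst[i] = '\0';
    return 0;
}

#ifdef _WIN32
/* Steps 1 and 2: UTF-8 -> UTF-16 -> short path -> narrow ASCII string. */
int utf8_to_short_path(const char *utf8_path, char *out, size_t out_size)
{
    wchar_t wide[MAX_PATH];
    wchar_t shortw[MAX_PATH];
    DWORD n;

    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wide, MAX_PATH) == 0)
        return -1;
    /* Returns 0 on failure, or the required size if the buffer is too small. */
    n = GetShortPathNameW(wide, shortw, MAX_PATH);
    if (n == 0 || n >= MAX_PATH)
        return -1;
    return narrow_ascii(shortw, out, out_size);
}
#endif
```

The resulting string in out is what step 3 hands to fopen() or to the library that calls it.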

Adeleadelheid answered 7/11, 2014 at 14:22 Comment(5)
Gorgeous, you saved us, THANK YOU!! - Schoolbook
"to handle UTF-8-encoded strings ... convert them to Unicode": UTF-8 (and UTF-16) are Unicode encodings. I guess you meant "convert to UTF-16". - Presidio
@Presidio Yes, I meant Unicode as defined by Windows. Point #1 makes it clear that the UTF-16 encoding is needed. I've now amended the answer to refer to UTF-16 from the start. - Adeleadelheid
The short-path solution only works for reading files, not for writing, right? - Eared
@Eared This strategy can be adapted for writing as well. Just create an empty file with the desired name using open(name, 'w').close() and then proceed with the recipe. - Adeleadelheid
