C String encoding Windows/Linux

If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:

Windows: strlen("ö") = 1
Linux:   strlen("ö") = 2

On the Windows machine the string is apparently encoded in an "extended" ASCII code page as the single byte 0xF6, whereas on the Linux machine it is encoded in UTF-8 as 0xC3 0xB6, which gives a length of 2 bytes (strlen counts bytes, not characters).

Question:

Why does a C string get encoded differently on a Windows and a Linux machine?


The question came up in a discussion I had with a fellow forum member on Code Review (see this thread).

Afterward answered 24/12, 2016 at 1:39 Comment(7)
Are both source files using the same encoding and BOM setting? - Triphibious
Looks like the encoding might be picked up from the local settings. And it seems you can set it too? gcc.gnu.org/onlinedocs/cpp/Invocation.html (-fwide-exec-charset=charset) - Sundowner
Because on Windows CP-1252 is the default, and there are some problems with UTF-8 when Microsoft wants to keep backward compatibility. See this one on SU: Windows 7 UTF-8 and Unicode - Albany
Can you provide a reference where all C libraries have to use the same encoding for non-ASCII characters? Why is French different than English? And there is no "extended ASCII", but a zoo of mostly incompatible character encodings which only have the first 128 codes in common. - Nutriment
Since these are character literals in the source code, the number of char units needed depends on the encoding of the source file. (And you always have to tell the compiler what the source encoding is, or use what it deems the default.) So it's not up to the system, it's up to you and your source code editor. - Spelter
Your assumption that the character is encoded as F6 in all encodings on Windows is incorrect. This page lists many for which it is not true, some of which are used in Windows (IBM437 in particular). - Spelter
@TomBlodget thanks, that was it! I am using Eclipse; in Preferences > General > Workspace it was set to Default: CP1252 on Windows and to UTF-8 on Linux. Maybe you want to reply with an answer or edit chux's answer (if that's permitted). - Afterward

Why does a C string get encoded differently on a Windows and a Linux machine?

First, this is not a Windows/Linux (operating system) issue but a compiler issue: compilers exist on Windows that encode like gcc (common on Linux).

C allows this, and the two compiler makers have chosen different implementations per their own programming goals, MS using CP-1252 and Linux using Unicode (see @Danh's comment). MS's selection pre-dates Unicode. It is not surprising that various compiler makers employ different solutions.

5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined. C11dr §5.2.1 1 (My emphasis)

strlen("ö") = 1
strlen("ö") = 2

"ö" is not a member of the basic character set, so it is encoded per the extended characters of the compiler's source character set.

I suspect MS is focused on maintaining their code base and encourages other languages. Linux is simply an earlier adopter of Unicode in C, even though MS has been an early Unicode influencer.

As Unicode support grows, I expect that to be the solution of the future.

Posehn answered 24/12, 2016 at 3:36 Comment(3)
C and UTF-8, the nightmare. I hope that C will add UTF-8 support in the future. If C wants to continue to live, it must change. julialang.org/utf8proc is not easy to use. - Outmost
@Outmost Agreed about the nightmare. The issue is not so much C adopting UTF-8 support (that's relatively easy, and it has existed since C11; see 6.4.5 String literals, like u8"Hellö"), but maintaining/deprecating, side by side, the prior extended character approaches that are falling by the wayside. After all, C still has digraphs/trigraphs: a legacy solution to language-related issues. It will take decades. - Posehn
Thanks for the answer! I'm using gcc on both systems, version 4.8.1 on Windows and 4.8.4 on Linux, with the same options (-O0 -g3 -Wall -c -fmessage-length=0). I will play a bit with the options as suggested by Sush. - Afterward
