UTF32 and C# problems

So I've got some trouble with character encoding. When I put the following two characters into a UTF-32 encoded text file:

𩸕
鸕

and then run this code on them:

System.IO.StreamReader streamReader = 
    new System.IO.StreamReader("input", System.Text.Encoding.UTF32, false);
System.IO.StreamWriter streamWriter = 
    new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32);
    
streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

I get:

鸕
鸕

(the same character twice, i.e. the input file != the output)

A few things that might help: Hex for the first character:

15 9E 02 00

And for the second:

15 9E 00 00

I am using gedit to create the text file, Mono for the C#, and I'm on Ubuntu.

It also doesn't matter whether I specify the encoding for the input or output file; it just doesn't like it when the encoding is UTF-32. It works if the input file is in UTF-8 encoding.

The input file is as follows:

FF FE 00 00 15 9E 02 00 0A 00 00 00 15 9E 00 00 0A 00 00 00
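For what it's worth, here is a quick sketch (using the same "input" file name as above) that decodes those raw bytes directly as little-endian UTF-32; note that GetString does not strip the BOM, so the first char comes back as U+FEFF:

// Sanity check: decode the raw bytes as little-endian UTF-32 and dump the UTF-16 code units.
byte[] raw = System.IO.File.ReadAllBytes("input");
string decoded = System.Text.Encoding.UTF32.GetString(raw);
foreach (char c in decoded)
    System.Console.WriteLine("U+{0:X4}", (int)c);   // U+29E15 shows up as the surrogate pair D867 DE15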

Is it a bug, or is it just me?

Thanks!

Newel asked 3/4, 2012 at 5:44 Comment(11)
Encoding of output file?Inclose
Print out the result of streamReader.ReadToEnd().Ego
@Inclose - Changing it doesn't helpNewel
@Ego - It sure looks like the problem is in the reading: "鸕\n鸕"Newel
What have you done by way of debugging? For instance, try putting the result of streamReader.ReadToEnd() into a string, and then check that. It should be the UTF-16 encoded version of the input.Martens
See 4th comment, that's exactly what I did. The problem is in the reading. If the file is saved in UTF8, and there is no encoding specified, the file is read and written correctlyNewel
How do you mean you get "鸕鸕"? Where are you reading this output?Lists
@Chibueze Opata - I'm reading it using the debugger, by assigning a variable to the value of streamReader.ReadToEnd().Newel
That means you're not reading the file correctly. The input is not in the UTF-32 encoding you specified; try detecting the encoding automatically instead. See my answer below.Lists
@Newel if you use a hex editor to look at the input file, what values does it contain? (Just the first 16 will do.) It could be that the file is UTF-32 (LE) after all, but the StreamReader constructor mistakes the first two bytes of the BOM for UTF-16 (LE). That would be a horrible bug.Martens
@Mr Lister I have edited the question with the input file and some new, clearer code that directly specifies that the input is in UTF32, overriding whatever the preamble says. I find it strange that gedit will open input and save it, no problems, but my small annoying code just won't...Newel

OK, so I think I've figured it out; it seems to work now. It turns out that the first character (bytes 15 9E 02 00, i.e. U+29E15) lies outside the Basic Multilingual Plane, so there's no way it can be held in one single UTF-16 char, and .NET strings are UTF-16 internally. Instead, UTF-16 stores it as a surrogate pair: two char values that together act as one 'element'. To iterate over elements, we can use:

StringInfo.GetTextElementEnumerator(string str);

which returns a TextElementEnumerator; each text element it yields is a string containing the whole surrogate pair, so it can be treated as one character.
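A minimal sketch of what that looks like (assuming the two characters from the question):

using System;
using System.Globalization;

class Program
{
    static void Main()
    {
        string text = "\U00029E15\n\u9E15";   // 𩸕 (needs a surrogate pair) and 鸕

        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            string element = (string)e.Current;   // one "element", possibly two chars long
            Console.WriteLine("{0} ({1} UTF-16 code unit(s))", element, element.Length);
        }
    }
}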

See here:

http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx

http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.gettextelement.aspx

Hope it helps someone :D

Newel answered 9/4, 2012 at 2:3 Comment(0)

I tried this and it works well on my PC.

System.IO.StreamReader streamReader = new System.IO.StreamReader("input", true);
System.IO.StreamWriter streamWriter = new System.IO.StreamWriter("output", false);

streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

Maybe the text you think is in UTF32 is not.

Lists answered 3/4, 2012 at 7:11 Comment(3)
Are you using Visual Studio/Windows? It might just be mono if not. I'll try other programs to make sure it is indeed UTF32, it certainly looks like it in a hex editor...Newel
Ok, good luck. But your code produced a wrong output as well on my PC.Lists
Oh, sorry I didn't notice the change in your code. In other news, using visual studio 2012 beta resulted in the correct output with my code...Newel

When writing, you're not specifying UTF-32, so it defaults to Encoding.UTF8.

From MSDN:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), so its GetPreamble method returns an empty byte array. To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).
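For example, a minimal sketch of the overload the quote refers to (using the file name from the question; the sample text is just a placeholder for whatever was read from the input):

string text = "\U00029E15\n\u9E15";   // placeholder for the string read from the input file
// Passing the encoding explicitly makes the writer emit a UTF-32 BOM and UTF-32 bytes.
System.IO.StreamWriter writer = new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32);
writer.Write(text);
writer.Close();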

Deforest answered 3/4, 2012 at 6:1 Comment(1)
That doesn't seem to be the problem. I've updated the question to help remove any confusion. Thanks anyway though!Newel

I think you need to specify the same encoding (Encoding.UTF32) for your StreamWriter as well.

EDIT:

Normally this is not needed when converting between Unicode encodings, but I would also try this:

using System.Text;

Encoding utf8 = Encoding.UTF8;
Encoding utf32 = Encoding.UTF32;
byte[] utf8Bytes = utf8.GetBytes(yourText);                     // yourText: the string read from the input file
byte[] utf32Bytes = Encoding.Convert(utf8, utf32, utf8Bytes);   // re-encode those bytes as UTF-32
string utf32Text = utf32.GetString(utf32Bytes);
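If that works, the converted bytes could then be written straight to the output file together with a UTF-32 BOM, roughly like this (a sketch that reuses utf32 and utf32Bytes from above, with the file name from the question):

byte[] bom = utf32.GetPreamble();                       // FF FE 00 00 for little-endian UTF-32
byte[] data = new byte[bom.Length + utf32Bytes.Length];
bom.CopyTo(data, 0);
utf32Bytes.CopyTo(data, bom.Length);
System.IO.File.WriteAllBytes("output", data);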
Bermuda answered 3/4, 2012 at 6:6 Comment(3)
I have :D, I just edited the question. Also it wouldn't really matter anyway, since any UTF-32 character can be expressed in UTF-8 or any Unicode encoding for that matter. AFAIK, anyway.Newel
@Newel I just read your updated answer and your comments. If you know what encoding is the read file and it is other than UTF32 then you have to read it in its original encoding and convert it to the own you want before writing it.Bermuda
Thanks for your help again. I tried your suggestion, but I couldn't get it working D:. Also, I thought the entire purpose of StreamReaders and StreamWriters was to convert between encodings. Maybe not then.Newel

From the Remarks section of MSDN for StreamReader's constructor:

This constructor initializes the encoding as specified by the encoding parameter, and the internal buffer size to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

Very likely the byte order mark at the beginning of your file is being interpreted as something else (UTF-16, say), and so it's not using your explicitly stated UTF-32 encoding.
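One way to check which encoding the reader actually settled on is to look at its CurrentEncoding property after the first read, for example:

using (System.IO.StreamReader reader = new System.IO.StreamReader("input", System.Text.Encoding.UTF32, true))
{
    string text = reader.ReadToEnd();
    // CurrentEncoding is only meaningful after the first read, once BOM detection has run.
    System.Console.WriteLine(reader.CurrentEncoding.EncodingName);
}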

Teresitateressa answered 3/4, 2012 at 7:16 Comment(3)
Sure why not, I'll try using some other programs to ensure I'm getting the correct BOM.Newel
@Newel it looks like there's a constructor overload with a boolean parameter that makes it not look at the BOM; you could try that if you don't have another program on hand to check.Teresitateressa
Right, I would have thought that specifying the encoding would have ensured it was used; obviously not. I did, however, try using Windows for this and it worked. But I was not able to verify its UTF-32 output, since I don't have any Windows programs that play well with UTF-32, so I swapped it to output UTF-8.Newel
