C# UTF8 Reading/Outputting

I'm trying to do something that I think should be fairly simple, but I've already spent far too much time on it, and none of the approaches I researched have worked.

Basically, I have a huge list of names containing "special" (non-ASCII) characters, encoded in UTF-8.

My end goal is to read in each name, and then make an HTTP request using that name in the URL as a GET variable.

My first goal was to read in one name from a file and write it to standard output to confirm I could read and write UTF-8 properly, before building the strings and making all the HTTP requests.

The test1.txt file I made contained just this:

Öwnägé

I then used this C# code to read in the file. I set the StreamReader encoding and the Console.OutputEncoding to UTF8.

static void Main(string[] args)
{
    Console.OutputEncoding = System.Text.Encoding.UTF8;

    using (StreamReader reader = new StreamReader("test1.txt",System.Text.Encoding.UTF8))
    {
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            Console.WriteLine(line);
        }

    }

    Console.ReadLine();
}

Much to my surprise I get this kind of output:

[screenshot of the console output, not reproduced here]

Expected output is the exact same as the original file contents.

How can I be certain that the strings I am going to build for the HTTP requests will be correct if I cannot even do something as simple as reading and writing UTF-8 strings?

Acreinch answered 6/3, 2012 at 15:13 Comment(0)

Your program is fine (assuming the input file is actually UTF-8). If you debug your program and use the Watch window to look at the strings (the line variable), you will find that it is correct. That is how you can be certain that you will send correct HTTP requests (or whatever else you do with the strings).
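If you would rather not rely on the debugger, a rough alternative is to print the string's code units as hex, which comes out as plain ASCII and is therefore unaffected by the console font. The sketch below also shows one way the name might be placed in a query string. The class name, the example.com URL and the "name" parameter are made up for illustration and are not from the question.

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Returns the UTF-16 code units of a string as space-separated hex.
    public static string ToCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Percent-encodes the name (as UTF-8 bytes) for use in a query string.
    // The base URL and parameter name are hypothetical.
    public static string BuildUrl(string name)
    {
        return "http://example.com/lookup?name=" + Uri.EscapeDataString(name);
    }

    static void Main()
    {
        // "Öwnägé" written with escapes so the source file's encoding does not matter.
        string line = "\u00D6wn\u00E4g\u00E9";
        Console.WriteLine(ToCodeUnits(line)); // 00D6 0077 006E 00E4 0067 00E9
        Console.WriteLine(BuildUrl(line));    // http://example.com/lookup?name=%C3%96wn%C3%A4g%C3%A9
    }
}
```

If the hex dump of the line read from the file matches the code points you expect, the string is correct regardless of what the console displays.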

What you’re seeing is a bug in the Windows console.

Fortunately, it only affects raster fonts. If you change your console window to use a TrueType font, e.g. Consolas or Lucida Console, the problem goes away.

screenshot

You can set this for all future windows by using the “Defaults” menu item:

screenshot
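To check the assumption that the input file is actually UTF-8, one option is to decode its bytes with a strict UTF8Encoding that throws on invalid sequences instead of silently substituting replacement characters. This is only a sketch; the class and method names are made up for illustration, and a file that passes is merely consistent with UTF-8 (pure ASCII passes too).

```csharp
using System;
using System.IO;
using System.Text;

static class StrictUtf8
{
    // True if the bytes decode as UTF-8 without any invalid sequences.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // First argument: do not emit a BOM when encoding (irrelevant for decoding).
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] bytes = File.ReadAllBytes("test1.txt");
        Console.WriteLine(IsValidUtf8(bytes)
            ? "consistent with UTF-8"
            : "not valid UTF-8 (possibly saved as ANSI)");
    }
}
```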

Broaden answered 6/3, 2012 at 15:19 Comment(2)
+1 This is correct. Also be sure that you're saving your sample file as UTF-8 and not ANSI, which is the default in Notepad. - Dewdrop
This, in conjunction with Yuck's suggestion to make sure I selected UTF-8 instead of ANSI when saving the file, worked. Thanks guys, you saved me a lot of headaches, I'm sure! - Acreinch

See Reading unicode from console

If you're using .NET 4 you will need to use

    Console.InputEncoding = Encoding.Unicode;
    Console.OutputEncoding = Encoding.Unicode;

and ensure you're using Lucida Console as the console font.

If you're using .NET 3.5 you're probably out of luck.

To efficiently read lines from a file I would probably use:

foreach(var line in File.ReadAllLines(path, Encoding.UTF8))
{
   // do stuff
}
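Note that File.ReadAllLines loads the whole file into an array before the loop starts. For a very large list, File.ReadLines (available from .NET 4) enumerates the lines lazily instead. A minimal sketch, with a made-up class name and a line-counting body standing in for the real work:

```csharp
using System;
using System.IO;
using System.Text;

static class LazyLines
{
    // Counts non-empty lines, reading the file lazily instead of loading it all at once.
    public static int CountNonEmpty(string path)
    {
        int count = 0;
        foreach (string line in File.ReadLines(path, Encoding.UTF8))
        {
            if (line.Length > 0) count++;
        }
        return count;
    }

    static void Main(string[] args)
    {
        // The default file name is only a placeholder; pass a path on the command line.
        string path = args.Length > 0 ? args[0] : "test1.txt";
        Console.WriteLine(CountNonEmpty(path));
    }
}
```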
Ovary answered 6/3, 2012 at 15:17 Comment(8)
What is the message in the exception? - Ovary
The message is "The parameter is incorrect." It's thrown on the first line, Console.InputEncoding = Encoding.Unicode;. Using .NET 4 as well. - Dewdrop
My Target Framework is .NET Framework 3.5. I tried this anyway and received the same IOException that Yuck saw. - Acreinch
Yes, you will in .NET 3.5. It works fine for me in VS2010, .NET 4 Client Profile. - Ovary
Ensure the target framework in your project properties is .NET 4 Client Profile or above. - Ovary
@Ovary Both .NET 4 and .NET 4 Client Profile result in the same exception with the same message. I can't reproduce this as a solution. - Dewdrop
@Ovary Same: Win 7 SP1, VS 2010, .NET 4. - Dewdrop
OK, well that's odd; it works for me with Encoding.Unicode or Encoding.UTF8. - Ovary

If the file was saved in the system's ANSI code page rather than UTF-8, read it with the Default encoding instead, like this:

new StreamReader(@"E:\database.txt", System.Text.Encoding.Default)
Hetaera answered 2/2, 2013 at 7:50 Comment(0)