Using .NET, how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried using a StreamReader with ASCIIEncoding, then converting the ASCII string to UTF-8 by creating an ASCII Encoding and a UTF-8 Encoding and calling Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.
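
Roughly, what I am doing looks like this (a simplified sketch, not my exact code; fileName is a placeholder):

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName, Encoding.ASCII))
{
    // the accented characters already come back as '?' at this point
    string asciiString = reader.ReadToEnd();

    Encoding ascii = Encoding.ASCII;
    Encoding utf8 = Encoding.UTF8;
    byte[] converted = Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString));
}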

What step am I missing?

Pry answered 7/4, 2010 at 19:50 Comment(4)
Have you tried using a StreamWriter with UTF8 encoding to write the asciiString out to a text file? Does that do it?Marlin
@Task: His issue is that he's never getting the string out of 8859-1, not that he can't save it in UTF-8.Bohr
Oh, that's completely his problem, no question. I just find it easier to debug text conversion with a StreamReader/StreamWriter pair (so I can see the in/out files) rather than with an Encoding.Convert call. That might be just me.Marlin
@Task: I agree (hence my answer!) ;)Bohr

You need to get the proper Encoding object. ASCII is exactly what its name says: it only supports 7-bit ASCII characters, so any byte outside that range (such as the Latin-1 accented characters) is decoded as '?'. If all you want to do is convert files, the StreamReader/StreamWriter approach below is likely easier than dealing with the byte arrays directly.

using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}

However, if you want to work with the byte arrays yourself, it's easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
    Encoding.UTF8, data);

It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.

In the interest of fully exploring the issue, something like this would work:

using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open, 
                                    System.IO.FileAccess.Read))
{
    // Read the raw bytes of the file, untouched by any decoder.
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

    // FileStream.Read may return fewer bytes than requested, so loop until the buffer is full.
    while (readLength < buffer.Length) 
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);

    // Decode the bytes as ISO 8859-1 and re-encode them as UTF-8.
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
                       Encoding.UTF8, buffer);

    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create, 
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}

In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.

Bohr answered 7/4, 2010 at 19:59 Comment(1)
thanks to all for the help and esp @Adam for his thorough answerPry

If the files are relatively small (say, ~10 megabytes), you'll only need two lines of code:

  string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
  System.IO.File.WriteAllText(outPath, txt);
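
For the record, File.WriteAllText without an encoding argument writes UTF-8 (without a byte-order mark). If you'd rather make the target encoding explicit, you can pass it as a third argument; note that Encoding.UTF8 emits a BOM, while new UTF8Encoding(false) does not:

  System.IO.File.WriteAllText(outPath, txt, new UTF8Encoding(false));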
Satanic answered 7/4, 2010 at 20:31 Comment(2)
Why does your solution only work when the file being read is less than 10 megabytes?Inferior
@Inferior ReadAllText uses a StreamReader, which has a default buffer size of 1024; you might want to tweak that for larger files: web.archive.org/web/20230801072915/https://github.com/microsoft/…Topminnow
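
For larger files, a streaming variant (just a sketch; the 64 KB buffer size is an arbitrary choice) avoids reading the whole file into memory at once:

using (var reader = new System.IO.StreamReader(inpPath,
                    Encoding.GetEncoding("iso-8859-1"), false, 64 * 1024))
using (var writer = new System.IO.StreamWriter(outPath, false,
                    new UTF8Encoding(false), 64 * 1024))
{
    char[] buffer = new char[64 * 1024];
    int read;

    // Decode Latin-1 and re-encode as UTF-8 in chunks rather than all at once.
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        writer.Write(buffer, 0, read);
}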
