C# UNICODE to ANSI conversion

I need your help with something that puzzles me when working with Unicode encodings in the .NET Framework...

I have to interface with some customer data systems which are non-Unicode applications, and those customers are worldwide companies (Chinese, Korean, Russian, ...). So they have to provide me with an ASCII 8-bit file, which will be encoded with their Windows code page.

So, if a Greek customer sends me a text file containing 'Σ' (sigma letter '\u03A3') in a product name, I will get the letter that corresponds to ANSI code 211 in my own code page. My computer runs a French Windows, which means the code page is Windows-1252, so I will see 'Ó' in its place in this text file... Ok.
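To make the byte 211 case concrete, here is a minimal sketch that decodes the same byte under two code pages. The class and method names are hypothetical, and the `RegisterProvider` call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ and can be dropped on the .NET Framework.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Decodes a single byte under the given Windows code page.
    public static string DecodeByte(byte value, int codePage)
    {
        // Assumption: on .NET Core / .NET 5+ the Windows code pages must be
        // registered before GetEncoding can find them. Not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }
}
```

With this helper, `CodePageDemo.DecodeByte(211, 1253)` gives 'Σ' while `CodePageDemo.DecodeByte(211, 1252)` gives 'Ó': the byte is the same, only the interpretation changes.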

I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.

/// <summary>
/// Re-decodes a string that was read with the system ANSI code page (Encoding.Default)
/// using the code page the file was actually written in.
/// </summary>
/// <param name="value">Text as decoded with Encoding.Default.</param>
/// <param name="codePage">Windows code page of the source file, e.g. 1253 for Greek.</param>
/// <returns>The text decoded with the requested code page.</returns>
public static string ToUnicode(string value, int codePage)
{
    if (String.IsNullOrEmpty(value))
    {
        return value;
    }

    // GetEncoding throws for an unsupported code page; it never returns null.
    Encoding sp = Encoding.GetEncoding(codePage);

    // Recover the original bytes by re-encoding with the system ANSI code page.
    byte[] wbytes = Encoding.Default.GetBytes(value);

    // Decode those bytes with the customer's code page.
    // If both code pages are the same, this simply returns the same text.
    return sp.GetString(wbytes);
}
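A shorter route than the ToUnicode helper above, assuming the raw file is still available at import time, is to decode it with the customer's code page directly, so the result no longer depends on the machine's default code page. This is only a sketch: `ImportSketch` and `ReadWithCodePage` are hypothetical names, and the `RegisterProvider` line is only needed on .NET Core / .NET 5+.

```csharp
using System.IO;
using System.Text;

public static class ImportSketch
{
    // Reads a text file, decoding its bytes with the given Windows code page.
    public static string ReadWithCodePage(string path, int codePage)
    {
        // Assumption: required on .NET Core / .NET 5+, unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return File.ReadAllText(path, Encoding.GetEncoding(codePage));
    }
}
```

For the Greek customer this would be called as `ImportSketch.ReadWithCodePage(path, 1253)`.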

Well in the end I got 'Σ' in my application and I am able to save this into my SQL Server database. Now my application has to perform some complex computations, and then I have to give back this file to the customer with an automatic export...

So my problem is that I have to perform a UNICODE => ANSI conversion?! But this is not as simple as I thought at the beginning...

I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252, and then automatically send the file to the customers. They will read the exported text file with their own code page so this idea was interesting for me.

But the problem is that the conversion in this way has a strange behaviour... Here are two different examples:

1st example (я)

char ya = '\u044F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);

string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));

So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. So it seems it is impossible to convert to ANSI if the valid code page is not passed to the Convert() function... So nothing in the Encoding class helps the user to get equivalences between ANSI and Unicode code points? :\

2nd example (Σ)

char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);

string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));

At this time, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' for strSigma1252. As indicated at the beginning, I should have 'Ó' if ANSI code has been found, or '?' if the character has not been found, but not 'S'. Why? Yes of course, a linguist could say that 'S' is equivalent to the greek Sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!

So how can the Convert() function in the .NET framework manage this kind of equivalence?

And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

Aishaaisle answered 10/6, 2013 at 11:54 Comment(7)
You really do need to know the customer's code page before you can convert the text back to the customer's code page. If you don't have that information you won't be able to do it. – Exceedingly
If you look at MSDN concerning the windows-1252 code page, for example (msdn.microsoft.com/en-us/goglobal/cc305145.aspx), at the bottom of the page there is a list of relations between ANSI 1252 codes and Unicode code points... So I thought there was an equivalence when going from Unicode to one or more ANSI code pages? An example is fileformat.info/info/unicode/char/3a3/charset_support.htm, which lists the codes corresponding to sigma for all Windows code pages... – Aishaaisle
It might be better to get your customers to work in UTF-8 or Unicode themselves. Do you control the software they use too? – Avocado
Not at all, that's the problem :P We only provide interfaces between our application and theirs, which are often old "homemade" (and therefore non-Unicode) industrial software... I guess I wouldn't have posted this question if the solution was to migrate customers to UTF-8 industrial applications ^^ I really need to ensure compatibility with their systems by giving back an ASCII 8-bit file... – Aishaaisle
No such thing as 8-bit ASCII. So you HAVE to know which codepage to save to. – Jesu
Hmmm... This is very good news for me if you are right :\ I really thought there were equivalences [0-n] from Unicode to all the different ANSI code pages... And what about Sigma, which is transformed into 'S' in the windows-1252 code page? Does someone have an idea concerning this "implicit" conversion? – Aishaaisle
You should definitely ask all of your customers what code pages they use, to build up the list you'll need, and also ask if they'd be willing to use UTF-8 instead if you think you can get away with it. They can only say no! – Avocado

I should have ...'?' if the character has not been found, but not 'S'. Why?

This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.

You can see the tables for cp1252, including that Sigma mapping, here.

Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
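As a rough sketch of that suggestion, the `Encoding.GetEncoding(int, EncoderFallback, DecoderFallback)` overload lets you pick the fallback explicitly. The class and method names below are hypothetical, and the `RegisterProvider` call is only an assumption for .NET Core / .NET 5+ (not needed on the .NET Framework).

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes text to a code page, writing '?' for characters it cannot represent.
    public static byte[] EncodeWithReplacement(string text, int codePage)
    {
        // Assumption: required on .NET Core / .NET 5+, unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ReplacementFallback,
            DecoderFallback.ReplacementFallback);
        return enc.GetBytes(text);
    }

    // Encodes text to a code page, throwing EncoderFallbackException
    // as soon as a character has no exact mapping.
    public static byte[] EncodeStrict(string text, int codePage)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        return enc.GetBytes(text);
    }
}
```

With either method, Σ is no longer silently turned into 'S' for code page 1252: the first writes '?', the second throws, which is usually the safer choice for an automatic export because the problem shows up before the file reaches the customer.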

does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
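One possible shape for such a table is sketched below. The customer identifiers, the code pages assigned to them, and the `CustomerEncodings` class are all made up for illustration; in practice the mapping would more likely live in configuration or in the database. The `RegisterProvider` call is again only needed on .NET Core / .NET 5+.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CustomerEncodings
{
    // Hypothetical customer-to-code-page table.
    private static readonly Dictionary<string, int> CodePages = new Dictionary<string, int>
    {
        { "greek-customer", 1253 },
        { "russian-customer", 1251 },
        { "french-customer", 1252 },
    };

    // Returns the customer's encoding; encoding fails loudly instead of best-fitting.
    public static Encoding For(string customerId)
    {
        // Assumption: required on .NET Core / .NET 5+, unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.GetEncoding(
            CodePages[customerId],
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ReplacementFallback);
    }

    // Reads a customer's file using that customer's code page.
    public static string Import(string customerId, string path)
    {
        return File.ReadAllText(path, For(customerId));
    }

    // Writes a customer's file using the same code page used for import.
    public static void Export(string customerId, string path, string text)
    {
        File.WriteAllText(path, text, For(customerId));
    }
}
```

Because import and export go through the same lookup, a file read for "greek-customer" is written back as windows-1253 bytes without storing the code page alongside each record.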

(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)

Brewage answered 10/6, 2013 at 22:3 Comment(0)
