Convert String (UTF-16) to UTF-8 in C#

I need to convert a string to UTF-8 in C#. I've already tried many ways, but none works as I wanted. I converted my string into a byte array and then tried to write it to an XML file (whose encoding is UTF-8...), but I either got the same string (not encoded at all) or a list of bytes, which is useless... Has anyone faced the same issue?

Edit: This is some of the code I used:

string str = "testé";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(str);
return Encoding.UTF8.GetString(utf8Bytes);

The result is "testé", but I expected something like "testé"...

Singleton answered 1/6, 2011 at 9:8 Comment(3)
Your existing code would explain your problem better. And what are you expecting to get, if not a list of bytes or a readable string? Surely in XML a readable string is exactly what you want?Spieler
Also, what do you mean when you say you "got the same string not encoded at all"? If you take a UTF-16 string, and save it to a UTF-8-encoded XML file, and then open the XML file in a text editor, you will see "the same string". You would only notice a difference if you open the file using a hex editor.Sportscast
Does this answer your question? UTF-16 to UTF-8 conversion (for scripting in Windows)Actor

If you want a UTF8 string, where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

public static string Utf16ToUtf8(string utf16String)
{
   /**************************************************************
    * Every .NET string will store text with the UTF16 encoding, *
    * known as Encoding.Unicode. Other encodings may exist as    *
    * Byte-Array or incorrectly stored with the UTF16 encoding.  *
    *                                                            *
    * UTF8 = 1-4 bytes per char                                  *
    *    ["100" for the ansi 'd']                                *
    *    ["206" and "186" for the russian 'κ']                   *
    *                                                            *
    * UTF16 = 2 bytes per char                                   *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["186, 3" for the russian 'κ']                          *
    *                                                            *
    * UTF8 inside UTF16                                          *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["206, 0" and "186, 0" for the russian 'κ']             *
    *                                                            *
    * We can use Encoding.Convert to turn a UTF16 byte array     *
    * into a UTF8 byte array. But if we used UTF8's GetString    *
    * method on it, we would just get a UTF16 string back.       *
    *                                                            *
    * So we imitate UTF16 by filling the second byte of a char   *
    * with a 0 byte (binary 0) while creating the string.        *
    **************************************************************/

    // Storage for the UTF8 string
    string utf8String = String.Empty;

    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Fill UTF8 bytes inside UTF8 string
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        // Because char always saves 2 bytes, fill char with 0
        byte[] utf8Container = new byte[2] { utf8Bytes[i], 0 };
        utf8String += BitConverter.ToChar(utf8Container, 0);
    }

    // Return UTF8
    return utf8String;
}

In my case the DLL requires a UTF8 string too, but unfortunately the UTF8 string must be interpreted with UTF16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, has to be converted to the UTF16 '–', which is 8211. If you have this case too, you can use the following instead:

public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Return UTF8 bytes as ANSI string
    return Encoding.Default.GetString(utf8Bytes);
}
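For illustration, here is roughly what this second method yields for the 'Ö' example above (my own sketch; it assumes a Windows-1252 default code page):

// 'Ö' (U+00D6) encodes to the UTF-8 bytes [195, 150].
// Under Windows-1252, byte 195 decodes to 'Ã' (U+00C3 = [195, 0])
// and byte 150 decodes to '–' (U+2013 = [19, 32]), so the result is "Ã–".
string mojibake = Utf16ToUtf8("Ö");  // "Ã–"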

Or the native method:

[DllImport("kernel32.dll")]
private static extern Int32 WideCharToMultiByte(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPWStr)] String lpWideCharStr, Int32 cchWideChar, [Out, MarshalAs(UnmanagedType.LPStr)] StringBuilder lpMultiByteStr, Int32 cbMultiByte, IntPtr lpDefaultChar, IntPtr lpUsedDefaultChar);

public static string Utf16ToUtf8(string utf16String)
{
    // Pass -1 for cchWideChar so the terminating null is counted; this matches
    // the second call below and explains the "> 1" check for non-empty strings.
    Int32 iNewDataLen = WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, null, 0, IntPtr.Zero, IntPtr.Zero);
    if (iNewDataLen > 1)
    {
        StringBuilder utf8String = new StringBuilder(iNewDataLen);
        WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, utf8String, utf8String.Capacity, IntPtr.Zero, IntPtr.Zero);

        return utf8String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf8ToUtf16. Hope I could be of help.
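For reference, a minimal sketch of what such a reverse conversion can look like, assuming the UTF-8 bytes are stored one per char as in the first method above (my own illustration, not the code behind that link):

public static string Utf8ToUtf16(string utf8String)
{
    // Each char holds one UTF-8 byte in its low byte (the high byte is 0),
    // so casting back to byte recovers the raw UTF-8 byte stream.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        utf8Bytes[i] = (byte)utf8String[i];
    }

    // Decoding those bytes as UTF-8 yields a normal .NET (UTF-16) string.
    return Encoding.UTF8.GetString(utf8Bytes);
}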

Kinswoman answered 7/2, 2013 at 21:9 Comment(7)
There is no reason to call Encoding.Unicode.GetBytes() and then Encoding.Convert; just call Encoding.UTF8.GetBytes() instead.Subtonic
The first two methods don't even work for me, as @Thomas Levesque said, "A string in C# is always UTF-16, there is no way to "convert" it". So, since I'm calling a native function that requires a wide string, and I need to get a decoded string back, your native method is what works for me.Segmental
@Segmental I am sorry to read that; it worked for me when calling a Delphi DLL's method, so I posted it here. I hope the native way works for you and that I could help you anyway.Kinswoman
Just to be sure: The string before converting will still be UTF-16, it just contains UTF-8 encoding data. You can't handle strings using the UTF-8 encoding, because .NET will always use the UTF-16 encoding to handle strings.Kinswoman
@MEN. Thank you very much. One question: every string I tested worked with the 2nd method (Encoding.Default). Will this be a problem when the user has an OS with a different default?Preciosity
@Lara Late, but not never. Yes, it should work: the documentation at "learn.microsoft.com/en-us/dotnet/api/system.text.encoding" contains the line "For ANSI encodings, the best fit behavior is the default."Kinswoman
@MEN. The statement about default behavior is in the section about fallbacks; it says that you shouldn't use the Encoding(Int32, EncoderFallback, DecoderFallback) constructor for ANSI encodings. The Default property is NOT recommended for use (see the link). Don't confuse others.Dispirited
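Following up on the Encoding.Default discussion above, here is a sketch that pins the ANSI code page explicitly instead of relying on the OS default (my own addition; on .NET Core/.NET 5+ it needs the System.Text.Encoding.CodePages package, and utf8Bytes stands for the byte array from the method above):

using System.Text;

// Register the code-page provider once at startup (not needed on .NET Framework).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Use Windows-1252 explicitly instead of the OS-dependent Encoding.Default.
Encoding windows1252 = Encoding.GetEncoding(1252);
string result = windows1252.GetString(utf8Bytes);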

A string in C# is always UTF-16, there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...).

If you want to write the string to an XML file, just specify the encoding when you create the XmlWriter.
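For example, a minimal sketch (the file name and element names are placeholders):

using System.Text;
using System.Xml;

var settings = new XmlWriterSettings { Encoding = Encoding.UTF8, Indent = true };
using (XmlWriter writer = XmlWriter.Create("output.xml", settings))
{
    writer.WriteStartElement("root");
    writer.WriteElementString("value", "testé");  // stored as UTF-8 bytes on disk
    writer.WriteEndElement();
}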

Jiggle answered 1/6, 2011 at 9:12 Comment(2)
I've finally found a solution... I've tried all that you advised me, but that didn't work for me... This piece of code worked for me: using (TextWriter writer = new StreamWriter(filename)) { xmlDoc.Save(writer); }Singleton
@Celero, it's not exactly what I suggested but it's equivalent... StreamWriter uses UTF-8 by defaultJiggle
    private static string Utf16ToUtf8(string utf16String)
    {
        /**************************************************************
         * Every .NET string will store text with the UTF16 encoding, *
         * known as Encoding.Unicode. Other encodings may exist as    *
         * Byte-Array or incorrectly stored with the UTF16 encoding.  *
         *                                                            *
         * UTF8 = 1-4 bytes per char                                  *
         *    ["100" for the ansi 'd']                                *
         *    ["206" and "186" for the russian '?']                   *
         *                                                            *
         * UTF16 = 2 bytes per char                                   *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["186, 3" for the russian '?']                          *
         *                                                            *
         * UTF8 inside UTF16                                          *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["206, 0" and "186, 0" for the russian '?']             *
         *                                                            *
         * We can use Encoding.Convert to turn a UTF16 byte array     *
         * into a UTF8 byte array. But if we used UTF8's GetString    *
         * method on it, we would just get a UTF16 string back.       *
         *                                                            *
         * So we imitate UTF16 by filling the second byte of a char   *
         * with a 0 byte (binary 0) while creating the string.        *
         **************************************************************/

        // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
        char[] chars = new char[utf8Bytes.Length];

        for (int i = 0; i < utf8Bytes.Length; i++)
        {
            chars[i] = BitConverter.ToChar(new byte[2] { utf8Bytes[i], 0 }, 0);
        }

        // Return UTF8
        return new String(chars);
    }

In the original post the author concatenated strings. Every string operation in .NET creates a new string, because strings are immutable reference types. As a result, the function provided there is visibly slow. Don't do that: use an array of chars instead, write into it directly, and then convert the result to a string. In my case, processing 500 KB of text, the difference was almost 5 minutes.
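If you prefer StringBuilder over a raw char array, the same idea looks like this (a sketch under the same assumptions as the method above):

private static string Utf16ToUtf8Builder(string utf16String)
{
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8,
                                        Encoding.Unicode.GetBytes(utf16String));

    // Pre-size the builder; each UTF-8 byte becomes exactly one char.
    StringBuilder builder = new StringBuilder(utf8Bytes.Length);
    foreach (byte b in utf8Bytes)
    {
        builder.Append((char)b);  // zero-extends the byte, same as the two-byte trick above
    }

    return builder.ToString();
}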

Kisung answered 6/10, 2016 at 9:8 Comment(1)
Actually, there is still an error: if the utf16Bytes array actually contains a 2-byte symbol, the conversion will not work correctly, as we are just dropping the high byte.Kisung

Check Jon Skeet's answer to this other question: UTF-16 to UTF-8 conversion (for scripting in Windows)

It contains the source code that you need.
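As a general sketch of that kind of file re-encoding (my own illustration, not quoted from the link; the file names are placeholders):

using System.IO;
using System.Text;

// Read the file as UTF-16 (Encoding.Unicode) and rewrite it as UTF-8.
string text = File.ReadAllText("input.txt", Encoding.Unicode);
File.WriteAllText("output.txt", text, Encoding.UTF8);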

Hope it helps.

Cuprous answered 1/6, 2011 at 9:11 Comment(0)

Does this example help?

using System;
using System.IO;
using System.Text;

class Test
{
    public static void Main()
    {
        using (StreamWriter output = new StreamWriter("practice.txt"))
        {
            // Create and write a string containing the symbol for Pi.
            string srcString = "Area = \u03A0r^2";

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

            // Write the UTF-8 and ASCII encoded byte arrays.
            output.WriteLine("UTF-8  Bytes: {0}", BitConverter.ToString(utf8String));
            output.WriteLine("ASCII  Bytes: {0}", BitConverter.ToString(asciiString));

            // Convert the UTF-8 and ASCII encoded bytes back to UTF-16
            // strings and write them.
            output.WriteLine("UTF-8  Text : {0}", Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII  Text : {0}", Encoding.ASCII.GetString(asciiString));

            Console.WriteLine(Encoding.UTF8.GetString(utf8String));
            Console.WriteLine(Encoding.ASCII.GetString(asciiString));
        }
    }
}

Wellmannered answered 1/6, 2011 at 9:12 Comment(0)
class Program
{
    static void Main(string[] args)
    {
        String unicodeString =
        "This Unicode string contains two characters " +
        "with codes outside the traditional ASCII code range, " +
        "Pi (\u03a0) and Sigma (\u03a3).";

        Console.WriteLine("Original string:");
        Console.WriteLine(unicodeString);
        UnicodeEncoding unicodeEncoding = new UnicodeEncoding();
        byte[] utf16Bytes = unicodeEncoding.GetBytes(unicodeString);
        // Note: decoding starts at byte offset 2, so the first UTF-16 code
        // unit ('T') is skipped in the round-trip below.
        char[] chars = unicodeEncoding.GetChars(utf16Bytes, 2, utf16Bytes.Length - 2);
        string s = new string(chars);
        Console.WriteLine();
        Console.WriteLine("Char Array:");
        foreach (char c in chars) Console.Write(c);
        Console.WriteLine();
        Console.WriteLine();
        Console.WriteLine("String from Char Array:");
        Console.WriteLine(s);

        Console.ReadKey();
    }
}
Jackass answered 20/9, 2016 at 15:53 Comment(0)

I have been testing string-to-char* conversion with TerraFX.Interop.Windows.

I found a way to resolve it: I wrote my own converter between string and char pointer...

namespace DeafMan1983.Interop.Runtime.Utilities;

using System;
using System.Runtime.InteropServices;

public static unsafe class UtilitiesForUTF16
{
    /*
     *  string from char* (UTF16)
     */
    public static string CharPointerToString(char* ptr)
    {
        if (ptr == null)
            return string.Empty;

        int length = 0;
        while (ptr[length] != '\0')
        {
            length++;
        }

        return new string(ptr, 0, length);
    }

    /*
     *  char* (UTF16) from string
     */
    public static char* StringToCharPointer(string input)
    {
        if (input == null)
            return null;

        // Allocate unmanaged memory for the characters plus a terminating '\0'.
        // (A stackalloc buffer would be invalid once this method returns, and a
        // string's own memory must not be written to.) The caller is responsible
        // for releasing the pointer with Marshal.FreeHGlobal.
        char* utf16Ptr = (char*)Marshal.AllocHGlobal((input.Length + 1) * sizeof(char));

        for (int i = 0; i < input.Length; i++)
        {
            utf16Ptr[i] = input[i];
        }

        utf16Ptr[input.Length] = '\0';
        return utf16Ptr;
    }

    /*
     *  It works like strlen, but it uses only char* (UTF16)
     */
    public static int CharPointerLength(char* charPtrs)
    {
        if (charPtrs == null)
            return 0;

        int length = 0;
        while (charPtrs[length] != '\0')
        {
            length++;
        }

        return length;
    }
}

Test with Program.cs

// Test for char* (UTF16)
string string_str1 = "Hello World!";
char* char_str1 = StringToCharPointer(string_str1);
Console.WriteLine($"Result: {CharPointerToString(char_str1)}");
Console.WriteLine($"Length of CharPointer: {CharPointerLength(char_str1)}");
Marshal.FreeHGlobal((IntPtr)char_str1);  // release the unmanaged copy

// Note the explicit terminator: CharPointerToString scans for '\0'.
char* char_str2 = stackalloc char[] { 'H', 'e', 'l', 'l', 'o', '\0' };
string str2 = CharPointerToString(char_str2);
Console.WriteLine($"Result: {str2}");
Console.WriteLine($"Length of CharPointer: {CharPointerLength(char_str2)}");

I have also tested it with TerraFX.Interop.Windows: the window title set via StringToCharPointer() shows normally, and it no longer shows Chinese-looking characters (the usual symptom of UTF-8 bytes being misread as UTF-16). If that case comes up again, some Encoding classes may be needed; I will look into that.

Have fun and enjoy your happy coding!

Stinson answered 19/5, 2024 at 16:28 Comment(0)
