Conversion in .NET: Native UTF-8 <-> Managed String

Asked 27/5, 2012 at 11:0 Answered 24/1, 2021 at 18:11

Solved c# string utf-8 marshalling native

I created these two methods to convert native UTF-8 strings (char*) into managed strings and vice versa. The following code does the job:

public IntPtr NativeUtf8FromString(string managedString)
{
    byte[] buffer = Encoding.UTF8.GetBytes(managedString); // not null terminated
    Array.Resize(ref buffer, buffer.Length + 1);
    buffer[buffer.Length - 1] = 0; // terminating 0
    IntPtr nativeUtf8 = Marshal.AllocHGlobal(buffer.Length);
    Marshal.Copy(buffer, 0, nativeUtf8, buffer.Length);
    return nativeUtf8;
}

string StringFromNativeUtf8(IntPtr nativeUtf8)
{
    int size = 0;
    byte[] buffer = {};
    do
    {
        ++size;
        Array.Resize(ref buffer, size);
        Marshal.Copy(nativeUtf8, buffer, 0, size);
    } while (buffer[size - 1] != 0); // till 0 termination found

    if (1 == size)
    {
        return ""; // empty string
    }

    Array.Resize(ref buffer, size - 1); // remove terminating 0
    return Encoding.UTF8.GetString(buffer);
}

While NativeUtf8FromString is OK, StringFromNativeUtf8 is a mess, but it is the only safe code I could get to run. With unsafe code I could use a byte*, but I do not want unsafe code. Can anyone think of another way, one where I do not have to re-copy the string for every byte it contains just to find the 0 terminator?

I'll just add the unsafe code here:

public unsafe string StringFromNativeUtf8(IntPtr nativeUtf8)
{
    byte* bytes = (byte*)nativeUtf8.ToPointer();
    int size = 0;
    while (bytes[size] != 0)
    {
        ++size;
    }
    byte[] buffer = new byte[size];
    Marshal.Copy((IntPtr)nativeUtf8, buffer, 0, size);
    return Encoding.UTF8.GetString(buffer);
}

As you can see, it's not ugly; it just needs unsafe.

Tower answered 27/5, 2012 at 11:0 Comment(7)

Why do you care about not using unsafe code? – Decillion 27/5, 2012 at 11:52

@CodelnChaos: Not sure. Because the project has to enable the /unsafe switch, which feels dirty to me. – Tower 27/5, 2012 at 11:59

The /unsafe switch is pretty meaningless. Marshal.* is just as unsafe as pointer code, even if it doesn't require the switch. – Decillion 27/5, 2012 at 12:10

@CodelnChaos: I totally agree that marshalling is as unsafe as the pointer code, but I thought it was worth a question. Maybe there is an easy solution that I just didn't find. – Tower 27/5, 2012 at 12:14

@CodesInChaos: Surely /unsafe means you can break the CLR, and Marshal won't let you do that? – Addiction 8/3, 2016 at 1:50

@Addiction Marshal.Copy allows you to write data to arbitrary memory locations, just like pointers allow you to write data to arbitrary memory locations. No difference in the damage you can do. – Decillion 8/3, 2016 at 8:18

@CodesInChaos: there is a lot of difference, which is why one is unsafe and the other is not. This is not the place to debate -- ask a question if you like. – Addiction 8/3, 2016 at 12:39

Just perform the exact same operation strlen() performs. Do consider keeping the buffer around; the code generates garbage in a hurry.

    public static IntPtr NativeUtf8FromString(string managedString) {
        int len = Encoding.UTF8.GetByteCount(managedString);
        byte[] buffer = new byte[len + 1];
        Encoding.UTF8.GetBytes(managedString, 0, managedString.Length, buffer, 0);
        IntPtr nativeUtf8 = Marshal.AllocHGlobal(buffer.Length);
        Marshal.Copy(buffer, 0, nativeUtf8, buffer.Length);
        return nativeUtf8;
    }

    public static string StringFromNativeUtf8(IntPtr nativeUtf8) {
        int len = 0;
        while (Marshal.ReadByte(nativeUtf8, len) != 0) ++len;
        byte[] buffer = new byte[len];
        Marshal.Copy(nativeUtf8, buffer, 0, buffer.Length);
        return Encoding.UTF8.GetString(buffer);
    }

Ma answered 27/5, 2012 at 12:29 Comment(4)

byte[] buffer = new byte[len - 1]; should be byte[] buffer = new byte[len]; – Newlin 3/8, 2014 at 2:6

But your code counts into len everything up to (but not including) the null terminator. So len contains the number of bytes without the null terminator. – Newlin 3/8, 2014 at 22:48

I could have sworn I tested this. Off-by-one bugs suck. Thanks. – Ma 3/8, 2014 at 22:57

@HansPassant: Docs say that Encoding.UTF8 inserts a BOM. Is that not a problem here? – Addiction 8/3, 2016 at 1:51
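
On the BOM question above: as far as I know, the byte order mark is only exposed through Encoding.UTF8.GetPreamble() (which stream writers may choose to emit), while GetBytes encodes just the characters it is given. A small check along those lines; the class and method names (BomCheck, EncodedLength, PreambleLength) are made up for illustration:

```csharp
using System.Text;

public static class BomCheck
{
    // Number of bytes GetBytes produces for the given text.
    // No preamble (BOM) is prepended by GetBytes.
    public static int EncodedLength(string text) => Encoding.UTF8.GetBytes(text).Length;

    // Length of the preamble that Encoding.UTF8 advertises separately.
    public static int PreambleLength() => Encoding.UTF8.GetPreamble().Length;
}
```

BomCheck.EncodedLength("A") is 1 while BomCheck.PreambleLength() is 3, so the native buffer built in the answer should start directly with the string data.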

Slightly faster than Hans' solution (one fewer buffer copy):

private unsafe IntPtr AllocConvertManagedStringToNativeUtf8(string input) {
    fixed (char* pInput = input) {
        var len = Encoding.UTF8.GetByteCount(pInput, input.Length);
        var pResult = (byte*)Marshal.AllocHGlobal(len + 1).ToPointer();
        var bytesWritten = Encoding.UTF8.GetBytes(pInput, input.Length, pResult, len);
        Trace.Assert(len == bytesWritten);
        pResult[len] = 0;
        return (IntPtr)pResult;
    }
}

private unsafe string MarshalNativeUtf8ToManagedString(IntPtr pStringUtf8)
    => MarshalNativeUtf8ToManagedString((byte*)pStringUtf8);

private unsafe string MarshalNativeUtf8ToManagedString(byte* pStringUtf8) {
    var len = 0;
    while (pStringUtf8[len] != 0) len++;
    return Encoding.UTF8.GetString(pStringUtf8, len);
}

Here's a demo round-tripping a string:

var input = "Hello, World!";
var native = AllocConvertManagedStringToNativeUtf8(input);
var copy = MarshalNativeUtf8ToManagedString(native);
Marshal.FreeHGlobal(native); // don't leak unmanaged memory!
Trace.Assert(input == copy); // prove they're equal!

Armandarmanda answered 12/10, 2019 at 21:7 Comment(1)

Assuming that you have a valid Unicode string as input, then the Assert on len == bytesWritten would never fail, correct? If that's true then you could make this a fair amount faster by over-allocating using input.Length * 4 + 1 bytes. GetBytes will tell you the actual byte length without having to parse the entire string twice (GetByteCount and then GetBytes). – Mccandless 8/10, 2021 at 18:23
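
The single-pass idea from the comment above could look roughly like this. It is only a sketch and not the answer's code: it stays in safe code, so it encodes once into a worst-case-sized managed scratch buffer and then pays for one copy into native memory. Utf8SinglePass and ToNativeUtf8 are hypothetical names, and Encoding.UTF8.GetMaxByteCount is used for the upper bound instead of the input.Length * 4 + 1 estimate:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8SinglePass
{
    // Hypothetical helper: encodes the string once, then copies only the bytes
    // actually written (plus the terminating 0) into unmanaged memory.
    public static IntPtr ToNativeUtf8(string input)
    {
        // Worst-case encoded size; +1 leaves room for the terminating 0.
        byte[] scratch = new byte[Encoding.UTF8.GetMaxByteCount(input.Length) + 1];

        // GetBytes reports how many bytes it wrote, so GetByteCount is not needed.
        int written = Encoding.UTF8.GetBytes(input, 0, input.Length, scratch, 0);

        // scratch[written] is already 0 because new arrays are zero-initialized.
        IntPtr native = Marshal.AllocHGlobal(written + 1);
        Marshal.Copy(scratch, 0, native, written + 1);
        return native; // the caller frees this with Marshal.FreeHGlobal
    }
}
```

Whether this beats the GetByteCount plus GetBytes version depends on string length and allocation cost, so it is worth measuring; reusing the scratch buffer across calls would cut down on garbage.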

Marshal.PtrToStringUTF8 and Marshal.StringToCoTaskMemUTF8 are built in on .NET Standard 2.1 and on .NET 5 and later.

Gadwall answered 24/1, 2021 at 18:11 Comment(1)

Why is it called StringToCoTaskMemUTF8 if it works on all platforms? – Mojica 29/9, 2023 at 13:28
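
For the built-in methods named in the last answer, a minimal round-trip sketch might look like the following. BuiltInUtf8 and RoundTrip are illustrative names only; the Marshal calls are the ones the answer mentions, paired with Marshal.FreeCoTaskMem to release the allocation:

```csharp
using System;
using System.Runtime.InteropServices;

public static class BuiltInUtf8
{
    // Round-trips a string through native UTF-8 memory using the built-in helpers.
    public static string RoundTrip(string input)
    {
        // Allocates unmanaged memory and writes the null-terminated UTF-8 bytes.
        IntPtr native = Marshal.StringToCoTaskMemUTF8(input);
        try
        {
            // Reads bytes up to the first 0 and decodes them as UTF-8.
            return Marshal.PtrToStringUTF8(native);
        }
        finally
        {
            // Memory from StringToCoTaskMemUTF8 is released with FreeCoTaskMem,
            // not FreeHGlobal.
            Marshal.FreeCoTaskMem(native);
        }
    }
}
```

With these available, neither the hand-written strlen loop nor the /unsafe switch is needed.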
