Conversion in .NET: Native UTF-8 <-> Managed String

I created these two methods to convert native UTF-8 strings (char*) into managed strings and vice versa. The following code does the job:

public IntPtr NativeUtf8FromString(string managedString)
{
    byte[] buffer = Encoding.UTF8.GetBytes(managedString); // not null terminated
    Array.Resize(ref buffer, buffer.Length + 1);
    buffer[buffer.Length - 1] = 0; // terminating 0
    IntPtr nativeUtf8 = Marshal.AllocHGlobal(buffer.Length);
    Marshal.Copy(buffer, 0, nativeUtf8, buffer.Length);
    return nativeUtf8;
}

string StringFromNativeUtf8(IntPtr nativeUtf8)
{
    int size = 0;
    byte[] buffer = {};
    do
    {
        ++size;
        Array.Resize(ref buffer, size);
        Marshal.Copy(nativeUtf8, buffer, 0, size);
    } while (buffer[size - 1] != 0); // till 0 termination found

    if (1 == size)
    {
        return ""; // empty string
    }

    Array.Resize(ref buffer, size - 1); // remove terminating 0
    return Encoding.UTF8.GetString(buffer);
}

While NativeUtf8FromString is OK, StringFromNativeUtf8 is a mess, but it is the only safe code I could get to run. With unsafe code I could use a byte*, but I do not want unsafe code. Can anyone think of another way, one where I do not have to copy the string again for every byte it contains just to find the 0 termination?


I'll just add the unsafe code here:

public unsafe string StringFromNativeUtf8(IntPtr nativeUtf8)
{
    byte* bytes = (byte*)nativeUtf8.ToPointer();
    int size = 0;
    while (bytes[size] != 0)
    {
        ++size;
    }
    byte[] buffer = new byte[size];
    Marshal.Copy((IntPtr)nativeUtf8, buffer, 0, size);
    return Encoding.UTF8.GetString(buffer);
}

As you can see, it's not ugly; it just needs unsafe.

Tower answered 27/5, 2012 at 11:0 Comment(7)
Why do you care about not using unsafe code? - Decillion
@CodelnChaos: Not sure. Because the project has to activate the /unsafe switch, which feels dirty to me. - Tower
The /unsafe switch is pretty meaningless. Marshal.* is just as unsafe as pointer code, even if it doesn't require the switch. - Decillion
@CodelnChaos: I totally agree that marshalling is as unsafe as the pointer code, but I thought it was worth a question. Maybe there is an easy solution that I just didn't find. - Tower
@CodesInChaos: Surely /unsafe means you can break the CLR, and Marshal won't let you do that? - Addiction
@Addiction Marshal.Copy allows you to write data to arbitrary memory locations, just like pointers allow you to write data to arbitrary memory locations. No difference in the damage you can do. - Decillion
@CodesInChaos: There is a lot of difference, which is why one is unsafe and the other is not. This is not the place to debate; ask a question if you like. - Addiction

Just perform the exact same operation strlen() performs. Do consider keeping the buffer around; the code generates garbage in a hurry.

    public static IntPtr NativeUtf8FromString(string managedString) {
        int len = Encoding.UTF8.GetByteCount(managedString);
        byte[] buffer = new byte[len + 1];
        Encoding.UTF8.GetBytes(managedString, 0, managedString.Length, buffer, 0);
        IntPtr nativeUtf8 = Marshal.AllocHGlobal(buffer.Length);
        Marshal.Copy(buffer, 0, nativeUtf8, buffer.Length);
        return nativeUtf8;
    }

    public static string StringFromNativeUtf8(IntPtr nativeUtf8) {
        int len = 0;
        while (Marshal.ReadByte(nativeUtf8, len) != 0) ++len;
        byte[] buffer = new byte[len];
        Marshal.Copy(nativeUtf8, buffer, 0, buffer.Length);
        return Encoding.UTF8.GetString(buffer);
    }
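
The advice about keeping the buffer around can be sketched as follows. This is only one possible approach, assuming System.Buffers.ArrayPool is available in your target framework; the class and method names (Utf8Interop, StringFromNativeUtf8Pooled) are hypothetical and not part of the answer above.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8Interop
{
    // Hypothetical helper: decodes a null-terminated UTF-8 string
    // through a pooled scratch buffer instead of a fresh byte[] per call.
    public static string StringFromNativeUtf8Pooled(IntPtr nativeUtf8)
    {
        // Same strlen()-style scan as in the answer above.
        int len = 0;
        while (Marshal.ReadByte(nativeUtf8, len) != 0) ++len;

        // Rent may hand back a larger array than requested,
        // so len is passed explicitly to Copy and GetString.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(len);
        try
        {
            Marshal.Copy(nativeUtf8, buffer, 0, len);
            return Encoding.UTF8.GetString(buffer, 0, len);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

The returned string is still a new allocation; only the intermediate byte array is reused.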
Ma answered 27/5, 2012 at 12:29 Comment(4)
byte[] buffer = new byte[len - 1]; should be byte[] buffer = new byte[len]; - Newlin
Your loop counts everything up to (but not including) the null terminator, so len already holds the number of bytes without the null terminator. - Newlin
I could have sworn I tested this. Off-by-one bugs suck. Thanks. - Ma
@HansPassant: Docs say that Encoding.UTF8 inserts a BOM. Is that not a problem here? - Addiction
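
On the BOM question: as far as I know, Encoding.UTF8.GetBytes does not prepend a BOM. The BOM is only exposed through GetPreamble(), which writers such as StreamWriter may emit. A small check along these lines should confirm it; the BomCheck class and StartsWithBom helper are hypothetical names used for illustration.

```csharp
using System;
using System.Text;

public static class BomCheck
{
    // Returns true if the bytes start with the UTF-8 BOM (EF BB BF).
    public static bool StartsWithBom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    // Encodes a short ASCII string and reports whether a BOM was prepended.
    public static bool GetBytesEmitsBom()
    {
        byte[] encoded = Encoding.UTF8.GetBytes("Hi");
        return StartsWithBom(encoded);
    }

    // Reports whether the encoding's preamble is the UTF-8 BOM.
    public static bool PreambleIsBom()
    {
        return StartsWithBom(Encoding.UTF8.GetPreamble());
    }
}
```

GetBytesEmitsBom() is expected to return false and PreambleIsBom() true, so the conversion methods above should not need to strip anything.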

Slightly faster than Hans' solution (one fewer buffer copy):

private unsafe IntPtr AllocConvertManagedStringToNativeUtf8(string input) {
    fixed (char* pInput = input) {
        var len = Encoding.UTF8.GetByteCount(pInput, input.Length);
        var pResult = (byte*)Marshal.AllocHGlobal(len + 1).ToPointer();
        var bytesWritten = Encoding.UTF8.GetBytes(pInput, input.Length, pResult, len);
        Trace.Assert(len == bytesWritten);
        pResult[len] = 0;
        return (IntPtr)pResult;
    }
}

private unsafe string MarshalNativeUtf8ToManagedString(IntPtr pStringUtf8)
    => MarshalNativeUtf8ToManagedString((byte*)pStringUtf8);

private unsafe string MarshalNativeUtf8ToManagedString(byte* pStringUtf8) {
    var len = 0;
    while (pStringUtf8[len] != 0) len++;
    return Encoding.UTF8.GetString(pStringUtf8, len);
}

Here's a demo round-tripping a string:

var input = "Hello, World!";
var native = AllocConvertManagedStringToNativeUtf8(input);
var copy = MarshalNativeUtf8ToManagedString(native);
Marshal.FreeHGlobal(native); // don't leak unmanaged memory!
Trace.Assert(input == copy); // prove they're equal!
Armandarmanda answered 12/10, 2019 at 21:7 Comment(1)
Assuming that you have a valid Unicode string as input, then the Assert on len == bytesWritten would never fail, correct? If that's true then you could make this a fair amount faster by over-allocating using input.Length * 4 + 1 bytes. GetBytes will tell you the actual byte length without having to parse the entire string twice (GetByteCount and then GetBytes). - Mccandless
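
A safe-code variation on this single-pass idea might look like the sketch below. It is not the commenter's exact suggestion: instead of over-allocating native memory, it encodes once into a pooled scratch buffer sized by Encoding.UTF8.GetMaxByteCount and then allocates exactly the bytes written plus the terminator. The names Utf8SinglePass and NativeUtf8FromStringSinglePass are hypothetical.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8SinglePass
{
    // Hypothetical helper: encodes the string once (no GetByteCount pass),
    // then copies the result plus a terminating 0 into native memory.
    public static IntPtr NativeUtf8FromStringSinglePass(string managedString)
    {
        // Worst-case size for this many UTF-16 code units.
        int maxBytes = Encoding.UTF8.GetMaxByteCount(managedString.Length);
        byte[] scratch = ArrayPool<byte>.Shared.Rent(maxBytes + 1);
        try
        {
            // GetBytes reports how many bytes it actually wrote.
            int written = Encoding.UTF8.GetBytes(
                managedString, 0, managedString.Length, scratch, 0);
            scratch[written] = 0; // terminating 0

            IntPtr nativeUtf8 = Marshal.AllocHGlobal(written + 1);
            Marshal.Copy(scratch, 0, nativeUtf8, written + 1);
            return nativeUtf8;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(scratch);
        }
    }
}
```

The caller still owns the returned pointer and has to release it with Marshal.FreeHGlobal. Whether this beats the two-pass version depends on string length and would need measuring.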

Marshal.PtrToStringUTF8 and Marshal.StringToCoTaskMemUTF8 are available in .NET 5 and later (and in .NET Standard 2.1).
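
A minimal round-trip sketch with these two methods, assuming a target framework that provides them; the BuiltInUtf8 class and RoundTrip method are hypothetical names for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class BuiltInUtf8
{
    // Round-trips a string through native, null-terminated UTF-8 memory.
    public static string RoundTrip(string input)
    {
        IntPtr native = Marshal.StringToCoTaskMemUTF8(input);
        try
        {
            return Marshal.PtrToStringUTF8(native);
        }
        finally
        {
            // Memory from StringToCoTaskMemUTF8 is released with
            // FreeCoTaskMem, not FreeHGlobal.
            Marshal.FreeCoTaskMem(native);
        }
    }
}
```

Note that the allocator differs from the AllocHGlobal-based answers above, so the matching free call differs too.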

Gadwall answered 24/1, 2021 at 18:11 Comment(1)
Why is it called StringToCoTaskMemUTF8 if it works on all platforms? - Mojica
