MarshalAs(UnmanagedType.LPStr) - how does this convert utf-8 strings to char*
The question title is basically what I'd like to ask:

[MarshalAs(UnmanagedType.LPStr)] - how does this convert utf-8 strings to char* ?

I use the above line when I attempt to communicate between c# and c++ dlls; more specifically, between:

somefunction(char *string) [c++ dll]

somefunction([MarshalAs(UnmanagedType.LPStr)] string text) [c#]

When I send my utf-8 text (scintilla.Text) through c# and into my c++ dll, I'm shown in my VS 10 debugger that:

  1. the c# string was successfully converted to char*

  2. the resulting char* properly reflects the corresponding utf-8 chars (including the bit in Korean) in the watch window.

Here's a screenshot (with more details):

[screenshot]

As you can see, initialScriptText[0] returns the single byte(char): 'B' and the contents of char* initialScriptText are displayed properly (including Korean) in the VS watch window.

Going through the char pointer, it seems that English is saved as one byte per char, while Korean seems to be saved as two bytes per char. (the Korean word in the screenshot is 3 letters, hence saved in 6 bytes)

This seems to show that each 'letter' isn't saved in equal size containers, but differs depending on language. (possible hint on type?)

I'm trying to achieve the same result in pure c++: reading in utf-8 files and saving the result as char*.

Here's an example of my attempt to read a utf-8 file and convert to char* in c++:

[screenshot]

observations:

  1. loss in visual when converting from wchar_t* to char*
  2. since result, s8 displays the string properly, I know I've converted the utf-8 file content in wchar_t* successfully to char*
  3. 'result' retains the bytes I took directly from the file, yet it differs from what I got through c# (using the same file), so I've concluded that the c# marshaller puts the file contents through some further procedure when converting the text to char*.

(the screenshot also shows my terrible failure in using wcstombs)

note: I'm using the utf8 header from (http://utfcpp.sourceforge.net/)

Please correct me on any mistakes in my code/observations.

I'd like to be able to mimic the result I'm getting through the c# marshal and I've realised after going through all this that I'm completely stuck. Any ideas?

Pareira answered 8/11, 2012 at 12:30 Comment(3)
UTF-8 is a variable width encoding, so yes characters can be expressed as 1 or more bytes. Check the Wikipedia article for specifics.Samadhi
See GDAL how to support unicode characters in c#Lola
See GDAL unicode characters support in c#Lola

[MarshalAs(UnmanagedType.LPStr)] - how does this convert utf-8 strings to char* ?

It doesn't. There is no such thing as a "utf-8 string" in managed code; strings are always encoded in UTF-16. Marshaling to and from an LPStr is done with the default system code page, which makes it fairly remarkable that you see Korean glyphs in the debugger, unless you use code page 949.
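One way to check which encoding a char* actually holds is to count bytes. The sketch below is a hand-written UTF-8 encoder (utf8_encode is a hypothetical helper name, not part of any library mentioned here). In UTF-8 a Hangul syllable such as U+D55C takes three bytes, so two bytes per Korean character, as observed in the question, is consistent with code page 949 rather than UTF-8.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // ASCII range: one byte, identical to the code point
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate halves are not valid code points
        }
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx (Hangul syllables land here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling utf8_encode(0x42) yields the single byte 'B', while utf8_encode(0xD55C) yields the three bytes ED 95 9C.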

If interop with utf-8 is a hard requirement then you need to use a byte[] in the pinvoke declaration. And convert back and forth yourself with System.Text.Encoding.UTF8. Use its GetString() method to convert the byte[] to a string, its GetBytes() method to convert a string to byte[]. Avoid all this if possible by using wchar_t[] in the native code.
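On the native side, reading a UTF-8 file into a char* should need no conversion step, because the bytes in the file are already UTF-8. A minimal sketch, assuming the file fits in memory; read_utf8_file and strip_utf8_bom are hypothetical helper names, and the only processing is dropping an optional byte order mark:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Remove a leading UTF-8 byte order mark (EF BB BF) if one is present.
std::string strip_utf8_bom(std::string bytes) {
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        bytes.erase(0, 3);
    }
    return bytes;
}

// Read the whole file as raw bytes; no decoding or re-encoding takes place.
// Binary mode keeps the stream from translating line endings.
std::string read_utf8_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return strip_utf8_bom(bytes);
}
```

The returned std::string owns the buffer, and its c_str() gives a zero-terminated const char* holding the UTF-8 bytes. A debugger that interprets char* in the system code page may still display those bytes as garbage even though they are correct.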

Cortezcortical answered 8/11, 2012 at 13:45 Comment(1)
Thank you for the reply. I realised that I was so caught up in this issue of char* conversion that I blindly forgot about a simpler wchar_t[] implementation.Pareira

While the other answers are correct, there has been a major development in .NET Framework 4.7. Now there is an option that does exactly what UTF-8 needs: UnmanagedType.LPUTF8Str. I tried it and it works like a Swiss chronometer, doing exactly what it sounds like.

In fact, I even used MarshalAs(UnmanagedType.LPUTF8Str) in one parameter and MarshalAs(UnmanagedType.LPStr) in another. Also works. Here is my method (takes in string parameters and returns a string via a parameter):

[DllImport("mylib.dll", ExactSpelling = true, CallingConvention = CallingConvention.StdCall)]
public static extern void ProcessContent(
    [MarshalAs(UnmanagedType.LPUTF8Str)] string content,
    [MarshalAs(UnmanagedType.LPUTF8Str), Out] StringBuilder outputBuffer,
    [MarshalAs(UnmanagedType.LPStr)] string settings);

Thanks, Microsoft! Another nuisance is gone.

Nonresistant answered 21/3, 2018 at 3:34 Comment(2)
I was doing some stuff in netstandard2.0 and saw that this isn't there... see this link for which target frameworks support UnmanagedType.LPUTF8Str: apisof.net/catalog/…Virg
It was added in netstandard2.1: github.com/dotnet/standard/issues/595.Paynim

On .NET Framework versions earlier than 4.7, an ICustomMarshaler can be used instead.

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

class UTF8StringCodec : ICustomMarshaler
{
    public static ICustomMarshaler GetInstance(string cookie) => new UTF8StringCodec();

    public void CleanUpManagedData(object ManagedObj)
    {
        // nop
    }

    public void CleanUpNativeData(IntPtr pNativeData)
    {
        Marshal.FreeCoTaskMem(pNativeData);
    }

    public int GetNativeDataSize()
    {
        return -1; // the string has no fixed native size
    }

    public IntPtr MarshalManagedToNative(object ManagedObj)
    {
        if (ManagedObj == null)
        {
            return IntPtr.Zero; // a null string marshals as a null pointer
        }
        var text = $"{ManagedObj}";
        var bytes = Encoding.UTF8.GetBytes(text);
        var ptr = Marshal.AllocCoTaskMem(bytes.Length + 1);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        Marshal.WriteByte(ptr, bytes.Length, 0);
        return ptr;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        if (pNativeData == IntPtr.Zero)
        {
            return null;
        }

        var bytes = new MemoryStream();
        var ofs = 0;
        while (true)
        {
            var byt = Marshal.ReadByte(pNativeData, ofs);
            if (byt == 0)
            {
                break;
            }
            bytes.WriteByte(byt);
            ofs++;
        }

        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}

P/Invoke declaration:

[DllImport("native.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static int NativeFunc(
    [MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8StringCodec))] string path
);

Usage inside callback:

[StructLayout(LayoutKind.Sequential)]
struct Options
{
    [MarshalAs(UnmanagedType.FunctionPtr)]
    public CallbackFunc callback;
}

[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
public delegate int CallbackFunc(
    [MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8StringCodec))] string path
);
Jeffreys answered 18/11, 2020 at 15:45 Comment(0)

If you need to marshal a UTF-8 string, do it manually.

Define the function with an IntPtr parameter instead of a string:

somefunction(IntPtr text)

Then convert the text to a zero-terminated array of UTF-8 bytes and copy it to unmanaged memory:

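byte[] retArray = Encoding.UTF8.GetBytes(text);
byte[] retArrayZ = new byte[retArray.Length + 1];   // the extra byte stays 0 and acts as the terminator
Array.Copy(retArray, retArrayZ, retArray.Length);
IntPtr retPtr = Marshal.AllocHGlobal(retArrayZ.Length);
Marshal.Copy(retArrayZ, 0, retPtr, retArrayZ.Length);
somefunction(retPtr);
Marshal.FreeHGlobal(retPtr);   // assumes the native side does not keep the pointer after the call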
Indefeasible answered 29/11, 2014 at 4:10 Comment(0)
