How can I get a Path from a raw C string (CStr or *const u8)?

What's the most direct way to use a C string as Rust's Path?

I've got a const char * from FFI and need to use it as a filesystem path in Rust.

  • I'd rather not enforce UTF-8 on the path, so converting through str/String is undesirable.
  • It should work on Windows at least for ASCII paths.

To clarify: I'm just replacing an existing C implementation that passes the path to fopen with a Rust stdlib implementation. It's not my problem whether it's a valid path or encoded properly for a given filesystem, as long as it's not worse than fopen (and I know fopen basically doesn't work on Windows).

Polyvalent answered 21/9, 2017 at 11:22 Comment(8)
Use CString::into_bytes with OsStringExt followed by PathBuf::from on Unix and String on Windows. - Loverly
@Loverly But this would include an allocation and copying the string, right? The way I understood this question is that Kornel wants to avoid that and just work with a given C string (right?). I expected there to be a conversion function from CStr to OsStr, but I can't find such a function. - Vanhoose
There doesn't seem to be an alloc-free approach, which is a bit intriguing but reflects the safety that we wish to impose on strings. - Pollerd
@LukasKalbertodt Isn't OsStrExt::from_bytes such a function? AFAICT an OsStr can be produced from a CStr using OsStrExt::from_bytes(cstr.to_bytes()). This will obviously only work on Unix, but that's unavoidable, since Rust on Windows uses a native OsStr implementation incompatible with char *. :/ - Auckland
@Pollerd I suppose an approach that is alloc-free where possible could be hacked together by implementing a Cow-like enum for C-backed paths. On Unix the enum would expose a C-string-backed &OsStr, and on Windows it would allocate an OsString and expose its underlying &OsStr. - Auckland
@Auckland Yes, by alloc-free I meant something that would never allocate, even conditionally. - Pollerd
You can't have alloc-free and portability because you need to deal with the fact that Windows paths might be UTF-16 encoded. You can get alloc-free on Unix using OsStrExt. - Loverly
"This will obviously only work on Unix" -- it's only "obvious" because Rust is so broken. "but that's unavoidable, since Rust on Windows uses a native OsStr implementation incompatible with char *." -- this is utter nonsense. - Hirsutism

Here's what I've learned:

  • Path/OsStr always use WTF-8 on Windows, and are an encoding-ignorant bag of bytes on Unix.

  • They never ever store any paths using any "wide" encoding like UTF-16 or UCS-2. The Windows-only masquerade of OsStr is to hide the WTF-8 encoding, nothing more.

  • It is extremely unlikely to ever change, because the standard library API supports creation of Path and OsStr from UTF-8 &str without any allocation or mutation of memory (i.e. as_ref() is supported, and its strict API doesn't leave room to implement it as anything other than a pointer cast).

Unix-only zero-copy version (it doesn't even depend on any implementation details):

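use std::ffi::{CStr, OsStr};
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

// CStr::from_ptr is unsafe: the pointer must be non-null, NUL-terminated,
// and must stay valid for as long as `path` is in use.
let slice = unsafe { CStr::from_ptr(c_null_terminated_string_ptr_here) };
// View the bytes (without the trailing NUL) as an OsStr; nothing is copied.
let osstr = OsStr::from_bytes(slice.to_bytes());
let path: &Path = osstr.as_ref();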

On Windows, accepting only valid UTF-8 is the best Rust can do, short of the charade of building a WTF-8 OsString from 16-bit code units:

…
let str = ::std::str::from_utf8(slice.to_bytes()).expect("keep your surrogates paired");
let path: &Path = str.as_ref();
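
The two variants above could be folded into one helper that picks the right conversion at compile time. This is only a sketch: `path_from_cstr` is a made-up name, and the non-Unix branch assumes that rejecting non-UTF-8 input is acceptable. Neither branch allocates.

```rust
use std::ffi::CStr;
use std::path::Path;

/// Hypothetical helper (not a std API): borrow a `Path` from a C string.
/// On Unix any byte sequence is accepted as-is.
#[cfg(unix)]
fn path_from_cstr(s: &CStr) -> Option<&Path> {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    // to_bytes() drops the trailing NUL; from_bytes() is a zero-copy view.
    Some(Path::new(OsStr::from_bytes(s.to_bytes())))
}

/// Elsewhere (e.g. Windows) only valid UTF-8 is accepted; anything else
/// yields `None` rather than a panic.
#[cfg(not(unix))]
fn path_from_cstr(s: &CStr) -> Option<&Path> {
    s.to_str().ok().map(|utf8| Path::new(utf8))
}
```

From FFI code the `&CStr` would come from `unsafe { CStr::from_ptr(ptr) }`, with the usual requirements on the pointer.
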
Polyvalent answered 29/10, 2018 at 21:15 Comment(0)

Safely and portably? Insofar as I'm aware, there isn't a way. My advice is to demand UTF-8 and just pray it never breaks.

The problem is that the only thing you can really say about a "C string" is that it's NUL-terminated. You can't really say anything meaningful about how it's encoded. At least, not with any real certainty.

Unsafely and/or non-portably? If you're running on Linux (and possibly other modern *NIXen), you can maybe use OsStrExt to do the conversion. This only works assuming the C string was a valid path in the first place. If it came from some string processing code that wasn't using the same encoding as the filesystem (which these days is generally "arbitrary bytes that look like UTF-8 but might not be")... well, you'll have to convert it yourself, first.
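
To illustrate the Unix case, here is a small sketch showing that OsStrExt passes the bytes through untouched, even when they are not valid UTF-8. The helper names `unix_path` and `latin1_example` and the sample path are invented for the demonstration, and the code only compiles on Unix targets.

```rust
use std::ffi::{CStr, OsStr};
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Hypothetical helper (Unix only): view a C string's bytes as a `Path`.
fn unix_path(s: &CStr) -> &Path {
    Path::new(OsStr::from_bytes(s.to_bytes()))
}

/// Hypothetical sample input: 0xE9 is Latin-1 for an accented "e" and is
/// not valid UTF-8 on its own.
fn latin1_example() -> &'static CStr {
    CStr::from_bytes_with_nul(b"/tmp/caf\xE9\0")
        .expect("exactly one NUL, at the end")
}
```

Going through `str` would reject this input (`CStr::to_str` returns an error), while the resulting `Path` still carries the original bytes, so whatever `fopen` would have received is what the Rust file APIs receive too.
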

On Windows? Hahahaha. This depends on where the string came from. C strings embedded in an executable can be in a variety of encodings depending on how the code was compiled. If it came from the OS itself, it could be in one of two different encodings: the thread's OEM codepage, or the thread's ANSI codepage. I never worked out how to check which it's set to. If it came from the console, it would be in whatever the console's input encoding was set to when you received it... assuming it wasn't piped in from something else that was using a different encoding (hi there, PowerShell!). All of the above require you to roll your own transcoding code, since Rust itself avoids this by never, ever using non-Unicode APIs on Windows.

Oh, and don't forget that there is no 8-bit encoding that can properly store Windows paths, since Windows paths are "arbitrary 16-bit words that look like UTF-16 but might not be". [1]

... so, like I said: demand UTF-8 and just pray it never breaks, because trying to do it "correctly" leads to madness.


[1]: I should clarify: there is such an encoding: WTF-8, which is what Rust uses for OsStr and OsString on Windows. The catch is that nothing else on Windows uses this, so it's never going to be how a C string is encoded.

Appellation answered 21/9, 2017 at 11:48 Comment(6)
@red75prime: But you absolutely cannot assume that all paths will be ASCII. Technically, you can't even assume that all possible encodings have ASCII as a common subset, since there are non-ASCII encodings still usable on Windows. But once you reach that point, it's hard to continue caring. - Appellation
The Unix remark is too pessimistic. A const char * received from FFI is meant to be passed to functions such as fopen() (this is what any C code would do), so the OP needn't care about the encoding it's in. On the Rust side, that's precisely what OsStrExt::from_bytes is for. With the Windows part I couldn't agree more. - Auckland
@user4815162342: There is nothing in the question about where this string is coming from, only that it's a C string. It could be coming from a database that encodes all its strings in EBCDIC for all I know. - Appellation
That's a theoretical possibility, but I suspect the OP would have mentioned that, or would have referred to the data as uint8_t * or equivalent. The fact that the OP wants to "use the const char * as a filesystem path" indicates that the file names are fine as they are, content-wise, but it takes some effort to convince Rust of the fact. (Going through strings enforces UTF-8, for example.) - Auckland
"there is no 8-bit encoding that can properly store Windows paths" -- this is obviously nonsense; everything in memory has an 8-bit encoding. "since Windows paths are 'arbitrary 16-bit words that look like UTF-16 but might not be'" -- and POSIX paths are arbitrary byte sequences that don't contain NULs (with some fine print about slashes). None of this is relevant. Just as POSIX OSes don't check that paths are valid Unicode, neither does Windows. Rust breaks the round trip on Windows; other languages don't. - Hirsutism
"C strings embedded in an executable can be in a variety of encodings depending on how the code was compiled. ..." -- All of which is, of course, also true on POSIX. And good luck trying to write fsck/chkdsk in the Rust "systems programming" language. D, e.g., doesn't have this problem because it understands that abstractions leak and system programmers need them to leak, and so it allows viewing byte sequences however the programmer wants, and only enforces Unicode if/when the programmer wants that. - Hirsutism
