"Pathname can't be converted from UTF-8 to current locale" warning with Libarchive::Read module

I'm getting the file listings for tar.gz files using the Libarchive::Read module. When a tarball file name has UTF-8 characters in it, I get an error which is generated by the libarchive C library:

Pathname can't be converted from UTF-8 to current locale.

in block at /Users/steve/.rakubrew/versions/moar-2022.12/share/perl6/site/sources/42AF7739DF41B2DA0C4BF2069157E2EF165CE93E (Libarchive::Read) line 228

The error is thrown with the Raku code here:

my $r := Libarchive::Read.new($newest_file);
my $needs_update = False;
for $r -> $entry {  # WARNING THROWN HERE for each file in tarball listing
    $entry.pathname;
    $needs_update = True if $entry.is-file && $entry.pathname && $entry.pathname ~~ / ( \.t || \.pm || \.pm6 ) $ / ;
    last if $needs_update;
}

I'm on a mac. The locale command reports the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

There seems to be a well-reported bug with the libarchive C library: https://github.com/libarchive/libarchive/issues/587.

Is there any way to tell Raku, or the module, which locale is in use so that I can list tarballs whose file names contain UTF-8 characters?

Briefing answered 18/1, 2023 at 4:53 Comment(6)
The issue discussion looks diligent, intelligent, and extensive. It remains open, but it looks like a major, directly relevant PR has been merged: "Fix unpacking of filenames with contains UTF-8 characters". Maybe it would help if you reviewed that and edited your Q or commented to indicate how that does or doesn't help for your use case. - Misguide
See also libarchive's wiki page Filenames, with sections such as "The Problem" (in particular, "It is also possible that the filename happens to be encoded in the same encoding as the local user's preference but again, there is no way that we can reliably detect this ... The proposed long-term solution below currently punts this to the client software; clients must be able to handle both UTF-8 and arbitrary byte sequence filenames.") and then the sections "Proposed Long-term Solution" and "Proposed Interim Solution". - Misguide
OK, I edited it to make it clearer that the C library was generating the error. - Briefing
I have the locales set to "en_US.UTF-8". I'm not having any luck getting them set to "C.UTF-8", except for the LANG environment variable on my Mac. But I'm not even sure it's worth the effort. Is there any important difference between "en_US.UTF-8" and "C.UTF-8"? - Briefing
Yeah, so the "client" in this case would be the Raku module, right? So I have to somehow tell it to recognize the UTF-8 characters? - Briefing
Does it make sense to accept your answer or is there something significant left hanging? - Misguide

To work around this problem, I switched to a more recent Raku module, Archive::Libarchive. This code runs without the warning:

my Archive::Libarchive $a .= new: operation => LibarchiveRead, file => $newest_file.Str;
my Archive::Libarchive::Entry $entry .= new;

my $needs_update = False;
while $a.next-header($entry) {
    $a.data-skip;
    $needs_update = True if $entry.pathname.substr(*-1) ne '/' && $entry.pathname && $entry.pathname ~~ / ( \.t || \.pm || \.pm6 ) $ / ;
    last if $needs_update;
}
$a.close;

This module also uses the libarchive C library, but I guess it does so in a way that knows how to handle UTF-8 characters.
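
The filename check in the loop can also be tried on its own, without either module, to confirm that the pattern copes with non-ASCII path names. This is only a minimal sketch: the helper name looks-like-source and the sample paths are made up for illustration and are not part of Archive::Libarchive or Libarchive::Read.

```raku
# Hypothetical helper (not provided by either module): decide whether a
# tarball entry name looks like a test file or a module file.
sub looks-like-source(Str $pathname --> Bool) {
    # Directory entries in a tar listing end with '/', so skip them.
    return False if $pathname.ends-with('/');
    # Same extensions as in the loop above: .t, .pm, .pm6
    so $pathname ~~ / [ \.t | \.pm | \.pm6 ] $ /;
}

# Made-up sample paths, including one with non-ASCII characters.
say looks-like-source('lib/Föö/Bär.pm6');
say looks-like-source('t/');
say looks-like-source('README.md');
```

Inside the loop, the condition would then read `$needs_update = True if looks-like-source($entry.pathname);`, assuming $entry.pathname returns a defined Str as it does in the code above.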

Briefing answered 18/1, 2023 at 17:26 Comment(0)
