Set UTF-8 pathname header in libarchive

SUMMARY

How can I write a zip file using libarchive in C++, such that path names will be UTF-8 encoded? With UTF-8 path names, special characters will be decoded correctly when using OS X / Linux / Windows 8 / 7-Zip / WinZip.

DETAILS

I am trying to write a zip archive using libarchive, compiling with Visual C++ 2013 on Windows.

I would like to be able to add files with non-ASCII chars (e.g. äöü.txt) to the zip archive.

There are four functions to set the pathname header in libarchive:

void archive_entry_set_pathname(struct archive_entry *, const char *);
void archive_entry_copy_pathname(struct archive_entry *, const char *);
void archive_entry_copy_pathname_w(struct archive_entry *, const wchar_t *);
int  archive_entry_update_pathname_utf8(struct archive_entry *, const char *);

Unfortunately, none of them seem to work.

In particular, I have tried:

const char* myUtf8Str = ...
archive_entry_update_pathname_utf8(entry, myUtf8Str);
// this sounded like the most straightforward solution

and

const wchar_t* myUtf16Str = ...
archive_entry_copy_pathname_w(entry, myUtf16Str);
// UTF-16 encoded strings seem to be the default on Windows

In both cases, the resulting zip archive does not show the file names correctly in either Windows Explorer or 7-Zip.

I am certain that my input strings are encoded correctly, since I convert them from Qt QString instances that work perfectly well in other parts of my code:

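const QByteArray utf8Bytes = filename.toUtf8();          // must outlive myUtf8Str
const std::wstring utf16Chars = filename.toStdWString(); // must outlive myUtf16Str
const char* myUtf8Str = utf8Bytes.constData();
const wchar_t* myUtf16Str = utf16Chars.c_str();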

For instance, this works even for another call to libarchive, when creating the zip file:

archive_write_open_filename_w(archive, zipFile.toStdWString().c_str());
// creates a zip archive file where the non-ASCII
// chars are encoded correctly, e.g. äöü.zip

I have also tried to change the options for libarchive, as suggested by this example:

archive_write_set_options(a, "hdrcharset=UTF-8");

This call fails, however, so I assume that I have to set some other option, but I'm running out of ideas...

UPDATE 2

I have done some more reading about the zip format. It allows writing file names in UTF-8, such that OS X / Linux / Windows 8 / 7-Zip / WinZip will always decode them correctly, see e.g. here.

This is what I want to achieve using libarchive, i.e. I would like to pass it my UTF-8 encoded pathname and have it store that in the zip file without doing any conversion.
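For reference, the zip format signals UTF-8 names through bit 11 (0x0800) of an entry's general purpose bit flag. Here is a minimal sketch for checking whether the first entry of an archive sets that bit. The function name zipEntryDeclaresUtf8 is made up for illustration, and it only inspects the first local file header, not the central directory, so treat it as a quick diagnostic rather than a full parser:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of libarchive): returns true if `bytes`
// starts with a ZIP local file header whose general purpose bit 11
// (0x0800, the "language encoding flag") is set.
bool zipEntryDeclaresUtf8(const std::vector<std::uint8_t>& bytes)
{
    // Local file header layout (little-endian):
    //   offset 0: signature 'P' 'K' 0x03 0x04 (4 bytes)
    //   offset 4: version needed to extract   (2 bytes)
    //   offset 6: general purpose bit flag    (2 bytes)
    if (bytes.size() < 8) {
        return false;
    }
    if (bytes[0] != 0x50 || bytes[1] != 0x4B ||
        bytes[2] != 0x03 || bytes[3] != 0x04) {
        return false;
    }
    const std::uint16_t flags =
        static_cast<std::uint16_t>(bytes[6] | (bytes[7] << 8));
    return (flags & 0x0800) != 0;
}
```

To use it, read the first 8 bytes of the generated zip file into a vector and pass them in. If the result is false, readers will usually fall back to a legacy code page when decoding the names.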

I have added the "set locale" approach as an (unsatisfying) answer.

Neral answered 3/12, 2014 at 9:31 Comment(7)
Related: code.google.com/p/libarchive/issues/detail?id=247. One suggestion from that issue is to call setlocale(LC_ALL, ""). - Cleopatracleopatre
Thanks, I updated the question to address this. - Neral
You should make that an answer if it solves your problem, and remove it from the question. - Cleopatracleopatre
std::locale::global(std::locale("")); replaces the C++ global locale as well as the C locale with the user-preferred locale. See the example at en.cppreference.com/w/cpp/locale/locale. - Killing
Well, it seems to work on my machine, but I would call it an ugly hack rather than a solution... I'll accept the answer if nobody provides a more elegant solution. - Neral
Everything even remotely related to locales and Unicode looks like "an ugly hack". And probably is... depending on the point of view :) - Killing
UTF-8 is not an ugly hack, so that's what I'm really looking for. - Neral

This is a workaround that will store path names using the system's locale settings, i.e. the resulting zip file can be decoded correctly on the same system, but is not portable.

This is not satisfying; I am only posting it to show that it is not what I am looking for.

Set the global locale to "" as explained here:

std::locale::global(std::locale(""));

and then read it back:

std::locale loc;
std::cout << loc.name() << std::endl;
// output: English_United States.1252
// may of course be different depending on system settings

Then set the pathname using archive_entry_update_pathname_utf8.

The zip file now contains file names encoded with Windows-1252, so my Windows can read them, but they appear as garbage on e.g. Linux.
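The mismatch is easy to see in the raw bytes: in Windows-1252, äöü is E4 F6 FC, whereas in UTF-8 it is C3 A4 C3 B6 C3 BC. Below is a rough sketch of a structural UTF-8 check that tells the two apart. looksLikeUtf8 is a hypothetical helper, not a libarchive function, and it is deliberately simplified: it checks lead and continuation bytes only, so it does not reject every invalid sequence (for example overlong 3-byte forms or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte sequence in `s` has the
// shape of a UTF-8 encoded character (valid lead byte followed by the
// expected number of 10xxxxxx continuation bytes). Simplified on purpose.
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;          // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;          // 2-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;          // 3-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;          // 4-byte sequence
        } else {
            return false;       // stray continuation byte or invalid lead
        }
        if (i + extra >= s.size()) {
            return false;       // sequence is cut off at the end
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;   // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Running the stored name bytes from the zip through such a check shows quickly whether the archive ended up with UTF-8 or with a legacy code page.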

Future

There is a libarchive issue for UTF-8 filenames. The whole story is quite complicated, but it sounds like they may add better UTF-8 support in libarchive 4.0.

Neral answered 5/12, 2014 at 8:42 Comment(0)

I got UTF-8 filenames working in ZIP archives with libarchive-3.3.3, using this exact flow (the sequence is important!):

entry = archive_entry_new();
archive_entry_set_pathname_utf8(entry, utf8Filename);
archive_entry_set_pathname(entry, utf8Filename);

If the order of archive_entry_set_pathname_utf8 and archive_entry_set_pathname is swapped, the entries are garbled in Windows Explorer's ZIP functionality. This worked for me with German umlauts (and should work for every UTF-8 character). It even worked for 2-byte and 3-byte UTF-8 characters (NFC/NFD).

Addition: The process must be run in an environment with a LANG variable set to a UTF-8 capable locale (e.g. "LANG=de_DE.UTF-8" in my case). Without such an environment, the process won't generate correct UTF-8 characters.
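To see whether the process actually runs under a UTF-8 locale, one option is to look at the codeset part of the locale name. A small sketch, assuming a simple name of the form language_TERRITORY.codeset[@modifier]; localeNameIsUtf8 is a hypothetical helper, and composite names (one setting per category) are not handled:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the codeset part of a locale name
// (the text after the first '.') is some spelling of UTF-8, e.g.
// "de_DE.UTF-8" or "en_US.utf8". Names without a codeset return false.
bool localeNameIsUtf8(const std::string& name)
{
    const std::size_t dot = name.find('.');
    if (dot == std::string::npos) {
        return false;
    }
    std::string codeset = name.substr(dot + 1);

    // Drop an optional "@modifier" suffix such as "@euro".
    const std::size_t at = codeset.find('@');
    if (at != std::string::npos) {
        codeset.erase(at);
    }

    // Normalize: remove '-' and compare case-insensitively.
    std::string normalized;
    for (const char ch : codeset) {
        if (ch == '-') {
            continue;
        }
        normalized += static_cast<char>(
            std::toupper(static_cast<unsigned char>(ch)));
    }
    return normalized == "UTF8";
}
```

The name to pass in could come from std::setlocale(LC_ALL, nullptr) after calling std::setlocale(LC_ALL, ""). With the Windows locale mentioned in the other answer, "English_United States.1252", the check returns false.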

Goode answered 17/8, 2019 at 23:37 Comment(1)
Unfortunately this does not make any difference for me. I still fail to create a UTF-8 encoded archive on Windows with libarchive.Ferity