Convert Unicode decomposition when transferring files to web server

I am doing website development on OS X, and fairly often I find myself in situations where I move some part of a live website (running Linux/LAMP) to a development server running on my own machine. One such instance involves downloading images (user-generated content, e.g. via FTP download), processing them in one way or another and then putting them back on the production site.

The image files involved, having been created on a Linux machine, appear to have their filenames encoded in UTF-8 in NFC (precomposed) form. OS X's HFS+ file system, on the other hand, does not store NFC filenames and converts them to NFD (decomposed). Once I am done and want to upload the files, their names are therefore in NFD, and since Linux accepts both forms without converting between them, they stay that way on the server. As a result, the newly uploaded (and in some cases replaced) files are not accessible at the expected URL.
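To make the difference concrete, here is a minimal Python sketch of the two forms. The filename is made up for illustration, and the standard library's unicodedata module stands in for what the file systems do to the name.

```python
import unicodedata

# Hypothetical filename; \u00c4 is the precomposed "Ä".
nfc_name = unicodedata.normalize("NFC", "\u00c4rende.jpg")
# NFD splits "Ä" into "A" followed by a combining diaeresis (U+0308).
nfd_name = unicodedata.normalize("NFD", nfc_name)

print(nfc_name == nfd_name)          # False
print(len(nfc_name), len(nfd_name))  # 10 11
print(nfc_name.encode("utf-8"))      # b'\xc3\x84rende.jpg'
print(nfd_name.encode("utf-8"))      # b'A\xcc\x88rende.jpg'
```

The two names look the same on screen but are different byte sequences, so a URL that refers to one does not match a file stored under the other.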

I'm looking for a way to change the Unicode normalization form of the filenames during (preferably) or after the transfer, since I'm guessing it's impossible to do beforehand. convmv looks like a good option for doing it afterwards, but I don't have sufficient permissions on this server, so it's not possible in this particular case. I've tried FTP upload using Transmit and rsync (using a deploy script I normally use), to no avail. The --iconv option in rsync seemed ideal, but unfortunately my server, running rsync 2.6.9, did not recognize it.

I'm guessing quite a few people are having similar issues, so I'll be happy to hear any solution or workaround!

UPDATE: In this case I ended up rsyncing the files to a virtual machine running Ubuntu, running convmv on them there, and then rsyncing again to my staging server. While this works fairly well, it is a bit time-consuming. Perhaps it would be possible to mount an ext file system on OS X and just store the files there instead, keeping their original NFC file names?
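For anyone who has Python 3 but not Perl on the Linux side, the renaming step could be sketched roughly like this. This is not convmv, just an approximation of the NFD-to-NFC rename. The function name rename_tree_to_nfc is made up for illustration, and it assumes a file system that stores names exactly as given (so not HFS+).

```python
import os
import unicodedata

def rename_tree_to_nfc(root):
    """Rename files and directories under root to NFC; return the (old, new) pairs."""
    renamed = []
    # Walk bottom-up so entries are renamed before their parent directories.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            target = unicodedata.normalize("NFC", name)
            if target == name:
                continue  # already NFC
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, target)
            if os.path.lexists(dst):
                # Both forms exist side by side; leave them for manual review.
                continue
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

Calling rename_tree_to_nfc on the upload directory after the transfer should leave only NFC names, except where an NFC twin already exists, which is skipped on purpose so nothing gets overwritten.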

Also, to avoid this problem altogether on future WordPress installs, which was my use case, you could add a simple add_filter('sanitize_file_name', 'remove_accents'); before uploading any files and you should be fine.

Estrone answered 28/9, 2012 at 15:58 Comment(5)
Can you post an example of such a name? I haven't had issues with unicode filenames between OSX and Linux.Kurr
I'm on a Swedish system, so mainly it is file names containing the characters å/Å, ä/Ä and ö/Ö, but also accented characters like é, for example. Even if they're NFC on the remote machine, they become NFD as soon as they're on an HFS+ disk, and transferring them back does not change them back into NFC. In the Ubuntu shell I can actually see that an NFD å is different (made up of a+°), since the å that I type has a slightly smaller ring that's more unified with the a character.Estrone
I've reproduced the problem. It seems Linux actually does not do any Unicode normalization at all, which is arguably a bug. I tested with a file named Ä: after transferring it to OS X and back, there were suddenly two files with apparently the same name, although the newly received file's name is treated as a two-character name (it matches ?? but not ?).Kurr
convmv really seems to be the best approach. What do you mean by "don't have sufficient permissions on the server" for convmv?Kurr
The site I was working on when asking this question is running in a shared hosting environment where I have shell access and can download convmv, but oddly Perl is missing. Installing Perl is not possible due to permission/shell restrictions, which is a bit unclear in my phrasing above, I guess.Estrone
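The "two files with the same name" situation from the comments above can be detected with a small Python sketch like the following. The helper name find_normalization_collisions is hypothetical, and it assumes a Linux file system that keeps both byte sequences as separate entries.

```python
import os
import unicodedata
from collections import defaultdict

def find_normalization_collisions(directory):
    """Map each NFC name to the directory entries that normalize to it, if more than one."""
    groups = defaultdict(list)
    for name in os.listdir(directory):
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only names that exist in more than one normalization form.
    return {nfc: sorted(names) for nfc, names in groups.items() if len(names) > 1}
```

An empty result means no name exists in more than one form in that directory, so renaming to NFC would not clash with an existing file.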

It seems that rsync --iconv is the best solution, as you can transfer the files and transcode the names all in one step. You just need to convince your host to upgrade their rsync. Given that the --iconv feature was introduced in rsync 3.0.0, which was released in 2008, it's a bit odd that your host is still running rsync 2.6.9.

If you can't convince your host to install an up-to-date rsync, you could compile your own rsync, upload it somewhere like ~/bin on the server, and add that directory to your PATH ahead of the system-installed rsync. Then you should be able to use the --iconv option. This should work as long as you are using rsync over SSH (the default) rather than the rsync daemon, because rsync over SSH works by SSHing to the remote machine and running rsync --server with the same options that you passed to your local rsync.

Or you could find a host that has up-to-date tools and Perl installed.

Viceregal answered 2/10, 2012 at 22:16 Comment(1)
I agree, and I am using rsync over SSH but in this particular case compiling my own rsync or perl is not an option, as not even machine is available on the server so I'm not sure what architecture to compile for. But definitely the best option.Estrone

Currently I'm using rsync --iconv like this:

Given a Linux server and an OS X machine:

Copying files from server to machine

You should execute this command from the server (it won't work from OS X):

rsync -a --iconv=UTF-8,UTF-8-MAC /home/username/path/on/server/ '[email protected]:/Users/username/path/on/machine/'

Copying files from machine to server

You should execute this command from the machine:

rsync -a --iconv=UTF-8-MAC,UTF-8 /Users/username/path/on/machine/ '[email protected]:/home/username/path/on/server/'
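
The part that is easy to get backwards is the order of the two charsets: --iconv takes the local charset first and the remote one second, relative to wherever the command is run. Here is a small Python sketch of how a deploy script might build the command. The helper name, the host user@server.example and the -a flag are assumptions for illustration, not part of the commands above.

```python
import shlex

def rsync_iconv_command(src, dest, local_charset, remote_charset):
    """Build an rsync argument list; --iconv expects LOCAL,REMOTE in that order."""
    # -a (archive mode, which recurses into directories) is an assumption here.
    return ["rsync", "-a", f"--iconv={local_charset},{remote_charset}", src, dest]

# Upload from the OS X machine (local names are UTF-8-MAC) to the Linux server.
upload = rsync_iconv_command(
    "/Users/username/path/on/machine/",
    "user@server.example:/home/username/path/on/server/",  # hypothetical host
    "UTF-8-MAC",
    "UTF-8",
)
print(shlex.join(upload))
```

The resulting list can be passed to subprocess.run(upload, check=True) on a machine whose rsync is 3.0.0 or newer.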
Chumley answered 10/9, 2014 at 14:10 Comment(1)
FWIW, OSX's rsync still (as of 10.12) doesn't have --iconv - luckily the homebrew/dupes tap has an up-to-date rsync.Bubalo
