How do I use filesystem functions in PHP, using UTF-8 strings?

I can't use mkdir to create folders with UTF-8 characters:

<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>

when I browse this folder in Windows Explorer, the folder name looks like this:

DepÃ³sito

What should I do?

I'm using php5

Photogenic answered 6/10, 2009 at 14:10 Comment(0)
G
25

Just urlencode the string you want to use as a filename. All characters returned by urlencode are valid in filenames (NTFS/HFS/UNIX), and you can urldecode the filenames back to UTF-8 (or whatever encoding they were in).
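A minimal sketch of that round trip, assuming the name is already UTF-8. The scratch directory under sys_get_temp_dir() and the variable names are only there for illustration:

```php
<?php
// Sketch: keep a UTF-8 name on disk in URL-encoded (pure ASCII) form.
$utf8Name = "Dep\xC3\xB3sito";         // "Depósito" as UTF-8 bytes
$diskName = urlencode($utf8Name);      // ASCII-only name used on disk

// Hypothetical scratch location so the example does not touch real data.
$base = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'utf8fs_' . uniqid();
mkdir($base);
mkdir($base . DIRECTORY_SEPARATOR . $diskName);

// Reading back: decode every entry to recover the original UTF-8 names.
$entries = array_diff(scandir($base), ['.', '..']);
$decoded = array_values(array_map('urldecode', $entries));

// Clean up the scratch directories.
rmdir($base . DIRECTORY_SEPARATOR . $diskName);
rmdir($base);
```

The folder shows up as Dep%C3%B3sito in a file manager, so this only suits names that users never browse directly.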

Caveats (all apply to the solutions below as well):

  • After URL-encoding, the filename must be less than 255 characters (probably bytes).
  • UTF-8 has multiple representations for many characters (using combining characters). If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
  • You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
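The caveats above could be handled roughly like this. It is a sketch, not a complete solution: encode_filename() and list_decoded() are hypothetical helper names, the 255-byte limit is the assumption from the first caveat, and normalization and collation are only applied when the intl extension happens to be loaded (PHP 7 or later is assumed for the type declarations):

```php
<?php
// Hypothetical helper: turn a UTF-8 name into an on-disk name, or fail loudly.
function encode_filename(string $utf8Name): string
{
    // Normalize to NFC when intl is available, so that "o" + combining
    // accent and the precomposed letter yield the same file name.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($utf8Name, Normalizer::FORM_C);
        if ($normalized !== false) {
            $utf8Name = $normalized;
        }
    }
    $encoded = urlencode($utf8Name);
    // Assumed per-name limit; check your filesystem before relying on it.
    if (strlen($encoded) > 255) {
        throw new LengthException('Encoded file name is too long');
    }
    return $encoded;
}

// Hypothetical helper: list a directory as decoded, sorted UTF-8 names.
function list_decoded(string $dir): array
{
    $names = array_map('urldecode', array_diff(scandir($dir), ['.', '..']));
    if (class_exists('Collator')) {
        // Collation-aware sort; 'root' selects the default Unicode collation.
        (new Collator('root'))->sort($names);
    } else {
        sort($names, SORT_STRING); // byte-order fallback, not collation-aware
    }
    return array_values($names);
}
```

Usage would be along the lines of mkdir($parent . '/' . encode_filename($name)) when writing and list_decoded($parent) when reading.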

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as Ã³ in Windows Explorer.

  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.

Caveats galore!

  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
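For the second choice, a sketch using mb_convert_encoding, which also covers the non-ISO-8859-1 case from the last caveat. The helper names to_fs() and from_fs() and the $fsEncoding parameter are made up for this example, and the mbstring extension is assumed to be loaded. Note that utf8_decode and utf8_encode are deprecated as of PHP 8.2, so mb_convert_encoding is the safer bet anyway:

```php
<?php
// Assumption: the Windows code page is compatible with ISO-8859-1.
// Pass another mbstring encoding name (e.g. 'Windows-1252') if yours differs.

// UTF-8 string -> bytes to hand to mkdir(), fopen(), etc.
function to_fs(string $utf8, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8, $fsEncoding, 'UTF-8');
}

// Bytes returned by scandir(), readdir(), etc. -> UTF-8 string.
function from_fs(string $fsName, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($fsName, 'UTF-8', $fsEncoding);
}
```

You would then call mkdir(to_fs($dir_name)) and map from_fs over whatever scandir returns. This only applies to the old behaviour described in this answer.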

This nightmare is why you should probably just transliterate to create filenames.
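A rough illustration of transliterating, using a deliberately tiny hand-written map. The function name ascii_filename() and the map are assumptions for this example; in practice intl's Transliterator or iconv with //TRANSLIT would probably do the heavy lifting, though their output varies between platforms. The "\u{...}" escapes need PHP 7 or later:

```php
<?php
// Hypothetical helper with a deliberately small map; extend it for your data.
function ascii_filename(string $utf8Name): string
{
    $map = [
        "\u{E1}" => 'a', "\u{E9}" => 'e', "\u{ED}" => 'i',  // á é í
        "\u{F3}" => 'o', "\u{FA}" => 'u', "\u{F1}" => 'n',  // ó ú ñ
        "\u{E7}" => 'c', "\u{FC}" => 'u', "\u{DF}" => 'ss', // ç ü ß
    ];
    $ascii = strtr($utf8Name, $map);
    // Any run of bytes outside a conservative ASCII set becomes one "_".
    return preg_replace('/[^A-Za-z0-9._ -]+/', '_', $ascii);
}

echo ascii_filename("Dep\u{F3}sito"), PHP_EOL; // Deposito
```

The mapping is lossy, so keep the original UTF-8 name somewhere (a database column, for instance) if you need to show it again.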

Gothurd answered 25/10, 2009 at 14:28 Comment(3)
ISO-8859-1 is not more useful on Windows than ISO-8859-2 or ISO-8859-3. If you want to be safe, go with the 7-bit ASCII. - Frighten
This answer doesn't work for me. mkdir('Depósito') creates Dep%C3%B3sito which I can't really believe is what the OP wants, even though he accepted this answer. See Umberto Salsi's answer for what is really going on and how to build a proper solution with setlocale() and iconv(). - Bethelbethena
PHP's behaviour has changed in PHP 7.1 - have a look at https://mcmap.net/q/162350/-how-do-i-use-filesystem-functions-in-php-using-utf-8-strings - Maurice
K
12

Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see the setlocale() function). For example, it may evaluate to something like en_US.UTF-8, which means the encoding is UTF-8. File names and their paths can then be created with fopen() or retrieved by dir() in this encoding.
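To see what your own system reports, something like the following could be used. locale_codeset() is a hypothetical helper that just splits the locale name, and the nullable return type assumes PHP 7.1 or later:

```php
<?php
// Hypothetical helper: extract the codeset part of a locale name such as
// "en_US.UTF-8" or "Portuguese_Brazil.1252"; returns null if there is none.
function locale_codeset(string $locale): ?string
{
    $locale = explode('@', $locale, 2)[0];   // drop modifiers like "@euro"
    $dot = strrpos($locale, '.');
    if ($dot === false) {
        return null;                         // e.g. the plain "C" locale
    }
    $codeset = substr($locale, $dot + 1);
    return ($codeset === false || $codeset === '') ? null : $codeset;
}

// Passing "0" queries the current LC_CTYPE setting without changing it.
$current = setlocale(LC_CTYPE, '0');
$codeset = $current === false ? null : locale_codeset($current);
echo 'LC_CTYPE: ', var_export($current, true),
     ', codeset: ', var_export($codeset, true), PHP_EOL;
```

If the codeset comes back as null (for instance with the "C" locale), the locale tells you nothing about the file name encoding.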

Under Windows, PHP operates as a "non-Unicode aware program", so file names are converted back and forth between the UTF-16 used by the file system (Windows 2000 and later) and the selected "code page". In the "Regional and Language Options" control panel, the "Formats" tab sets the code page retrieved by the LC_CTYPE option, while "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252, where 1252 is the code page, also known as the "Windows-1252 encoding", which is similar (but not identical) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you can only create files whose names can be expressed in the current code page. Conversely, file names and paths retrieved from the file system are converted from UTF-16 to bytes using a "best-fit" mapping to the current code page.

This mapping is approximate, so some characters may be mangled in unpredictable ways. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt, as expected, if the current code page is 1252, while on a Japanese system it would come back as the approximate Caffe Brilli.txt, because accented vowels are missing from the 932 code page and are replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are returned as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
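Before creating a file you can at least check whether its name is expressible in a given code page. The sketch below does that with mbstring; representable_in() is a hypothetical name, and mbstring substitutes unmappable characters rather than applying Windows' "best-fit" rules, so this approximates the check and does not reproduce what Windows does. The "\u{...}" escapes need PHP 7 or later:

```php
<?php
// Hypothetical helper: true if every character of $utf8 survives a round
// trip through $codePage (an mbstring encoding name).
function representable_in(string $utf8, string $codePage): bool
{
    $bytes = mb_convert_encoding($utf8, $codePage, 'UTF-8');
    return mb_convert_encoding($bytes, 'UTF-8', $codePage) === $utf8;
}

$name = "Caff\u{E9} Brill\u{EC}.txt";                  // "Caffé Brillì.txt"
var_dump(representable_in($name, 'Windows-1252'));     // accented vowels exist in 1252
var_dump(representable_in($name, 'CP932'));            // but not in the Japanese code page
var_dump(representable_in("\u{20AC}", 'ISO-8859-1'));  // the euro sign is in 1252 only
```

The last call also shows one of the places where Windows-1252 and ISO-8859-1 differ.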

More details are available in my reply to the PHP bug no. 47096.

Keikokeil answered 4/4, 2012 at 0:35 Comment(0)
A
9

PHP 7.1 supports UTF-8 filenames on Windows, regardless of the OEM code page.

Amenity answered 19/7, 2016 at 19:17 Comment(2)
If a file (saved under PHP 7.0) is named DepÃ³sito on the filesystem, how does PHP 7.1 see it? I would think PHP 7.0 and 7.1 would see two different filenames, with BC implications. - Gothurd
You're right. How 7.1 sees it depends on default_charset. Your example shows the situation before 7.1, where a UTF-8 string is passed to the ANSI API. To force the old behavior, you only need to set default_charset to some single-byte code page, usually the system ANSI or OEM code page. Otherwise, with default_charset=UTF-8 as the default, file names are written and read correctly. More info here: github.com/php/php-src/blob/PHP-7.1/UPGRADING#L391. Thanks. - Amenity
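Putting the answer and the comments above together, the decision could be sketched like this. needs_codepage_conversion() is a hypothetical helper, and the rule it encodes is only my reading of the comment above, not something taken from the PHP source:

```php
<?php
// Hypothetical helper: decide whether a UTF-8 path has to be converted to a
// code page before it is handed to filesystem functions.
function needs_codepage_conversion(string $os, int $versionId, string $defaultCharset): bool
{
    // PHP_OS starts with "WIN" on Windows ("Darwin" only contains "win").
    if (stripos($os, 'WIN') !== 0) {
        return false; // elsewhere the bytes are passed through as they are
    }
    // Before 7.1 the ANSI API is used; from 7.1 on, UTF-8 names work as
    // long as default_charset is left at UTF-8.
    return $versionId < 70100 || strcasecmp($defaultCharset, 'UTF-8') !== 0;
}

$convert = needs_codepage_conversion(
    PHP_OS,
    PHP_VERSION_ID,
    (string) ini_get('default_charset')
);
var_dump($convert);
```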
F
7

The problem is that Windows uses UTF-16 for filesystem strings, whereas Linux and others use different character sets, often UTF-8. You provided a UTF-8 string, but Windows interprets it as another 8-bit character set encoding, maybe Latin-1, so the non-ASCII character, which is encoded with 2 bytes in UTF-8, is handled as if it were 2 characters.

A common solution is to keep your source code 100% ASCII and to keep the strings somewhere else.

Frighten answered 6/10, 2009 at 14:19 Comment(1)
I haven't tried it, but can't you use mb_convert_encoding to convert the string to UTF-16? - Spoilsport
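On the UTF-16 idea: as far as I know it cannot work, because PHP's filesystem functions take byte strings and reject paths that contain NUL bytes, and UTF-16 text is full of them. A quick way to see this (mbstring assumed):

```php
<?php
// "Depósito" (8 characters) converted from UTF-8 to little-endian UTF-16.
$utf16 = mb_convert_encoding("Dep\xC3\xB3sito", 'UTF-16LE', 'UTF-8');

// Every character here takes two bytes; for ASCII letters the second is NUL.
var_dump(strlen($utf16));                 // int(16)
var_dump(strpos($utf16, "\0") !== false); // bool(true): NUL bytes present
```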
P
3

Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 files/folders names.

I packaged this as a PHP stream wrapper, so it's very easy to use:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php

First verify that the com_dotnet extension is enabled in your php.ini then enable the wrapper with:

stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');

Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://

For example:

<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>
Pilpul answered 30/11, 2013 at 10:45 Comment(0)
I
3

You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio

$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);
Importation answered 3/9, 2015 at 10:6 Comment(0)
K
0

Try the CodeIgniter Text helper. Read about its convert_accented_characters() function; it can be customised.

Kirstiekirstin answered 20/2, 2012 at 11:42 Comment(0)
N
0

My set of tools for using the filesystem with UTF-8 names on Windows or Linux via PHP, compatible with .htaccess file-exists checks:

function define_cur_os(){
    //$cur_os=strtolower(php_uname());
    $cur_os=strtolower(PHP_OS);
    if(substr($cur_os, 0, 3) === 'win'){
        $cur_os='windows';
    }
    define('CUR_OS',$cur_os);
}

function filesystem_encode($file_name=''){
    $file_name=urldecode($file_name);
    if(CUR_OS=='windows'){
        $file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
    }
    return $file_name;
}

function custom_mkdir($dir_path='', $chmod=0755){
    $dir_path=filesystem_encode($dir_path);
    if(!is_dir($dir_path)){
        if(!mkdir($dir_path, $chmod, true)){
            //handle mkdir error
        }
    }
    return $dir_path;
}

function custom_fopen($dir_path='', $file_name='', $mode='w'){
    if($dir_path!='' && $file_name!=''){
        $dir_path=custom_mkdir($dir_path);
        $file_name=filesystem_encode($file_name);
        return fopen($dir_path.$file_name, $mode);
    }
    return false;
}

function custom_file_exists($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_exists($file_path);
}

function custom_file_get_contents($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_get_contents($file_path);
}

Neodarwinism answered 23/7, 2014 at 15:59 Comment(1)
iconv(): Detected an illegal character in input string - Bedspring
C
0

I don't need to write much, it works well:

<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>
Cornejo answered 10/1, 2019 at 9:51 Comment(0)