The slugify filter currently used in Django translates (roughly) to the following Perl code:
    use Unicode::Normalize;

    sub slugify($) {
        my ($input) = @_;

        $input = NFKD($input);         # Normalize (decompose) the Unicode string
        $input =~ tr/\000-\177//cd;    # Strip non-ASCII characters (>127)
        $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
        $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
        $input = lc($input);           # Lowercase
        $input =~ s/[-\s]+/-/g;        # Replace each run of spaces and hyphens with a single hyphen

        return $input;
    }
Since you also want to change accented characters to unaccented ones, throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon).
In that case, the function could look like:
    use Unicode::Normalize;
    use Text::Unidecode;

    sub slugify_unidecode($) {
        my ($input) = @_;

        $input = NFC($input);          # Normalize (recompose) the Unicode string
        $input = unidecode($input);    # Convert non-ASCII characters to closest equivalents
        $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
        $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
        $input = lc($input);           # Lowercase
        $input =~ s/[-\s]+/-/g;        # Replace each run of spaces and hyphens with a single hyphen

        return $input;
    }
The former works well for strings that are primarily ASCII, but falls short when the entire string is formed of non-ASCII characters, since they all get stripped out, leaving you with an empty string.
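The empty-string case is easy to reproduce with just the first two steps of the Django-style function. This is a minimal sketch, assuming the input is already a decoded character string (here via `use utf8` source literals); `strip_to_ascii` is a hypothetical helper name, not part of either function above. It prints lengths rather than the strings themselves, to avoid depending on the terminal's output encoding.

```perl
use strict;
use warnings;
use utf8;                      # string literals below are character strings
use Unicode::Normalize;

# Hypothetical helper: only the NFKD + ASCII-strip steps of slugify
sub strip_to_ascii {
    my ($input) = @_;
    $input = NFKD($input);         # decompose
    $input =~ tr/\000-\177//cd;    # delete everything outside the ASCII range
    return $input;
}

for my $s ('hello world', '北亰') {
    my $ascii = strip_to_ascii($s);
    printf "%d character(s) in, %d left\n", length($s), length($ascii);
}
```

The ASCII string keeps all 11 characters, while the two CJK characters have no ASCII component to fall back on and nothing is left for the later steps to work with.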
Sample output of the two functions:

    string        | slugify       | slugify_unidecode
    -------------------------------------------------
    hello world   | hello-world   | hello-world
    北亰          |               | bei-jing
    liberté       | liberte       | liberte

Note how 北亰 gets slugified to nothing with the Django-inspired implementation. Note also the difference the normalization form makes: liberté becomes 'liberte' with NFKD, because only the combining accent (the second part of the decomposed character) is stripped out, but it would become 'libert' with NFC, because the re-assembled 'é' is stripped out whole.
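The effect of the normalization form can be checked in isolation. This is a sketch that again assumes decoded character input; `ascii_after` is a hypothetical helper that normalizes with the given form and then strips non-ASCII characters, and it uses the `normalize` function that Unicode::Normalize exports on request.

```perl
use strict;
use warnings;
use utf8;                               # 'liberté' below is a character string
use Unicode::Normalize qw(normalize);

# Hypothetical helper: normalize with the named form ('NFC', 'NFKD', ...),
# then delete every character outside the ASCII range.
sub ascii_after {
    my ($form, $input) = @_;
    my $out = normalize($form, $input);
    $out =~ tr/\000-\177//cd;
    return $out;
}

print ascii_after('NFKD', 'liberté'), "\n";   # liberte
print ascii_after('NFC',  'liberté'), "\n";   # libert
```

With NFKD the base letter 'e' survives the strip; with NFC the precomposed 'é' is a single non-ASCII character and disappears entirely.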