I'm reformulating the question, because
- @optional asked me to
- it wasn't clear, and it linked to an HTML::Mason based solution, Four easy steps to make Mason UTF-8 Unicode clean with Apache, mod_perl, and DBI, which caused confusion
- the original is 4 years old, and in the meantime (in 2012) "Poet" was created
Comment: This question already earned the "popular question badge", so I'm probably not the only hopeless person. :)
Unfortunately, demonstrating the full problem stack leads to a very long question, and it is very Mason specific.
First, the opinions-only part :)
I have been using HTML::Mason for ages, and am now trying to use Mason2. Poet and Mason are the most advanced frameworks on CPAN. I found nothing comparable that lets you write such clean /but very hackable :)/ web apps out of the box, with many batteries included (logging, caching, config management, native PSGI support, etc.)
Unfortunately, the author doesn't care about the rest of the world: by default it is ASCII-only, without any manual, FAQ or advice on how to use it with Unicode.
Now the facts. Demo. Create a Poet app:
poet new my #the "my" directory is the $poet_root
mkdir -p my/comps/xls
cd my/comps/xls
and add the following into dhandler.mc (it demonstrates the two basic problems):
<%class>
has 'dwl';
use Excel::Writer::XLSX;
</%class>
<%init>
my $file = $m->path_info;
$file =~ s/[^\w\.]//g;
my $cell = lc join ' ', "ÅNGSTRÖM", "in the", $file;
if( $.dwl ) {
    #create xlsx in the memory
    my $excel;
    open my $fh, '>', \$excel or die "Failed open scalar: $!";
    #pass the in-memory filehandle, not the (still empty) scalar
    my $workbook = Excel::Writer::XLSX->new( $fh );
    my $worksheet = $workbook->add_worksheet();
    $worksheet->write(0, 0, $cell);
    $workbook->close();
    #poet/mason output
    $m->clear_buffer;
    $m->res->content_type("application/vnd.ms-excel");
    $m->print($excel);
    $m->abort();
}
</%init>
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>
and run the app
../bin/run.pl
go to http://0:5000/xls/hello.xlsx and you will get:
+----------------------------+
| ÅngstrÖm in the hello.xlsx |
+----------------------------+
download hello.xlsx
Clicking download hello.xlsx, you will get hello.xlsx in the downloads.
The above demonstrates the first problem: the component's source isn't "under" use utf8, so lc doesn't understand the characters.
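To show the same effect outside Mason, here is a minimal plain-Perl sketch. It assumes the literal reaches perl as raw UTF-8 octets (which is what happens without use utf8), and it deliberately does not enable unicode_strings.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 octets of "ÅNGSTRÖM", as perl sees the literal without `use utf8`.
my $bytes = "\xC3\x85NGSTR\xC3\x96M";

# lc on octets only lowercases the ASCII letters; the two-byte
# sequences for Å and Ö are left untouched.
my $lc_bytes = lc $bytes;

# Decoding first gives lc real characters to work with.
my $lc_chars = lc decode('UTF-8', $bytes);

binmode STDOUT, ':raw';
print $lc_bytes, "\n";                     # ÅngstrÖm
print encode('UTF-8', $lc_chars), "\n";    # ångström
```

The second print encodes explicitly because the string holds characters at that point, not octets.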
The second problem is the following: try http://0:5000/xls/hélló.xlsx, or http://0:5000/xls/h%C3%A9ll%C3%B3.xlsx, and you will see:
+--------------------------+
| ÅngstrÖm in the hll.xlsx |
+--------------------------+
download hll.xlsx
#note the wrong filename
Of course, the input (the path_info) isn't decoded; the script works with the UTF-8 encoded octets and not with Perl characters.
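The filename problem can be reproduced in plain Perl too. This is only a sketch: unescape_octets is a made-up helper standing in for whatever the server does with the percent-encoded path, not something from Poet or Plack.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo the percent-encoding of a URL path, giving octets.
sub unescape_octets {
    my ($path) = @_;
    $path =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $path;
}

my $octets = unescape_octets('h%C3%A9ll%C3%B3.xlsx');

# On octets, \w matches only ASCII word characters,
# so the UTF-8 bytes of é and ó are stripped.
(my $from_octets = $octets) =~ s/[^\w\.]//g;

# After decoding, é and ó are single word characters and survive the filter.
my $chars = decode('UTF-8', $octets);
(my $from_chars = $chars) =~ s/[^\w\.]//g;

print $from_octets, "\n";    # hll.xlsx
```

Here $from_chars keeps all ten characters of the name, but it is a character string, so it still has to be encoded before it goes out.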
So, telling perl "the source is in utf8" by adding use utf8; into the <%class> block results in:
+--------------------------+
| �ngstr�m in the hll.xlsx |
+--------------------------+
download hll.xlsx
Adding use feature 'unicode_strings' (or use 5.014;) is even worse:
+----------------------------+
| �ngstr�m in the h�ll�.xlsx |
+----------------------------+
download h�ll�.xlsx
Of course, the source now contains wide characters, so it needs Encode::encode_utf8 at the output.
One could try using a filter such as:
<%filter uencode><% Encode::encode_utf8($yield->()) %></%filter>
and filter the whole output:
% $.uencode {{
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>
% }}
but this helps only partially, because one still needs to care about the encoding in the <%init> or <%perl> blocks.
Encoding/decoding inside the Perl code in many places (read: not at the borders) leads to spaghetti code.
The encoding/decoding should clearly be done somewhere at the Poet/Mason borders - of course, Plack operates on the byte level.
Partial solution.
Happily, Poet cleverly allows modifying its (and Mason's) parts, so in $poet_root/lib/My/Mason you could modify Compilation.pm to:
override 'output_class_header' => sub {
return join("\n",
super(), qq(
use 5.014;
use utf8;
use Encode;
)
);
};
which will insert the wanted preamble into every Mason component. (Don't forget to touch every component, or simply remove the compiled objects from $poet_root/data/obj.)
You could also try to handle the requests/responses at the borders, by editing $poet_root/lib/My/Mason/Request.pm to:
#found this code somewhere on the net
use Encode;
override 'run' => sub {
my($self, $path, $args) = @_;
#decode values - but still missing the "keys" decode
foreach my $k (keys %$args) {
$args->set($k, decode_utf8($args->get($k)));
}
my $result = super();
#encode the output - BUT THIS BREAKS the inline XLS
$result->output( encode_utf8($result->output()) );
return $result;
};
Encoding everything is the wrong strategy; it breaks e.g. the XLS.
So, 4 years later (I asked the original question in 2011) I still don't know :( how to correctly use Unicode in Mason2 applications, and there still isn't any documentation or helper for it. :(
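To make the goal concrete, here is a minimal sketch of the kind of check I have in mind, written in plain Perl outside Poet. The helpers is_textual and encode_body are made up for illustration and are not part of Poet or Mason, and the "only text/* is textual" rule is just an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not Poet/Mason API): treat only text/* types as textual.
sub is_textual {
    my ($content_type) = @_;
    my $is_text = defined($content_type) && $content_type =~ m{^text/}i;
    return $is_text ? 1 : 0;
}

# Hypothetical helper: encode character data for textual responses only,
# and pass anything else (e.g. an in-memory XLSX) through untouched.
sub encode_body {
    my ($content_type, $body) = @_;
    return is_textual($content_type) ? encode('UTF-8', $body) : $body;
}

# Characters in, UTF-8 octets out.
my $html = encode_body('text/html; charset=utf-8', "\x{E5}ngstr\x{F6}m");

# Binary data is returned exactly as it came in.
my $xlsx = encode_body('application/vnd.ms-excel', "PK\x03\x04\xC3\x85");
```

Where such a check should be hooked into Poet/Mason is exactly what I don't know.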
The main questions are:
- where (which methods should be modified with Moose's method modifiers) and how to correctly decode the inputs and encode the output (in the Poet/Mason app)
- but only the textual ones, e.g. text/plain or text/html and such...
- and do the above "surprise free", i.e. in a way that simply works. ;)
Could someone please help with real code - what should I modify in the above?
run checking the existence of the charset and content-type; if it isn't already set, set it to utf8 and do the encode (if the content-type is plain or html). OK, this logic is acceptable until you want to serve another mime-type, like text/css. Rewriting run every time a new mime-type is served isn't very dev-friendly. Hm, maybe I could add another (3rd) elsif that says: if the charset is already set, do the encode for any mime-type, and this will indicate the need for the encode. Acceptable - but will wait until the bounty ends. ;) – Auto