How to make Mason2 UTF-8 clean?

Reformulating the question, because

Comment: This question already earned the "Popular Question" badge, so I'm probably not the only hopeless person. :)

Unfortunately, demonstrating the full problem stack leads to a very long question, and it is very Mason-specific.

First, the opinions-only part :)

I have used HTML::Mason for ages and am now trying to use Mason2. Poet and Mason are the most advanced frameworks on CPAN. I found nothing comparable that, out of the box, allows writing such clean (but very hackable :)) web apps with many batteries included (logging, caching, config management, native PSGI support, etc.).

Unfortunately, the author doesn't care about the rest of the world: by default it is ASCII-only, without any manual, FAQ, or advice on how to use it with Unicode.

Now the facts. Demo. Create a Poet app:

poet new my #the "my" directory is the $poet_root
mkdir -p my/comps/xls
cd my/comps/xls

and add the following to dhandler.mc (it demonstrates the two basic problems):

<%class>
    has 'dwl';
    use Excel::Writer::XLSX;
</%class>
<%init>
    my $file = $m->path_info;

    $file =~ s/[^\w\.]//g;
    my $cell = lc join ' ', "ÅNGSTRÖM", "in the", $file;

    if( $.dwl ) {
        #create xlsx in the memory
        my $excel;
        open my $fh, '>', \$excel or die "Failed open scalar: $!";
        my $workbook  = Excel::Writer::XLSX->new( $fh );
        my $worksheet = $workbook->add_worksheet();
        $worksheet->write(0, 0, $cell);
        $workbook->close();

        #poet/mason output
        $m->clear_buffer;
        $m->res->content_type("application/vnd.ms-excel");
        $m->print($excel);
        $m->abort();
    }
</%init>
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>

and run the app

../bin/run.pl

go to http://0:5000/xls/hello.xlsx and you will get:

+----------------------------+
| ÅngstrÖm in the hello.xlsx |
+----------------------------+
download hello.xlsx

Clicking "download hello.xlsx" gives you hello.xlsx in your downloads.

The above demonstrates the first problem: the component's source isn't "under" use utf8;, so lc doesn't understand the non-ASCII characters.
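
The difference can be reproduced outside Mason. The following is a minimal sketch, not part of the original demo: it spells the UTF-8 octets of "ÅNGSTRÖM" with escapes (so the result does not depend on the file's own encoding) and compares lc on the raw octets with lc on the decoded characters. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# UTF-8 octets for "ÅNGSTRÖM", written with escapes so this file stays ASCII.
my $octets = "\xC3\x85NGSTR\xC3\x96M";

# On a byte string, lc only touches the ASCII letters.
my $lc_bytes = lc $octets;

# On a decoded character string, lc also lowercases Å and Ö.
my $lc_chars = lc decode_utf8($octets);

binmode STDOUT, ':raw';
print $lc_bytes, "\n";                 # Å and Ö stay uppercase
print encode_utf8($lc_chars), "\n";    # fully lowercased, re-encoded for output
```

This is the same effect that use utf8; has on string literals in a component: it makes Perl treat them as characters instead of octets.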

The second problem is the following: try http://0:5000/xls/hélló.xlsx or http://0:5000/xls/h%C3%A9ll%C3%B3.xlsx and you will see:

+--------------------------+
| ÅngstrÖm in the hll.xlsx |
+--------------------------+
download hll.xlsx
#note the wrong filename

Of course, the input (the path_info) isn't decoded; the script works with the UTF-8 encoded octets and not with Perl characters.
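
A standalone sketch of why the filename loses its accented letters, assuming the server hands the component the already URI-unescaped octets of the path. The sample value and variable names are hypothetical and not taken from Poet.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical path_info value: the octets of "hélló.xlsx" in UTF-8.
my $raw = "h\xC3\xA9ll\xC3\xB3.xlsx";

# Same substitution as in the component, applied to the raw octets:
# bytes above 0x7F are not word characters, so they are stripped.
(my $from_bytes = $raw) =~ s/[^\w\.]//g;

# Applied to decoded characters, é and ó match \w and survive.
(my $from_chars = decode_utf8($raw)) =~ s/[^\w\.]//g;

print "$from_bytes\n";                          # hll.xlsx
print length($from_chars), " characters\n";     # 10 characters
```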

So, telling Perl that "the source is in UTF-8" by adding use utf8; to the <%class> block results in:

+--------------------------+
| �ngstr�m in the hll.xlsx |
+--------------------------+
download hll.xlsx

Adding use feature 'unicode_strings' (or use 5.014;) is even worse:

+----------------------------+
| �ngstr�m in the h�ll�.xlsx |
+----------------------------+
download h�ll�.xlsx

Of course: the strings are now decoded Perl characters rather than octets, so the output needs Encode::encode_utf8.
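
To make the character/octet distinction concrete, here is a small sketch (illustrative names, independent of Mason). A character string holding "ångström" has 8 characters; the browser expects the 10 UTF-8 octets that encode_utf8 produces. Sending the unencoded string emits single Latin-1 bytes for å and ö, which are not valid UTF-8 and therefore show up as replacement characters.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "ångström" as a Perl character string.
my $chars = "\x{E5}ngstr\x{F6}m";

# The octets that should actually go over the wire.
my $octets = encode_utf8($chars);

printf "%d characters, %d octets\n", length($chars), length($octets);
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
```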

One could try to use a filter such as:

<%filter uencode><% Encode::encode_utf8($yield->()) %></%filter>

and filter the whole output:

% $.uencode {{
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>
% }}

but this helps only partially, because you still need to take care of the encoding in the <%init> or <%perl> blocks. Encoding and decoding inside the Perl code in many places (read: not at the borders) leads to spaghetti code.

The encoding/decoding should clearly be done somewhere at the Poet/Mason borders, since Plack, of course, operates at the byte level.


Partial solution.

Happily, Poet cleverly allows modifying its (and Mason's) parts, so in $poet_root/lib/My/Mason you could modify Compilation.pm to:

override 'output_class_header' => sub {
    return join("\n",
        super(), qq(
        use 5.014;
        use utf8;
        use Encode;
        )
    );
};

which inserts the wanted preamble into every Mason component. (Don't forget to touch every component, or simply remove the compiled objects from $poet_root/data/obj.)

You could also try to handle the requests/responses at the borders by editing $poet_root/lib/My/Mason/Request.pm to:

#found this code somewhere on the net
use Encode;
override 'run' => sub {
    my($self, $path, $args) = @_;

    #decode values - but still missing the "keys" decode
    foreach my $k (keys %$args) {
        $args->set($k, decode_utf8($args->get($k)));
    }

    my $result = super();

    #encode the output - BUT THIS BREAKS the inline XLS
    $result->output( encode_utf8($result->output()) );
    return $result;
};

Encoding everything is the wrong strategy; it breaks, e.g., the XLS.
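
Why binary output breaks can be shown in isolation. This is a sketch with made-up sample bytes: an .xlsx file is a ZIP archive, so it starts with the ZIP signature, and the two high bytes below merely stand in for compressed data. encode_utf8 treats every byte as a code point and expands each byte above 0x7F into two bytes, which corrupts the archive.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# ZIP signature followed by two arbitrary high bytes (stand-ins for real data).
my $binary = "PK\x03\x04\xFF\x80";

# Encoding already-binary data expands every byte above 0x7F.
my $broken = encode_utf8($binary);

printf "before: %d bytes, after: %d bytes\n", length($binary), length($broken);
```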

So, four years later (I asked the original question in 2011), I still don't know :( how to use Unicode correctly in Mason2 applications, and there is still no documentation or helper for it. :(

The main questions are: where (which methods should be modified with Moose's method modifiers) and how to correctly decode the inputs and encode the output in a Poet/Mason app,

  • but only for textual content, e.g. text/plain, text/html and such...
  • and do the above "surprise free", i.e. in a way that simply works. ;)

Could someone please help with real code: what should I modify in the above?

Auto answered 2/5, 2011 at 14:47 Comment(0)
B
1

OK, I've tested this with Firefox. The HTML displays the UTF-8 correctly and leaves the zip alone, so should work everywhere.

If you start with poet new My, you can apply the patch with patch -p1 -i...path/to/thisfile.diff.

diff -ruN orig/my/comps/Base.mc new/my/comps/Base.mc
--- orig/my/comps/Base.mc   2015-05-20 21:48:34.515625000 -0700
+++ new/my/comps/Base.mc    2015-05-20 21:57:34.703125000 -0700
@@ -2,9 +2,10 @@
 has 'title' => (default => 'My site');
 </%class>

-<%augment wrap>
-  <html>
+<%augment wrap><!DOCTYPE html>
+  <html lang="en-US">
     <head>
+      <meta charset="utf-8">
       <link rel="stylesheet" href="/static/css/style.css">
 % $.Defer {{
       <title><% $.title %></title>
diff -ruN orig/my/comps/xls/dhandler.mc new/my/comps/xls/dhandler.mc
--- orig/my/comps/xls/dhandler.mc   1969-12-31 16:00:00.000000000 -0800
+++ new/my/comps/xls/dhandler.mc    2015-05-20 21:53:42.796875000 -0700
@@ -0,0 +1,30 @@
+<%class>
+    has 'dwl';
+    use Excel::Writer::XLSX;
+</%class>
+<%init>
+    my $file = $m->path_info;
+    $file = decode_utf8( $file );
+    $file =~ s/[^\w\.]//g;
+    my $cell = lc join ' ', "ÅNGSTRÖM", "in the", $file ;
+    if( $.dwl ) {
+        #create xlsx in the memory
+        my $excel;
+        open my $fh, '>', \$excel or die "Failed open scalar: $!";
+        my $workbook  = Excel::Writer::XLSX->new( $fh );
+        my $worksheet = $workbook->add_worksheet();
+        $worksheet->write(0, 0, $cell);
+        $workbook->close();
+
+        #poet/mason output
+        $m->clear_buffer;
+        $m->res->content_type("application/vnd.ms-excel");
+        $m->print($excel);
+        $m->abort();
+    }
+</%init>
+<table border=1>
+<tr><td><% $cell %></td></tr>
+</table>
+<p> <a href="%c3%85%4e%47%53%54%52%c3%96%4d%20%68%c3%a9%6c%6c%c3%b3">ÅNGSTRÖM hélló</a>
+<p> <a href="?dwl=yes">download <% $file %></a>
diff -ruN orig/my/lib/My/Mason/Compilation.pm new/my/lib/My/Mason/Compilation.pm
--- orig/my/lib/My/Mason/Compilation.pm 2015-05-20 21:48:34.937500000 -0700
+++ new/my/lib/My/Mason/Compilation.pm  2015-05-20 21:49:54.515625000 -0700
@@ -5,11 +5,13 @@
 extends 'Mason::Compilation';

 # Add customizations to Mason::Compilation here.
-#
-# e.g. Add Perl code to the top of every compiled component
-#
-# override 'output_class_header' => sub {
-#      return join("\n", super(), 'use Foo;', 'use Bar qw(baz);');
-# };
-
+override 'output_class_header' => sub {
+    return join("\n",
+        super(), qq(
+        use 5.014;
+        use utf8;
+        use Encode;
+        )
+    );
+};
 1;
\ No newline at end of file
diff -ruN orig/my/lib/My/Mason/Request.pm new/my/lib/My/Mason/Request.pm
--- orig/my/lib/My/Mason/Request.pm 2015-05-20 21:48:34.968750000 -0700
+++ new/my/lib/My/Mason/Request.pm  2015-05-20 21:55:03.093750000 -0700
@@ -4,20 +4,27 @@

 extends 'Mason::Request';

-# Add customizations to Mason::Request here.
-#
-# e.g. Perform tasks before and after each Mason request
-#
-# override 'run' => sub {
-#     my $self = shift;
-#
-#     do_tasks_before_request();
-#
-#     my $result = super();
-#
-#     do_tasks_after_request();
-#
-#     return $result;
-# };
+use Encode qw/ encode_utf8 decode_utf8 /;

-1;
\ No newline at end of file
+override 'run' => sub {
+    my($self, $path, $args) = @_;
+    foreach my $k (keys %$args) {
+        my $v = $args->get($k);
+        $v=decode_utf8($v);
+        $args->set($k, $v);
+    }
+    my $result = super();
+    my( $ctype, $charset ) = $self->res->headers->content_type_charset;
+    if( ! $ctype ){
+        $ctype = 'text/html';
+        $charset = 'UTF-8';
+        $self->res->content_type( "$ctype; charset=$charset");
+        $result->output( encode_utf8(''.( $result->output())) );
+    } elsif( ! $charset and $ctype =~ m{text/(?:plain|html)} ){
+        $charset = 'UTF-8';
+        $self->res->content_type( "$ctype; charset=$charset");
+        $result->output( encode_utf8(''.( $result->output())) );
+    }
+    return $result;
+};
+1;
Burse answered 21/5, 2015 at 5:2 Comment(2)
So, you recommend checking in run whether the charset and content-type exist; if they are not already set, set UTF-8 and do the encode (if the content-type is plain or html). OK, this logic is acceptable until you want to serve another MIME type, like text/css. Rewriting run every time a new MIME type is served isn't very dev-friendly. Hm, maybe I could add another (third) elsif that says: if the charset is already set, do the encode for any MIME type, and this will indicate the need for the encode. Acceptable, but I will wait until the bounty ends. ;) Auto
Good point. Probably /^text/ is generic enough, in addition to another elsif ... It might also be a good idea to make sure $path is Unicode, like my $result = $self->SUPER::run( decode_utf8( $path ), $args ); Burse
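
The rule discussed in these comments (encode anything under text/, leave everything else alone, default to HTML when nothing is set) could be factored into one predicate so that run does not need a new branch per MIME type. This is only a sketch: is_textual_type is a hypothetical helper name, not part of Mason or Poet, and whether text/* is the right set for a given app is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a body with this Content-Type should be
# UTF-8 encoded before it leaves Mason.
sub is_textual_type {
    my ($content_type) = @_;

    # Nothing set yet: the answer above treats this as text/html.
    return 1 if !defined $content_type || $content_type eq '';

    # Drop parameters such as "; charset=UTF-8" and compare the bare type.
    my ($type) = split /\s*;\s*/, $content_type;
    return lc($type) =~ m{^text/} ? 1 : 0;
}

for my $ct ('text/css', 'text/html; charset=UTF-8', 'application/vnd.ms-excel') {
    printf "%-28s %s\n", $ct, is_textual_type($ct) ? 'encode' : 'leave as is';
}
```

Inside the overridden run, the two branches would then collapse into a single check on the response's content type before calling encode_utf8 on the output.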
P
4

The Mason2 manual describes how component inheritance works, so I think that putting this common code in your main Base.mp component (from which all the others inherit) might solve your issue.

Creating plugins is described in Mason::Manual::Plugins.

So, you can build your own plugin that modifies Mason::Request and by overriding the request_args() you can return the UTF-8 decoded parameters.

Edit:

Regarding the UTF-8 output, you can add an Apache directive to ensure that text/plain and text/html outputs are always interpreted as UTF-8:

AddDefaultCharset utf-8
Phi answered 20/5, 2011 at 5:11 Comment(4)
The problem has more parts. At the input: yes, the octets of all GET/POST args (and probably PATH_INFO too) need to be decoded into characters. At the output, the easy part is sending the correct XML/HTML headers. But whatever is sent to the browser must be converted from characters into octets before it leaves Mason, because Plack will complain otherwise. And of course, another plugin is needed that automatically inserts "use utf8;" into every component source, because all components will use UTF-8 in the code... plus the "use open (:std :utf8);" pragma too. Auto
But thanks for the tip about modifying the request_args method. Auto
An Apache directive is not a solution. Mason can work as a CLI and under Plack, so all UTF-8 conversions must be done at the Mason level (plugins). Fortunately, setting up the correct header is easy. I'm looking for help with the Mason plugin source code: which classes and which methods should I modify, and how? Auto
Since there is no answer here that does what I wanted to see, but you're the only one, I award you the bounty. :) Auto
W
1

On the mason-users mailing list there was a question about handling UTF-8 for

  1. components output with UTF-8
  2. handling UTF-8 GET/POST arguments

Here is Jon's answer:

I'd like Mason to handle encoding intelligently, but since I don't regularly work with utf8, you and others will have to help me with the design.

This should probably be in a plugin, e.g. Mason::Plugin::UTF8.

So for the things you particularly mention, something like this might work:

package Mason::Plugin::UTF8;
use Moose;
with 'Mason::Plugin';
1;

package Mason::Plugin::UTF8::Request;
use Mason::PluginRole;
use Encode;

# Encode all output in utf8 - ** only works with Mason 2.13 and beyond **
#
after 'process_output' => sub {
    my ($self, $outref) = @_;
    $$outref = encode_utf8( $$outref );
};

# Decode all parameters as utf8
#
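around 'run' => sub {
    my $orig = shift;
    my $self = shift;
    my $path = shift;    # the first argument to run is the component path

    my %params = @_;
    # Decode in place: values() aliases the hash values, whereas the copy
    # returned by each() would be discarded
    $_ = decode_utf8($_) for values %params;
    $self->$orig($path, %params);
};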

1;

It would probably be best if you or someone else knowledgeable about utf8 issues created this plugin rather than myself. But let me know if there are things needed in the Mason core to make this easier.

IMHO, the following is needed too, to add "use utf8;" to every component.

package Mason::Plugin::UTF8::Compilation;
use Mason::PluginRole;
override 'output_class_header' => sub {
    return(super() . 'use utf8;');
};

1;
Witness answered 24/7, 2011 at 14:32 Comment(2)
The solution with $$outref = encode_utf8( $$outref ); works only partially. For example, it (of course) breaks every inline-generated Excel document or any other binary generated with Mason2, i.e. applying encode_utf8 to everything isn't acceptable. Auto
kobame, please update your question with a short example of this "Excel" breakage ... maybe you can avoid encoding as utf8 if clear_buffer+abort were called. Burse
