How can I handle unicode with Perl's DBI?
Asked Answered
J

6

19

My delicious-to-wp perl script works but gives for all "weird" characters even weirder output. So I tried

$description = decode_utf8( $description ); 

but that doesnt make a difference. I would like e.g. “go live” to become “go live” and not “go live†How can I handle unicode in Perl so that this works?

UPDATE: I found the problem was to set utf of DBI I had to set in Perl:

my $sql = qq{SET NAMES 'utf8';};
$dbh->do($sql);

That was the part that I had to set, tricky. Thanks!

Jive answered 11/6, 2009 at 21:22 Comment(1)
Distill your code to the shortest possible script that still exhibits the problem. Take the database out of the equation to figure out if the problem has anything to do with Perl. Come up with something others can test and debug.Cantilena
B
3

It may have nothing to do with Perl. Check to make sure you're using UTF encodings in the pertinent MySQL table columns.

Boorer answered 11/6, 2009 at 21:26 Comment(0)
R
23

It's worth noting that if you're running a version of DBD::mysql new enough (3.0008 on), you can do the following: $dbh->{'mysql_enable_utf8'} = 1; and then everything's decode()ed/encode()ed for you on the way out from/in to DBI.

Ranchman answered 12/6, 2009 at 7:13 Comment(3)
For the lazy like myself, this is the dbic variant: Schema->connect("dbi:mysql:".$db{db}, $db{user}, $db{pass}, {mysql_enable_utf8 => 1} );Conspicuous
Solution in the answer doesn't work for me but solution in the comment above by @Conspicuous works like a charm.Bangkok
According to the documentation it can be done during the connect, which requires to further actions. If you do it post connect as suggested in the answer you also have to do a an extra command: SET NAMES utf8.Nephelinite
P
14

Enable UTF8, when you connect to database like this:

my $dbh = DBI->connect(
    "dbi:mysql:dbname=db_name", 
    "db_user", "db_pass",
     {RaiseError => 0, PrintError => 0, mysql_enable_utf8 => 1}
 ) or die "Connect to database failed.";

This should get you character mode strings with the UTF8 flag set as needed.

From DBI General Interface Rules & Caveats:

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.

And the specifics from DBD::mysql for mysql_enable_utf8

Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.

Panelboard answered 20/6, 2012 at 6:23 Comment(1)
Thank you for this. I tried updating the Content-Type and charset.... I pondered server side encoding and decoding.... all that seemed like perhaps I was missing something. My issue was the single apostrophe in an input field. Worked fine on my laptop, but an entry from my iphone (Both devices using Chrome by the way) I'd get a funky character that even existed in the database. But Hooray for your answer. That simple fix on my get_database_handle() routine solved my issue. Kudos to you.Catatonia
V
6

The term

$dbh->do(qq{SET NAMES 'utf8';});

definitely saves the day for accessing an utf-8 declared database, but take notice, if you are going to do any perl processing of any data obatined from the db it would be wise to store it in a perl var as an utf8 string with, as this operation is not implicit.

$utfstring = decode('utf8',$string_from_db);

of course, for proper i/o handling of utf8 strings (reading, printing, writing to output) remember to set

use open ':utf8';

and

binmode STDOUT, ":utf8";

the latter being essential for printing out utf8 strings. Hope this helps.

Verona answered 6/1, 2011 at 12:30 Comment(3)
How to get decode function? Perl shows error, saying cannot find the decode() routine. Thanks!Burgess
This fixed my problem..Have a perl cgi script that connects to a Postgres utf-8 database that returns a JSON string to the client using XHR. Characters were showing up on the client side all garbled until I added my $final_utf8 = Encode::encode_utf8($treeJSON); print CGI->header('application/json;charset=UTF-8'); print "$final_utf8"; as suggested in your post. Remember you need use Encode; at the top of your perl script to use this.Affinity
None of the answers up to this one couldn't solve my UTF8 DB problem but this one did. Thanks!Escapism
B
3

It may have nothing to do with Perl. Check to make sure you're using UTF encodings in the pertinent MySQL table columns.

Boorer answered 11/6, 2009 at 21:26 Comment(0)
N
1

Leave this öne out:

binmode STDOUT, ":utf8";

when using:

$dbh->do(qq{SET NAMES 'utf8';});

Otherwise your output will have double utf8 encoding, resulting in unreadable double byte characters! It took me a couple of hours to figure this out..

Norfolk answered 12/5, 2011 at 17:5 Comment(2)
If what you say is true then Perl does not know the data retrieved from the database was UTF-8 encoded and simply thinks it is bytes. Surely DBD::Mysql (or whatever DBD you are using) is decoding the data from the database for you in which case Perl should know you have Unicode characters and when they are written to stdout with the utf8 layer all will be well. If what you describe is what happens I'd suggest your DBD is broken. However, looking at latest DBD::Mysql it does do "sv_utf8_decode(sv);" at least on some data.Nonproductive
bohica, DBD::mysql normally provides bytes, implicit decoding is not turned on by default.Turfy
D
0

By default, the driver Perl/MySQL handles binary data (at least I concluded this from some experiments with MySQL 5.1 and 5.5).

Without setting mysql_enable_utf8, I encoded/decoded the strings to/from UTF-8 before writing/reading to/from the database.

It should not be relied upon the perl-internal string representation as an array of byte; be aware that the internal 'utf8' is not guaranteed to be standard UTF-8; in converse, the single byte encoding is not guaranteed to be ISO-8859-1; really do encode/decode to/from UTF-8 (and not 'utf8').

There are also some settings of MySQL (like SET NAMES above, as far as I remember there is a client encoding, a connection encoding, and a server encoding, whose interactions are not quite clear to me if they do not all have the same value) regarding to the encodings; setting all of them to UTF-8, and the recipe above, worked for me.

Dunlin answered 11/2, 2016 at 10:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.