How to detect latin1 and UTF-8?

I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to

#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;

my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";

my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);

print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;

if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }

outputs

$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mÃ¦gtig';
3

under the idea that only a latin1 string would increase its length, but encoding a string that is already UTF-8 also makes it longer. So I can't detect latin1 vs UTF-8 that way.
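
To make the length observation concrete, here is a minimal sketch using the same two byte strings as above (the variable names are only illustrative). `encode` treats every element of its input as a code point, so any byte at or above 0x80 becomes two bytes, whether it was a latin1 character or part of a UTF-8 sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin1_bytes = "m\x{e6}gtig";          # "mægtig" as iso-8859-1 bytes (6 bytes)
my $utf8_bytes   = "m\x{c3}\x{a6}gtig";    # "mægtig" as UTF-8 bytes (7 bytes)

# Each input element >= 0x80 is encoded as a two-byte sequence.
my $from_latin1 = encode('UTF-8', $latin1_bytes);   # correct UTF-8
my $from_utf8   = encode('UTF-8', $utf8_bytes);     # double-encoded

printf "%d -> %d\n", length($latin1_bytes), length($from_latin1);
printf "%d -> %d\n", length($utf8_bytes),   length($from_utf8);
```

Both strings grow, so a length comparison cannot tell the two encodings apart.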

Question

I would like to always end up with a UTF-8 string, but how can I detect whether a string is latin1 or UTF-8, so that I only convert the latin1 strings?

Being able to get a yes/no if a string is UTF-8 would be just as useful.

Sang answered 4/4, 2014 at 16:33 Comment(7)
Do you want a solution to guess what's the correct charset or do you want something accurate? Because the latter is not possible.Annieannihilate
If it is not possible to do it accurately, then guessing is better than nothing =)Sang
@deviantfan, Guessing is very accurate. See the footnote in my answer.Ballentine
@ikegami: It's still guessing. I'm not saying this is bad, but that won't change the fact.Annieannihilate
@deviantfan, You seem to have misread something. I never said it wasn't guessing.Ballentine
@ikegami: I'm not pretending anything? I didn't mean it in any bad way, if you understood it so.Annieannihilate
Can't you avoid all this by going back to whoever is supplying you with this data and asking them to provide valid UTF8?Ericaericaceous

Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings[1].

As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.

  • utf8:: implementation:

    my $decoded_text = $utf8_or_latin1;
    utf8::decode($decoded_text);
    
  • Encode:: implementation:

    use Encode qw( decode_utf8 );
    
    my $decoded_text =
       eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
          // $utf8_or_latin1;
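
For the yes/no check mentioned in the question, the same strict decode can be wrapped in a predicate. This is only a sketch: `is_valid_utf8` is a hypothetical helper name, and it uses `decode('UTF-8', ...)` rather than `decode_utf8`, on the assumption that the strict encoding name is wanted.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $decoded = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $decoded ? 1 : 0;
}

print is_valid_utf8("m\x{c3}\x{a6}gtig") ? "UTF-8\n" : "not UTF-8\n";
print is_valid_utf8("m\x{e6}gtig")       ? "UTF-8\n" : "not UTF-8\n";
```

A "yes" only means the bytes are well-formed UTF-8; pure ASCII passes too, since it decodes identically under both encodings.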
    

Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.

  • utf8:: implementation:

    my $utf8 = $decoded_text;
    utf8::encode($utf8);
    
  • Encode:: implementation:

    use Encode qw( encode_utf8 );
    
    my $utf8 = encode_utf8($decoded_text);
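
Putting the two steps together, a minimal sketch of the whole conversion could look like this. The name `to_utf8` is hypothetical, and the sketch assumes the input is either valid UTF-8 or iso-8859-1, as discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes bytes that are either UTF-8 or iso-8859-1
# and returns UTF-8 bytes.
sub to_utf8 {
    my ($bytes) = @_;
    my $text = $bytes;
    # On success the bytes are replaced by code points. On failure the
    # string is left unchanged, which is already the iso-8859-1 decoding.
    utf8::decode($text);
    # Encode the decoded text (code points) as UTF-8.
    utf8::encode($text);
    return $text;
}

my $from_latin1 = to_utf8("m\x{e6}gtig");          # latin1 input
my $from_utf8   = to_utf8("m\x{c3}\x{a6}gtig");    # UTF-8 input
print $from_latin1 eq $from_utf8 ? "same bytes\n" : "different bytes\n";
```

Both inputs end up as the same UTF-8 byte string, and input that is already UTF-8 is not double-encoded because it is decoded first.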
    

Notes

  1. Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:

    • The text is encoded using iso-8859-1 (as opposed to UTF-8),
    • At least one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
      àáâãäåæçèéêëìíîïðñòóôõö÷
      ] is present,
    • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [ðñòóôõö÷] are followed by three of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • None of [øùúûüýþÿ] are present, and
    • None of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ] are present except where previously mentioned.

    (<80>..<9F> are the C1 control characters, which are unprintable and rarely appear in text.)

    In other words, that code is very reliable.
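
    For illustration, here is a small sketch of an input that meets all of the conditions above and is therefore guessed wrong: the iso-8859-1 text "Ã©", whose two bytes happen to be the UTF-8 encoding of "é". The variable names are only illustrative.

```perl
use strict;
use warnings;

# iso-8859-1 bytes for the two characters "Ã" (0xC3) and "©" (0xA9).
my $latin1_bytes = "\x{c3}\x{a9}";

my $decoded = $latin1_bytes;
my $looks_like_utf8 = utf8::decode($decoded);

# The bytes form a valid UTF-8 sequence for U+00E9, so the guess
# yields one character instead of the intended two.
printf "valid UTF-8: %s, length %d, code point U+%04X\n",
    ($looks_like_utf8 ? "yes" : "no"), length($decoded), ord($decoded);
```

    Such sequences are what double-encoded text looks like, so in ordinary iso-8859-1 prose they are rare.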

Ballentine answered 4/4, 2014 at 17:2 Comment(5)
How come encoding utf8 to utf8 doesn't trash it? It does in my OP.Sang
In your example it works, and I don't understand why, as it fails in mine.Sang
I don't encode UTF-8 bytes using UTF-8. I don't encode UTF-8 bytes, period. I encode decoded text (Unicode code points) using UTF-8.Ballentine
@ikegami, hi, could you please elaborate on why decoding using iso-8859-1 is a no-op? Why not simply add the following to your Encode:: implementation: my $decoded_text = eval { ... } // decode ("iso-8859-1", $utf8_or_latin1);? ThanksHemimorphic
@Hemimorphic Re "why decoding using iso-8859-1 is a no-op?", Because Unicode is an extension of iso-8859-1. Specifically, iso-8859-1 0 is Code Point 0, 1 is 1, 2 is 2, ..., and FF is FF.Ballentine
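
The no-op claim can be checked with a short sketch: decode every possible byte value as iso-8859-1 and compare the result with the input. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# All 256 byte values, 0x00 through 0xFF.
my $bytes = join '', map { chr } 0x00 .. 0xFF;

# Decoding maps byte N to code point N, so the result compares equal.
my $text = decode('iso-8859-1', $bytes);

print $text eq $bytes ? "identical\n" : "different\n";
```

The internal storage of the two strings may differ, but `eq` compares code points, and those are the same.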
