How to detect latin1 and UTF-8?

#!/usr/bin/perl

use warnings;
use strict;

use Encode qw(decode encode);
use Data::Dumper;

my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";

my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);

print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;

if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }

Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings^[1].

As such, the solution is to try decoding it as UTF-8. If that fails, decode it as iso-8859-1 instead. Since decoding iso-8859-1 is a no-op in Perl (each byte already equals the code point it represents), I'll be skipping that step.

utf8:: implementation:

my $decoded_text = $utf8_or_latin1;
utf8::decode($decoded_text);
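
utf8::decode returns true on success and false if the string isn't well-formed UTF-8 (in which case the string is left unchanged), so its return value also tells you which encoding was guessed. A minimal sketch using the two strings from the question; the helper name guess_and_decode is hypothetical, and pure-ASCII input is reported as UTF-8 since it decodes identically either way:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded text and the guessed encoding.
sub guess_and_decode {
    my ($bytes) = @_;
    my $text = $bytes;                     # work on a copy
    my $was_utf8 = utf8::decode($text);    # false (and $text unchanged) if not valid UTF-8
    return ($text, $was_utf8 ? 'UTF-8' : 'iso-8859-1');
}

my ($t1, $enc1) = guess_and_decode("m\x{e6}gtig");          # iso-8859-1 bytes
my ($t2, $enc2) = guess_and_decode("m\x{c3}\x{a6}gtig");    # UTF-8 bytes

print "$enc1 $enc2\n";                                      # iso-8859-1 UTF-8
print $t1 eq $t2 ? "same text\n" : "different text\n";      # same text
```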

Encode:: implementation:

use Encode qw( decode_utf8 );

my $decoded_text =
   eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
      // $utf8_or_latin1;

Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.

utf8:: implementation:

my $utf8 = $decoded_text;
utf8::encode($utf8);

Encode:: implementation:

use Encode qw( encode_utf8 );

my $utf8 = encode_utf8($decoded_text);
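
Putting the decode and encode steps together, both strings from the question end up as the same UTF-8 bytes. This is only a sketch: the helper name to_utf8 is hypothetical, and it assumes the input is always either valid UTF-8 or iso-8859-1.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise bytes that are either UTF-8 or iso-8859-1 to UTF-8.
sub to_utf8 {
    my ($bytes) = @_;
    my $text = $bytes;
    utf8::decode($text);    # on failure $text is left as is, i.e. treated as iso-8859-1
    utf8::encode($text);    # encode the decoded text as UTF-8
    return $text;
}

my $x = "m\x{e6}gtig";          # iso-8859-1
my $y = "m\x{c3}\x{a6}gtig";    # UTF-8

print to_utf8($x) eq $y ? "match\n" : "no match\n";    # match
print to_utf8($y) eq $y ? "match\n" : "no match\n";    # match
```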

Notes

Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:
- The text is encoded using iso-8859-1 (as opposed to UTF-8),
- At least one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
  àáâãäåæçèéêëìíîïðñòóôõö÷
  ] is present,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are followed by three of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present, and
- None of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ] are present except where previously mentioned.
(<80>..<9F> are the C1 control characters in the IANA ISO-8859-1 charset; the ISO/IEC 8859-1 standard itself leaves them unassigned. Either way, they are unprintable.)

In other words, that code is very reliable.
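
For illustration, here is what one of those rare wrong guesses looks like. The iso-8859-1 text "Ã¦" consists of the bytes C3 A6, which also happen to be the valid UTF-8 encoding of "æ", so it gets decoded as UTF-8. The helper name decoded_length is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of characters after the UTF-8-or-latin1 guess.
sub decoded_length {
    my ($bytes) = @_;
    my $text = $bytes;
    utf8::decode($text);    # leaves $text unchanged if it isn't valid UTF-8
    return length($text);
}

print decoded_length("caf\x{e9}"), "\n";       # 4: E9 is not followed by continuation bytes
print decoded_length("\x{c3}\x{a6}"), "\n";    # 1: "Ã¦" in iso-8859-1 is taken for UTF-8 "æ"
```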
