How to convert between the Unicode forms: string, name, number

Asked 9/12, 2021 at 16:44 Answered 9/12, 2021 at 21:32

I have lately been using Unicode more often and wondered whether there is a command-line tool to convert a character between its forms.

It would be nice to be able to say:

uni_convert "☃" --string

and learn that the character is named "SNOWMAN" in Unicode.

Isocracy answered 9/12, 2021 at 16:44 Comment(0)

Perl's Unicode-Tussle distribution comes with the useful uniprops.

$ uniprops '☃'
U+2603 ‹☃› \N{SNOWMAN}
...

$ uniprops 'U+2603'
U+2603 ‹☃› \N{SNOWMAN}
...

$ uniprops 'SNOWMAN'
U+2603 ‹☃› \N{SNOWMAN}
...

If you're writing code, you'll want charnames.

Want	Have	Code
`$code`	`$char`	`ord($char)`
`$code`	`$name`	`charnames::vianame($name)`
`$char`	`$code`	`chr($code)`
`$char`	`$name`	`chr(charnames::vianame($name))`
`$name`	`$code`	`charnames::viacode($code)`
`$name`	`$char`	`charnames::viacode(ord($char))`

vianame accepts official aliases (e.g. LF for LINE FEED). You'll need to parse U+ notation yourself if you wish to accept it. ($code = hex(s/^U\+//r);)
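To tie the table and the U+ note together, here is a minimal sketch of a normalizer that accepts any of the three forms. The helper names to_code and describe are made up for illustration; only ord, hex, charnames::vianame and charnames::viacode come from Perl itself. It assumes the input has already been decoded into a Perl character string.

```perl
use strict;
use warnings;
use charnames qw( :full );

# Hypothetical helper: turn "U+2603", a name/alias, or a single
# character into a code point. Returns undef for unknown names.
sub to_code {
   my ($in) = @_;
   return hex($1)  if $in =~ /\AU\+([0-9A-Fa-f]{1,6})\z/;   # U+ notation
   return ord($in) if length($in) == 1;                      # Literal character
   my $code = charnames::vianame($in);                       # Name or alias
   return $code;
}

# Hypothetical helper: format a code point with its official name.
sub describe {
   my ($code) = @_;
   my $name = charnames::viacode($code) // '(unnamed)';
   return sprintf "U+%04X %s", $code, $name;
}

# All three forms of the snowman resolve to the same code point.
print describe(to_code($_)), "\n" for "U+2603", "SNOWMAN", "\x{2603}";
```

Each of the three inputs should print U+2603 SNOWMAN. Checking the U+ form first keeps that notation away from vianame, so the name lookup only ever sees names and aliases.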

Example:

use strict;
use warnings;
use feature      qw( say );
use experimental qw( regex_sets );     # Safe. Optional since 5.36.

use utf8;                              # Source encoded using UTF-8.
use open ":std", ":encoding(UTF-8)";   # Terminal provides/expects UTF-8.

use charnames qw( :full );
use Encode    qw( decode_utf8 );

@ARGV == 1
   or die("usage\n");

my $s = decode_utf8($ARGV[0]);

for my $cp ( unpack "W*", $s ) {
   my $ch = chr($cp);
   if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) {   # Not sure if good enough.
      printf "‹%s› ", $ch;
   } else {
      print "--- ";
   }

   printf "U+%X ", $cp;

   say charnames::viacode($cp);
}

$ uni_id ☃
‹☃› U+2603 SNOWMAN

$ uni_id çà
‹ç› U+E7 LATIN SMALL LETTER C WITH CEDILLA
‹à› U+E0 LATIN SMALL LETTER A WITH GRAVE

Other resources:

Unicode::UCD

Provides access to the information found in the Unicode Character Database.
The Unicode Standard is more than characters and properties.
perluniprops
unichars from Unicode-Tussle (e.g. unichars '\p{Hiragana}')
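As a rough illustration of what Unicode::UCD offers beyond names, the sketch below pulls a few fields out of charinfo. The helper name ucd_summary and the choice of fields are only for illustration; the hash keys used (code, name, category, block) are among those documented for charinfo.

```perl
use strict;
use warnings;
use Unicode::UCD qw( charinfo );

# Hypothetical helper: summarize a few UCD fields for one code point.
sub ucd_summary {
   my ($code) = @_;
   my $info = charinfo($code);
   # The documentation says unassigned code points yield undef.
   return unless $info && defined $info->{name};
   return sprintf "U+%s %s [category=%s, block=%s]",
      $info->{code}, $info->{name}, $info->{category}, $info->{block};
}

print ucd_summary(0x2603), "\n";
# U+2603 SNOWMAN [category=So, block=Miscellaneous Symbols]
```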

Pestilential answered 9/12, 2021 at 21:32 Comment(0)

I separated the code into a file and created a repo: https://github.com/poti1/uni_convert

Isocracy answered 9/12, 2021 at 16:44 Comment(4)

Why not just use an actual perl script file instead of a huge one-liner wrapped in a shell function? – Luminosity 9/12, 2021 at 16:55

Oh, and Term::ANSIColor is useful instead of hardcoded escape sequences. – Luminosity 9/12, 2021 at 16:57

Instead of having a file per script or function, I tend to add these to my bashrc. When the script is big enough (like now 😅), I'd move it to a separate file. – Isocracy 9/12, 2021 at 17:9

I've seen Term::ANSIColor used by others. I guess it's better than using the escape characters? – Isocracy 9/12, 2021 at 17:10

Here is an awk script to do that.

Download this file from unicode.org that provides the latest names.

Then:

# The names file writes code points as upper-case hex, zero-padded to 4 digits.
q=$(printf '%04X\n' \'☃)
awk '/^[[:xdigit:]]+/{
    str=$0
    sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
    names[$1]=str
}
END{ print names[q] }
' q="$q" names.txt

Prints:

SNOWMAN

If you want to go the other way:

cp=$(awk '/^[[:xdigit:]]+/{
    str=$0
    sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
    other_names[str]=$1
}
END{ print other_names[q] }
' q="SNOWMAN" names.txt)

# \U takes up to 8 hex digits, so code points above U+FFFF work too (\u stops at 4).
echo -e "\U${cp}"

Prints:

☃

With GNU awk you can convert the hex code point to a number with strtonum and print the character from within awk. That lets a single script go either way, depending on whether you define q (code point to name) or r (name to character):

gawk '/^[[:xdigit:]]+/{
    str=$0
    sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
    names[$1]=str
    other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' r='SNOWMAN' names.txt
☃

gawk '/^[[:xdigit:]]+/{
    str=$0
    sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
    names[$1]=str
    other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' q=$(printf '%04X\n' \'☃) names.txt
SNOWMAN

Exorbitant answered 9/12, 2021 at 17:8 Comment(1)

Re "Download this file from unicode.org that provides the latest names.", Don't forget the aliases – Pestilential 9/12, 2021 at 21:35
