unicode table information about a character in python
B

3

3

Is there a way in Python to get the technical information for a given character as it's displayed in the Unicode table? (cf. https://unicode-table.com/en/)

Example: for the letter "Ȅ"

  • Name > Latin Capital Letter E with Double Grave
  • Unicode number > U+0204
  • HTML-code > &#516;
  • Block > Latin Extended-B
  • Lowercase > ȅ

What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").

Roughly:
input = a Unicode number
output = corresponding information

The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.

Thank you.

Bromo answered 2/1, 2018 at 9:25 Comment(1)
Does unicodedata suffice? From the looks of it, it does not tell everything about each code point, but it's sure a lot.Antirrhinum
A
5

The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.

Fortunately UnicodeData.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 fields separated by semicolons, which makes it ideal for parsing. Using the description of the fields on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class attributes from that list; the meaning of each field is explained on that same page.
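
To illustrate the 15-field layout before the full program, here is a minimal sketch that splits a single line into a named record. The field names and the helper parse_line are my own choices for illustration, and the sample string is a line in the file's format for U+0204; check it against your copy of UnicodeData.txt.

```python
from collections import namedtuple

# Illustrative field names, in the order of the 15 columns of UnicodeData.txt
FIELDS = ('code name category combining bidirectional decomposition '
          'decimal digit numeric mirrored old_name comment '
          'uppercase lowercase titlecase')
UnicodeRecord = namedtuple('UnicodeRecord', FIELDS)

def parse_line(line):
    """Split one UnicodeData.txt line into a 15-field record of strings."""
    parts = line.rstrip('\r\n').split(';')
    if len(parts) != 15:
        raise ValueError('expected 15 fields, got %d' % len(parts))
    return UnicodeRecord(*parts)

# A line in the file's format for U+0204 (assumed sample)
sample = ('0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;'
          '0045 030F;;;;N;;;;0205;')
record = parse_line(sample)
print(record.name)                 # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(int(record.lowercase, 16))   # 517, i.e. 0x205
```

All fields stay strings here; the classes below convert them to integers and booleans where that makes sense.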

Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
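
If you would rather fetch the two files from a script, something along these lines may work. This is only a sketch: the HTTPS base URL is my assumption (the links above use FTP), and unidata_url and fetch_if_missing are hypothetical helper names, not part of any library.

```python
import os
try:
    from urllib.request import urlretrieve   # Python 3
except ImportError:
    from urllib import urlretrieve           # Python 2

# Assumption: the UNIDATA directory is also served over HTTPS at this address
BASE_URL = 'https://www.unicode.org/Public/UNIDATA/'

def unidata_url(filename):
    """Build the download URL for a file in the UNIDATA directory."""
    return BASE_URL + filename

def fetch_if_missing(filename, folder='.'):
    """Download filename into folder unless a copy is already there."""
    target = os.path.join(folder, filename)
    if not os.path.exists(target):
        urlretrieve(unidata_url(filename), target)
    return target

# Usage (needs network access):
#   for name in ('UnicodeData.txt', 'Blocks.txt'):
#       fetch_if_missing(name)
```

Existing files are left untouched, so delete them first when you want a newer Unicode version.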

Code (tested with Python 2.7 and 3.6):

# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{'+self.name+'}'

class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{'+self.name+'}'

class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
            # make 100% sure it's sorted, for quicker look-up later
            # (it is usually sorted in the file, but better make sure)
            self.blocklist.sort(key=lambda x: x.first)

    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None

class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org

    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build of Python (in which the Unicode data gets frozen
    to the then 'current' version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):

        # we need this first
        blocklist = BlockList()
        bpos = 0

        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may be a fraction:
                    #  ONE QUARTER ... 1/4
                    # divide as floats so Python 2.7 does not truncate
                    if '/' in rawdata[8]:
                        numerator, denominator = rawdata[8].split('/')
                        parsed.asNumeric = float(numerator) / float(denominator)
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)

    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self,text):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        # 'text' rather than 'str', so the built-in type is not shadowed
        if len(text) > 1:
            return [self.find_code(ord(x)) for x in text]
        else:
            return self.find_code(ord(text))

When loaded, you can now look up a character code with

>>> ul = UnicodeList()     # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}

which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:

>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B

and (as long as you don't get a None) even chain them:

>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}

It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:

>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}

(As tested with Python 3.5.3.)
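
To see which Unicode version your own interpreter was built with, the standard module exposes it as unicodedata.unidata_version. The sketch below (Python 3) also wraps unicodedata.name with a default so that unknown code points give None instead of raising; builtin_name is just an illustrative helper name.

```python
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter
print(unicodedata.unidata_version)

def builtin_name(codepoint):
    """Return the built-in name of a code point, or None if this build has none."""
    return unicodedata.name(chr(codepoint), None)

print(builtin_name(0x204))   # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(builtin_name(0x0))     # None: control characters have no name
```

Whether a recent code point such as 0x1F903 has a name depends on that version, which is the limitation the external data file avoids.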

There are currently two lookup functions defined:

  • find_code(int) looks up character information by codepoint as an integer.
  • find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
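
Note that find_code walks the whole list on every call. If you look up many characters, a dictionary keyed by code point is one possible alternative. The following is a sketch with a stand-in Record class and a hypothetical build_index helper, not a change to the module above:

```python
class Record(object):
    """Stand-in for UnicodeCharacter, reduced to two attributes."""
    def __init__(self, code, name):
        self.code = code
        self.name = name

def build_index(records):
    """Map each code point to the first record that carries it."""
    index = {}
    for rec in records:
        # setdefault keeps the first record if a code point is repeated
        index.setdefault(rec.code, rec)
    return index

records = [Record(0x204, 'LATIN CAPITAL LETTER E WITH DOUBLE GRAVE'),
           Record(0x205, 'LATIN SMALL LETTER E WITH DOUBLE GRAVE')]
index = build_index(records)
print(index[0x205].name)        # LATIN SMALL LETTER E WITH DOUBLE GRAVE
print(index.get(0x110000))      # None for an unknown code point
```

With the real class, build_index(ul.codelist) would give the same kind of mapping.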

After running from unicodelist import UnicodeList (assuming you saved this as unicodelist.py), you can use

>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'

to look up the hex code for any character, and a list comprehension such as

>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']

for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:

>>> l = [hex(ord(x)) for x in 'Hello']

The purpose of this module is to give easy access to other Unicode properties. A longer example:

text = 'Héllo...'
dest = ''
for ch in text:
    info = ul.find_char(ch)
    # keep the character as-is when it has no uppercase mapping
    dest += chr(info.uppercase) if info is not None and info.uppercase is not None else ch
print (dest)

HÉLLO...

and showing a list of properties for a character per your example:

letter = u'Ȅ'
info = ul.find_char(letter)
print ('Name > '+info.name)
print ('Unicode number > U+%04X' % info.code)
print ('Block > '+info.block)
print ('Lowercase > %s' % chr(info.lowercase))

(I left out HTML; these names are not defined in the Unicode standard.)
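
If you do want an HTML code, it can be derived from the code point alone. A possible sketch for Python 3, using the HTML 4 entity table that ships in the standard library's html.entities module (html_code is a name I made up):

```python
from html.entities import codepoint2name

def html_code(codepoint):
    """Return a named HTML 4 entity if one exists, else a numeric reference."""
    name = codepoint2name.get(codepoint)
    if name is not None:
        return '&%s;' % name
    return '&#%d;' % codepoint

print(html_code(0xE8))    # &egrave;
print(html_code(0x204))   # &#516; (no named entity in the HTML 4 table)
```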

Antirrhinum answered 2/1, 2018 at 11:50 Comment(7)
Note to self: I'm not that confident yet about the code in lookup :( It looks fairly un-pythonic to my (untrained) eyes. ... it's better than a loop, though.Antirrhinum
Is it possible to run this code on a whole text, converting the letters to their hex value on the fly? Because hex(ord(u'è')) returns an str type but uclist.lookup takes an int as an argument.Bromo
@LilithM: no, that needs an entirely new program. But converting a single integer to hex in Python should not be a problem.Antirrhinum
Converting an int to hex is no problem, but I was looking for a way to convert a letter to its hex value and get an int, because hex(ord()) produces an str.Bromo
@LilithM: ah, you want the same functionality. It's only a matter of adding the appropriate function(s) to (off the top of my head) class UnicodeCharacter. I'll get back to you on that.Antirrhinum
@LilithM: okay, added a specific lookup for characters and strings – find_char. I must note, though, that this module is not helpful to convert a string into hex! (Nor was it meant to.) As per your original question, I wrote it to look up all possible data for a single character. I'll add a new example, though.Antirrhinum
This is great, thank you. Yes, I'm sorry I didn't specify it was meant to work on a whole text, but I'm already learning so much from your code, so thank you!Bromo
C
3

The unicodedata documentation shows how to do most of this.

The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
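
For the block name, a lookup table built from Blocks.txt is probably the simplest route. Here is a rough sketch with only the first five ranges hard-coded as an excerpt; the names BLOCKS, STARTS and block_name are mine, and a real version would load every range from the file.

```python
from bisect import bisect_right

# Excerpt only: the first few ranges from Blocks.txt, sorted by start
BLOCKS = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0080, 0x00FF, 'Latin-1 Supplement'),
    (0x0100, 0x017F, 'Latin Extended-A'),
    (0x0180, 0x024F, 'Latin Extended-B'),
    (0x0250, 0x02AF, 'IPA Extensions'),
]
STARTS = [first for first, _, _ in BLOCKS]

def block_name(char):
    """Binary-search the sorted ranges; None if char is outside the table."""
    code = ord(char)
    pos = bisect_right(STARTS, code) - 1
    if pos >= 0:
        first, last, name = BLOCKS[pos]
        if first <= code <= last:
            return name
    return None

print(block_name('Ȅ'))        # Latin Extended-B
print(block_name('\u4e00'))   # None: not covered by this excerpt
```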

The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.

So in summary

>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'

The U+%04X formatting is sort-of correct, in that it simply avoids padding and prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require the use of %08X padding in this scenario (notably \U00010000 format in Python).
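
To make the padding difference concrete, here is a small Python 3 sketch; u_plus and python_escape are illustrative names only.

```python
def u_plus(char):
    """U+ notation: at least four hex digits, more when the value needs them."""
    return 'U+%04X' % ord(char)

def python_escape(char):
    """Python escape: \\uXXXX up to U+FFFF, \\UXXXXXXXX (eight digits) above."""
    code = ord(char)
    if code > 0xFFFF:
        return '\\U%08X' % code
    return '\\u%04X' % code

print(u_plus('Ë'))                   # U+00CB
print(u_plus('\U0001F903'))          # U+1F903
print(python_escape('\U0001F903'))   # \U0001F903
```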

Corrugation answered 2/1, 2018 at 11:25 Comment(0)
A
-1

You can do this in several ways:

1- create an API yourself (I can't find anything that does this)
2- create a table in a database or an Excel file
3- load and parse a website that has the data

I think the 3rd way is very easy. Take a look at this page; you can find some information about Unicode characters there.

Get your Unicode number and then find it in the web page using parsing tools like lxml, Scrapy, Selenium, etc.

Abdication answered 2/1, 2018 at 9:46 Comment(2)
For something quick I'll use tripleee's suggestion but ultimately I think the 3rd way is the way to go. Thank you very much.Bromo
No need to parse 3rd party web pages if the data is already freely accessible through unicode.org.Antirrhinum
