Efficiently list all characters in a given Unicode category


Asked 9/1, 2013 at 20:30 Answered 9/1, 2013 at 20:38

Often one wants to list all characters in a given Unicode category.

It is possible to produce this list by iterating over all Unicode code-points and testing for the desired category (Python 3):

import unicodedata
[c for c in map(chr, range(0x110000)) if unicodedata.category(c) in ('Ll',)]

or using regexes,

import re
re.findall(r'\s', ''.join(map(chr, range(0x110000))))

But these methods are slow. Is there a way to look up a list of characters in the category without having to iterate over all of them?

Northeaster answered 9/1, 2013 at 20:30 Comment(1)

By the way, if you just want to see which characters are in which categories, I made a page with all the characters. – Wallboard 16/12, 2020 at 20:1

If you need to do this often, it's easy enough to build yourself a re-usable map:

import sys
import unicodedata
from collections import defaultdict

unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

From then on, use that map to get the list of characters for a given category:

lowercase_letters = unicode_category['Ll']  # 'Ll' is "Letter, lowercase", not all alphabetic characters

If this is too costly for start-up time, consider dumping that structure to a file; loading this mapping from a JSON file or other quick-to-parse-to-dict format should not be too painful.
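As a rough sketch of the file-based idea above: the helper names (`build_category_ranges`, `save_ranges`, `load_ranges`, `chars_in`) and the JSON layout are my own assumptions, not anything from a library. It stores runs of consecutive code points as `[start, end]` pairs rather than individual characters, which should keep the file much smaller, and it records `unicodedata.unidata_version` so a cache built against a different Unicode version gets rejected.

```python
import json
import sys
import unicodedata
from collections import defaultdict


def build_category_ranges():
    """Scan every code point once and group consecutive ones by category."""
    ranges = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        runs = ranges[unicodedata.category(chr(cp))]
        if runs and runs[-1][1] == cp - 1:
            runs[-1][1] = cp  # extend the current run
        else:
            runs.append([cp, cp])  # start a new run
    return dict(ranges)


def save_ranges(path, ranges):
    """Write the ranges to a JSON file (hypothetical layout)."""
    payload = {"unidata_version": unicodedata.unidata_version, "ranges": ranges}
    with open(path, "w", encoding="ascii") as f:
        json.dump(payload, f)


def load_ranges(path):
    """Read the ranges back, refusing a cache from another Unicode version."""
    with open(path, encoding="ascii") as f:
        payload = json.load(f)
    if payload["unidata_version"] != unicodedata.unidata_version:
        raise ValueError("cache was built for a different Unicode version")
    return payload["ranges"]


def chars_in(ranges, category):
    """Expand the stored ranges back into a list of characters."""
    return [chr(cp)
            for start, end in ranges.get(category, [])
            for cp in range(start, end + 1)]
```

Typical use would be to call `build_category_ranges()` and `save_ranges()` once, then `load_ranges()` at start-up and `chars_in(ranges, 'Ll')` when a category is needed. Whether this actually beats rebuilding the map depends on your machine, so it is worth timing both.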

Once you have the mapping, looking up a category is a single dictionary access, which takes constant time on average.

Sewerage answered 9/1, 2013 at 20:38 Comment(1)

@m.kocikowski: unless you are using Python 3, which the OP of the question clearly is (it'd fail in Python 2 otherwise). – Sewerage 4/11, 2014 at 11:58
