In Python, how to list all characters matched by POSIX extended regex `[:space:]`?
H

3

11

In Python, how to list all characters matched by POSIX extended regex [:space:]?

Is there a programmatic way of extracting the Unicode code points covered by [:space:]?

Hypnology answered 19/1, 2012 at 5:16 Comment(4)
Are you using a specific module? \s matches ` \t\n\r\f\v`. – Castile
What do you need this information for? If it's just curiosity, you can grep the Unicode database for all characters matching the whitespace property. The Python unicodedata module sadly doesn't offer a facility for enumerating or iterating over a set of code points, certainly not by property. – Arouse
@Problemaniac, the GitHub link is broken. – Ammadas
@Ammadas I added code explicitly. – Hypnology
D
19

Using a generator instead of a list comprehension, and xrange instead of range:

>>> s = u''.join(unichr(c) for c in xrange(0x10ffff+1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Whoops: in general use sys.maxunicode.

>>> import sys
>>> s = u''.join(unichr(c) for c in xrange(sys.maxunicode+1))
>>> import re
>>> re.findall(r'\s', s)
[u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u' ']

Whoops: what about "no-break space" and the other non-ASCII whitespace? Pass re.UNICODE:

>>> re.findall(r'\s', s, re.UNICODE)
[u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u'\x1c', u'\x1d', u'\x1e', u'\x1f', u' ',
 u'\x85', u'\xa0', u'\u1680', u'\u180e', u'\u2000', u'\u2001', u'\u2002',
 u'\u2003', u'\u2004', u'\u2005', u'\u2006', u'\u2007', u'\u2008', u'\u2009',
 u'\u200a', u'\u2028', u'\u2029', u'\u202f', u'\u205f', u'\u3000']
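
For readers on Python 3, here is a minimal sketch of the same comparison. It assumes Python 3.3 or later, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \s to the six ASCII whitespace characters. The variable names are illustrative only, and the exact number of Unicode matches depends on the Unicode database shipped with the interpreter.

```python
import re
import sys

# Every code point as a one-character string (Python 3: chr and range).
all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))

# Default for str patterns in Python 3: Unicode-aware matching.
unicode_ws = re.findall(r'\s', all_chars)

# re.ASCII restricts \s to [ \t\n\r\f\v].
ascii_ws = re.findall(r'\s', all_chars, re.ASCII)

print(ascii_ws)
print(len(unicode_ws), 'characters match without re.ASCII')
```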

What is all that stuff? unicodedata.name is your friend:

>>> from unicodedata import name
>>> for c in re.findall(r'\s', s, re.UNICODE):
...     print repr(c), name(c, '')
...
u'\t'
u'\n'
u'\x0b'
u'\x0c'
u'\r'
u'\x1c'
u'\x1d'
u'\x1e'
u'\x1f'
u' ' SPACE
u'\x85'
u'\xa0' NO-BREAK SPACE
u'\u1680' OGHAM SPACE MARK
u'\u180e' MONGOLIAN VOWEL SEPARATOR
u'\u2000' EN QUAD
u'\u2001' EM QUAD
u'\u2002' EN SPACE
u'\u2003' EM SPACE
u'\u2004' THREE-PER-EM SPACE
u'\u2005' FOUR-PER-EM SPACE
u'\u2006' SIX-PER-EM SPACE
u'\u2007' FIGURE SPACE
u'\u2008' PUNCTUATION SPACE
u'\u2009' THIN SPACE
u'\u200a' HAIR SPACE
u'\u2028' LINE SEPARATOR
u'\u2029' PARAGRAPH SEPARATOR
u'\u202f' NARROW NO-BREAK SPACE
u'\u205f' MEDIUM MATHEMATICAL SPACE
u'\u3000' IDEOGRAPHIC SPACE
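
If the goal is to reuse such a list elsewhere, one option is to collapse the code points into ranges and emit an explicit character class. The following is only a sketch in Python 3; to_ranges and to_char_class are hypothetical helper names, not part of any library, and the sample input is a shortened subset of the characters listed above.

```python
import re

def to_ranges(code_points):
    """Collapse integers into sorted, inclusive (start, end) runs."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges

def _escape(cp):
    # \uXXXX for the BMP, \UXXXXXXXX beyond it (both understood by re in 3.3+).
    return '\\u%04x' % cp if cp <= 0xFFFF else '\\U%08x' % cp

def to_char_class(code_points):
    """Build a regex character class such as [\\u0009-\\u000d\\u0020]."""
    parts = []
    for start, end in to_ranges(code_points):
        parts.append(_escape(start) if start == end
                     else _escape(start) + '-' + _escape(end))
    return '[' + ''.join(parts) + ']'

# Shortened sample: ASCII whitespace, NO-BREAK SPACE, and three typographic spaces.
sample = [0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x20, 0xa0, 0x2000, 0x2001, 0x2002]
char_class = to_char_class(sample)
print(char_class)
print(bool(re.fullmatch(char_class, '\u2001')))
```

Feeding it the full list of matched characters (via ord) would give a class that can be pasted into tools that lack a Unicode-aware \s.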
Duchess answered 19/1, 2012 at 8:4 Comment(0)
E
3

This is a bit hard, because Python's built-in re module doesn't support POSIX character classes.

The regex module on PyPI does, however (you have to install it yourself).

The only way I can think of to extract all Unicode characters that match [[:space:]] is a bit ugly:

  • generate a string of all Unicode characters
  • match it against [[:space:]].

I'm sure there's a better way to generate stri (the string of all Unicode characters) in my code below, so I'm open to improvement there!

chrs = [unichr(c) for c in range(0x10ffff+1)] # <-- eww that's not very fast!
# also we go up to 0x10ffff (inclusive) because that's what help(unichr) says.
# (on a narrow build unichr() stops at 0xffff; sys.maxunicode covers both cases.)
stri = u''.join(chrs)

import re
# example if we wanted things matching `\s` with the `re` module:
re.findall(r'\s', stri)
# --> [u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u' ']

# If I had the regex module...
# regex.findall(r"[[:space:]]", stri)

(edit - renamed the variable from str to stri to avoid shadowing the built-in str type(!))
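
For anyone who does have the third-party regex module installed, the two steps above might look like this in Python 3. This is a sketch under assumptions: it relies on regex accepting the [[:space:]] syntax for str patterns, and the exact set returned depends on the regex version and the Unicode data it ships with, so no output is shown.

```python
import sys

try:
    import regex  # third-party: pip install regex
except ImportError:
    regex = None

# Step 1: a string containing every code point.
all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))

# Step 2: match it against the POSIX class, if the module is available.
if regex is not None:
    posix_space = regex.findall(r'[[:space:]]', all_chars)
    print([hex(ord(c)) for c in posix_space])
else:
    posix_space = None
    print('regex module not installed')
```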

Exigent answered 19/1, 2012 at 6:27 Comment(2)
Then where would you like the range to end? It's problematic in that not all code points in the range are valid, though. – Arouse
Well, help(unichr) says unichr(i) valid for 0 <= i <= 0x10ffff so there's no issue that it's hard-coded. The only qualm I have is that it seems a waste to spend ages generating a list only to convert it into a flat string, and the generation of the list (chrs) seems to take a noticeable second or two - I just wonder if there's an equivalent of string.ascii_letters for unicode. – Exigent
A
2

To update the answer for Python 3:

>>> import re
>>> import sys
>>> s = ''.join(chr(c) for c in range(sys.maxunicode+1))
>>> ws = ''.join(re.findall(r'\s', s))
>>> ws.isspace()
True

Here are the Unicode code points found:

>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'

And we see these are all regarded as whitespace by the str.strip() method:

>>> len(ws.strip())
0
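
Since the matches all satisfy isspace(), the same set can also be collected without re at all. The sketch below assumes CPython 3, where \s for str patterns and str.isspace() appear to rely on the same whitespace test; treat that as an observation about CPython rather than a documented guarantee.

```python
import re
import sys

all_chars = [chr(c) for c in range(sys.maxunicode + 1)]

# Enumerate whitespace directly with the string method...
by_isspace = [c for c in all_chars if c.isspace()]

# ...and with the regex, for comparison.
by_regex = re.findall(r'\s', ''.join(all_chars))

print(by_isspace == by_regex)
```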

Here's some more information on the characters:

from unicodedata import name, category
for char in ws:
    print(hex(ord(char)), repr(char), category(char), name(char, None))

In Python 3.5, for me, this prints:

0x9 '\t' Cc None
0xa '\n' Cc None
0xb '\x0b' Cc None
0xc '\x0c' Cc None
0xd '\r' Cc None
0x1c '\x1c' Cc None
0x1d '\x1d' Cc None
0x1e '\x1e' Cc None
0x1f '\x1f' Cc None
0x20 ' ' Zs SPACE
0x85 '\x85' Cc None
0xa0 '\xa0' Zs NO-BREAK SPACE
0x1680 '\u1680' Zs OGHAM SPACE MARK
0x2000 '\u2000' Zs EN QUAD
0x2001 '\u2001' Zs EM QUAD
0x2002 '\u2002' Zs EN SPACE
0x2003 '\u2003' Zs EM SPACE
0x2004 '\u2004' Zs THREE-PER-EM SPACE
0x2005 '\u2005' Zs FOUR-PER-EM SPACE
0x2006 '\u2006' Zs SIX-PER-EM SPACE
0x2007 '\u2007' Zs FIGURE SPACE
0x2008 '\u2008' Zs PUNCTUATION SPACE
0x2009 '\u2009' Zs THIN SPACE
0x200a '\u200a' Zs HAIR SPACE
0x2028 '\u2028' Zl LINE SEPARATOR
0x2029 '\u2029' Zp PARAGRAPH SEPARATOR
0x202f '\u202f' Zs NARROW NO-BREAK SPACE
0x205f '\u205f' Zs MEDIUM MATHEMATICAL SPACE
0x3000 '\u3000' Zs IDEOGRAPHIC SPACE
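
To summarise a listing like the one above, the matches can be grouped by general category. This is a small sketch for Python 3; by_category is just an illustrative name, and the counts will follow whatever Unicode version the interpreter ships with.

```python
import re
import sys
from collections import defaultdict
from unicodedata import category

all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))
ws = re.findall(r'\s', all_chars)

# Map each general category (Cc, Zs, Zl, Zp) to its code points.
by_category = defaultdict(list)
for char in ws:
    by_category[category(char)].append('U+%04X' % ord(char))

for cat in sorted(by_category):
    print(cat, len(by_category[cat]), ' '.join(by_category[cat]))
```

The grouping makes it easy to see that the unnamed entries are control characters (Cc), while the named ones are separators (Zs, Zl, Zp).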
Armindaarming answered 19/6, 2016 at 2:17 Comment(0)
