In Python, how to list all characters matched by POSIX extended regex [:space:]?
Is there a programmatic way of extracting the Unicode code points covered by [:space:]?
Using a generator instead of a list comprehension, and xrange instead of range:
>>> s = u''.join(unichr(c) for c in xrange(0x10ffff+1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
Whoops: in general use sys.maxunicode.
>>> import sys
>>> s = u''.join(unichr(c) for c in xrange(sys.maxunicode+1))
>>> import re
>>> re.findall(r'\s', s)
[u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u' ']
Whoops: Ummm what about "no-break space" etc?
>>> re.findall(r'\s', s, re.UNICODE)
[u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u'\x1c', u'\x1d', u'\x1e', u'\x1f', u' ', u'\x85', u'\xa0', u'\u1680', u'\u180e', u'\u2000', u'\u2001', u'\u2002', u'\u2003', u'\u2004', u'\u2005', u'\u2006', u'\u2007', u'\u2008', u'\u2009', u'\u200a', u'\u2028', u'\u2029', u'\u202f', u'\u205f', u'\u3000']
What is all that stuff? unicodedata.name is your friend:
>>> from unicodedata import name
>>> for c in re.findall(r'\s', s, re.UNICODE):
...     print repr(c), name(c, '')
...
u'\t'
u'\n'
u'\x0b'
u'\x0c'
u'\r'
u'\x1c'
u'\x1d'
u'\x1e'
u'\x1f'
u' ' SPACE
u'\x85'
u'\xa0' NO-BREAK SPACE
u'\u1680' OGHAM SPACE MARK
u'\u180e' MONGOLIAN VOWEL SEPARATOR
u'\u2000' EN QUAD
u'\u2001' EM QUAD
u'\u2002' EN SPACE
u'\u2003' EM SPACE
u'\u2004' THREE-PER-EM SPACE
u'\u2005' FOUR-PER-EM SPACE
u'\u2006' SIX-PER-EM SPACE
u'\u2007' FIGURE SPACE
u'\u2008' PUNCTUATION SPACE
u'\u2009' THIN SPACE
u'\u200a' HAIR SPACE
u'\u2028' LINE SEPARATOR
u'\u2029' PARAGRAPH SEPARATOR
u'\u202f' NARROW NO-BREAK SPACE
u'\u205f' MEDIUM MATHEMATICAL SPACE
u'\u3000' IDEOGRAPHIC SPACE
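The transcript above is Python 2. As a rough sketch of the same contrast on Python 3 (assuming a str pattern, where Unicode matching is the default and re.ASCII opts back into the narrow set), something like the following should work. The variable names are only illustrative.

```python
import re
import sys

# Every code point as one string (about 1.1 million characters, so this takes a moment).
all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))

# re.ASCII restricts \s to the six ASCII whitespace characters.
ascii_ws = re.findall(r'\s', all_chars, re.ASCII)

# Without the flag, a str pattern uses Unicode matching in Python 3.
unicode_ws = re.findall(r'\s', all_chars)

print(ascii_ws)
print(len(unicode_ws))
```

The exact size of the Unicode list depends on the Unicode database bundled with the interpreter, so no count is shown here.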
It'd be a bit hard, as Python's built-in re module doesn't support POSIX character classes.
The PyPI regex module does, however (you have to install it yourself).
The only way I can think of to extract all the code points that match [[:space:]] is a bit ugly: generate a string of every Unicode character, then match it against [[:space:]].
I'm sure there's a better way to generate stri (the string of all Unicode characters) in my code below, so I'm open to improvement there!
chrs = [unichr(c) for c in range(0x10ffff+1)] # <-- eww that's not very fast!
# also we go up to 0x10ffff (inclusive) because that's what help(unichr) says.
# (on a narrow Python 2 build, unichr() raises ValueError above 0xffff)
stri = u''.join(chrs)
import re
# example if we wanted things matching `\s` with `re` module:
re.findall(r'\s', stri)
# --> [u'\t', u'\n', u'\x0b', u'\x0c', u'\r', u' ']
# If I had the regex module...
# regex.findall(r"[[:space:]]", stri)
(edit - modified variable name from str to stri to avoid overwriting the built-in str type(!))
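For anyone who does have the regex module, the commented-out call above might look roughly like this in Python 3. This is only a sketch: it assumes the third-party regex package from PyPI is installed, and posix_space_chars is a hypothetical helper name, not part of any library.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # fall back gracefully when it is not installed


def posix_space_chars(text):
    """Return the characters in text matched by [[:space:]].

    Returns None if the regex module is not available.
    """
    if regex is None:
        return None
    return regex.findall(r'[[:space:]]', text)


print(posix_space_chars('a b\tc'))
```

Passing the full string of all code points instead of the short sample would give the complete list for whatever Unicode version the installed regex module uses.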
help(unichr) says unichr(i) is valid for 0 <= i <= 0x10ffff, so there's no issue that it's hard-coded. The only qualm I have is that it seems a waste to spend ages generating a list only to convert it into a flat string, and the generation of the list (chrs) seems to take a noticeable second or two - I just wonder if there's an equivalent of string.ascii_letters for unicode. – Exigent
To update the answer for Python 3:
import re
import sys
s = ''.join(chr(c) for c in range(sys.maxunicode+1))
ws = ''.join(re.findall(r'\s', s))
>>> ws.isspace()
True
Here are the Unicode code points found:
>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
And we see these are all regarded as whitespace by the str.strip() method:
>>> len(ws.strip())
0
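Since isspace() and strip() agree with the regex result here, a regex-free cross-check is also possible. The sketch below filters every code point with str.isspace() and compares that with what \s finds. On CPython the two are expected to agree, but treat that as something to verify on your own interpreter rather than a guarantee.

```python
import re
import sys

all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))

# Characters for which str.isspace() is true, in code point order.
ws_isspace = ''.join(ch for ch in all_chars if ch.isspace())

# Characters matched by \s with the default (Unicode) matching.
ws_regex = ''.join(re.findall(r'\s', all_chars))

# Report whether the two approaches produced the same characters.
print(ws_isspace == ws_regex)
```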
Here's some more information on the characters:
from unicodedata import name, category
for char in ws:
print(hex(ord(char)), repr(char), category(char), name(char, None))
In Python 3.5, for me, this prints:
0x9 '\t' Cc None
0xa '\n' Cc None
0xb '\x0b' Cc None
0xc '\x0c' Cc None
0xd '\r' Cc None
0x1c '\x1c' Cc None
0x1d '\x1d' Cc None
0x1e '\x1e' Cc None
0x1f '\x1f' Cc None
0x20 ' ' Zs SPACE
0x85 '\x85' Cc None
0xa0 '\xa0' Zs NO-BREAK SPACE
0x1680 '\u1680' Zs OGHAM SPACE MARK
0x2000 '\u2000' Zs EN QUAD
0x2001 '\u2001' Zs EM QUAD
0x2002 '\u2002' Zs EN SPACE
0x2003 '\u2003' Zs EM SPACE
0x2004 '\u2004' Zs THREE-PER-EM SPACE
0x2005 '\u2005' Zs FOUR-PER-EM SPACE
0x2006 '\u2006' Zs SIX-PER-EM SPACE
0x2007 '\u2007' Zs FIGURE SPACE
0x2008 '\u2008' Zs PUNCTUATION SPACE
0x2009 '\u2009' Zs THIN SPACE
0x200a '\u200a' Zs HAIR SPACE
0x2028 '\u2028' Zl LINE SEPARATOR
0x2029 '\u2029' Zp PARAGRAPH SEPARATOR
0x202f '\u202f' Zs NARROW NO-BREAK SPACE
0x205f '\u205f' Zs MEDIUM MATHEMATICAL SPACE
0x3000 '\u3000' Zs IDEOGRAPHIC SPACE
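To make the category breakdown easier to see, the same data can be grouped by general category. This is a minimal sketch; the results depend on the Unicode database shipped with your interpreter, which is presumably why U+180E shows up in the Python 2 output earlier but not in the 3.5 listing above.

```python
import re
import sys
from collections import defaultdict
from unicodedata import category

all_chars = ''.join(chr(c) for c in range(sys.maxunicode + 1))

# Map each general category (Cc, Zs, Zl, Zp, ...) to the code points \s matched.
by_category = defaultdict(list)
for ch in re.findall(r'\s', all_chars):
    by_category[category(ch)].append('U+%04X' % ord(ch))

for cat in sorted(by_category):
    print(cat, len(by_category[cat]), ' '.join(by_category[cat]))
```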
\s matches ` \t\n\r\f\v`. – Castile
unicodedata module sadly doesn't offer a facility for enumerating or iterating over a set of code points, certainly not by property. – Arouse
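Given that limitation, one workaround is a brute-force scan over every code point, keeping those whose general category matches. The sketch below does that in Python 3; chars_in_category is a hypothetical helper name, and it only filters by general category, not by other properties such as White_Space.

```python
import sys
from unicodedata import category


def chars_in_category(wanted):
    """Brute-force scan: yield every character whose general category is `wanted`."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if category(ch) == wanted:
            yield ch


# Example: all space separators (category Zs) known to this interpreter.
space_separators = list(chars_in_category('Zs'))
print([hex(ord(ch)) for ch in space_separators])
```

Note that Zs alone misses the control characters such as tab and newline, which are in category Cc, so it is not a drop-in replacement for \s.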