'\u00BD' # ½
'\u00B2' # ²
I am trying to understand isdecimal() and isdigit() better; for this it's necessary to understand Unicode numeric value properties. How would I see the numeric value property of, for example, the two characters above?
To get the 'numeric value' contained in the character, you could use the unicodedata.numeric() function:
>>> import unicodedata
>>> unicodedata.numeric('\u00BD')
0.5
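The same module also provides unicodedata.decimal() and unicodedata.digit(); like numeric(), each accepts a default that is returned instead of raising ValueError. As a rough sketch (numeric_properties is a hypothetical helper name used only for illustration, not part of the module), the three values can be viewed side by side:

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined.

    Hypothetical helper for illustration; it only wraps the three
    unicodedata lookups and passes None as the default.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# The two characters from the question, plus an ordinary ASCII digit.
for ch in ['\u00BD', '\u00B2', '7']:
    print('U+{:04X}'.format(ord(ch)), numeric_properties(ch))
```

For '\u00BD' only the numeric value is defined, while '\u00B2' has a digit and a numeric value but no decimal value.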
Use the ord() function to get the integer codepoint, optionally in combination with format() to produce a hexadecimal value:
>>> ord('\u00BD')
189
>>> format(ord('\u00BD'), '04x')
'00bd'
You can get access to the character's general category with unicodedata.category(), which you'd then need to check against the documented categories:
>>> unicodedata.category('\u00BD')
'No'
where 'No' stands for Number, Other.
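To compare the general category of a few characters at once, something like the following may help. It is only a sketch: describe is a made-up helper name, and the sample characters are chosen purely for illustration.

```python
import unicodedata


def describe(ch):
    """Return (general category, official name) for a single character.

    Hypothetical helper; it just combines two unicodedata lookups.
    """
    return unicodedata.category(ch), unicodedata.name(ch)


# Both characters from the question, an ASCII digit, and a letter.
for ch in ['\u00BD', '\u00B2', '5', '\u00DB']:
    category, name = describe(ch)
    print('U+{:04X} {} {}'.format(ord(ch), category, name))
```

Both characters from the question come out as 'No', the ASCII digit as 'Nd' (Number, Decimal Digit), and the letter as 'Lu'.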
However, there are a number of characters in the category Lo for which .isnumeric() == True; the Python unicodedata database only gives you access to the general category and relies on the str.isdigit(), str.isnumeric(), unicodedata.digit(), unicodedata.numeric(), etc. methods to handle the additional categories.
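As an illustration of that mismatch, here is a small sketch using U+4E09 (the CJK ideograph for "three"), picked as one example of a letter-category character that still carries a numeric value:

```python
import unicodedata

ch = '\u4E09'  # CJK ideograph meaning "three"

# Collect the general category next to the numeric-related answers.
info = {
    'category': unicodedata.category(ch),
    'isnumeric': ch.isnumeric(),
    'isdigit': ch.isdigit(),
    'isdecimal': ch.isdecimal(),
    'numeric': unicodedata.numeric(ch, None),
}
print(info)
```

The general category alone ('Lo') gives no hint that the character is numeric; only isnumeric() and unicodedata.numeric() reveal it.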
If you want a precise list of all numeric Unicode characters, the canonical source is the Unicode database, a series of text files that define the whole of the standard. The DerivedNumericTypes.txt file (v. 6.3.0) gives you a 'view' on that database specific to the numeric properties; it tells you at the top how the file is derived from other data files in the standard. Ditto for the DerivedNumericValues.txt file, which lists the exact numeric value per codepoint.
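Without downloading those files, a rough approximation is to scan every codepoint with the database bundled in the running interpreter. This is only a sketch: numeric_characters is a hypothetical helper, and the result reflects whichever Unicode version your Python build ships (see unicodedata.unidata_version), which may lag behind the current standard.

```python
import sys
import unicodedata
from collections import Counter


def numeric_characters():
    """Yield every character for which str.isnumeric() is true.

    Hypothetical helper; it walks the full codepoint range of this
    interpreter, so it takes a moment to run.
    """
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        if ch.isnumeric():
            yield ch


# Count the numeric characters per general category.
counts = Counter(unicodedata.category(ch) for ch in numeric_characters())
print('Unicode version:', unicodedata.unidata_version)
for category, total in sorted(counts.items()):
    print(category, total)
```

The per-category totals differ between Python versions, so they are not reproduced here.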
unicodedata.category('\u00DB') == 'Lu', not No (it would be true for '\u00BD'). format(ord('\u00BD'), '04x') seems unrelated to the question. – Feleciafeledy
The docs explicitly specify the relation between the methods and the Numeric_Type property.
def is_decimal(c):
    """Whether input character is Numeric_Type=decimal."""
    return c.isdecimal()  # it means General Category=Decimal Number in Python

def is_digit(c):
    """Whether input character is Numeric_Type=digit."""
    return c.isdigit() and not c.isdecimal()

def is_numeric(c):
    """Whether input character is Numeric_Type=numeric."""
    return c.isnumeric() and not c.isdigit() and not c.isdecimal()
Example:
>>> for c in '\u00BD\u00B2':
...     print("{}: Numeric: {}, Digit: {}, Decimal: {}".format(
...         c, is_numeric(c), is_digit(c), is_decimal(c)))
...
½: Numeric: True, Digit: False, Decimal: False
²: Numeric: False, Digit: True, Decimal: False
I'm not sure Decimal Number and Numeric_Type=Decimal will always be identical.
Note: '\u00B2' is not decimal because superscripts are explicitly excluded by the standard; see 4.6 Numerical Value (Unicode 6.2).
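Because the three types are mutually exclusive, the same checks can be folded into one classifier. The sketch below assumes that the str methods track Numeric_Type the way the docs describe; numeric_type is a hypothetical name, and the test order (decimal, then digit, then numeric) is what keeps the results exclusive.

```python
def numeric_type(c):
    """Return an approximate Numeric_Type label for one character.

    Hypothetical helper built on str.isdecimal/isdigit/isnumeric;
    checks run from the narrowest type to the broadest.
    """
    if c.isdecimal():
        return 'Decimal'
    if c.isdigit():
        return 'Digit'
    if c.isnumeric():
        return 'Numeric'
    return 'None'


# ASCII zero, circled one, superscript two, one half, and a letter.
for c in ['0', '\u2460', '\u00B2', '\u00BD', 'x']:
    print(repr(c), numeric_type(c))
```

Here '0' is classified as Decimal, both the circled one and the superscript two as Digit, and the fraction as Numeric.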
Decimal. Can you come up with a third example? – Pulsation
is_digit('0') is False – Pulsation
'0' has the property Numeric_Type=decimal (decimal digit). is_digit(c) returns whether Numeric_Type=digit (decimal, but in a typographic context, e.g., ①); they are mutually exclusive. Which characters have which Numeric_Type is defined in the Unicode standard. – Feleciafeledy