How to reveal Unicode's numeric value property

'\u00BD' # ½
'\u00B2' # ²

I am trying to understand isdecimal() and isdigit() better, and for this it's necessary to understand Unicode numeric value properties. How would I see the numeric value property of, for example, the two characters above?

Allbee answered 1/4, 2014 at 15:15 Comment(0)

To get the 'numeric value' contained in the character, you could use the unicodedata.numeric() function:

>>> import unicodedata
>>> unicodedata.numeric('\u00BD')
0.5
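
As a small sketch of the related lookups: unicodedata.decimal() and unicodedata.digit() sit alongside unicodedata.numeric(), and all three accept a default that is returned instead of raising ValueError when the character has no such value. The helper name numeric_properties is made up for this illustration.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch; None where a value is undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# Compare an ASCII digit, SUPERSCRIPT TWO and VULGAR FRACTION ONE HALF.
for ch in '5\u00B2\u00BD':
    print('U+{:04X}'.format(ord(ch)), numeric_properties(ch))
```

For '\u00BD' only the numeric slot is filled, while '\u00B2' has a digit and a numeric value but no decimal value.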

Use the ord() function to get the integer codepoint, optionally in combination with format() to produce a hexadecimal value:

>>> ord('\u00BD')
189
>>> format(ord('\u00BD'), '04x')
'00bd'

You can get access to the character property with unicodedata.category(), which you'd then need to check against the documented categories:

>>> unicodedata.category('\u00BD')
'No'

where 'No' stands for Number, Other.

However, there are a number of characters in the category Lo for which .isnumeric() is True; the Python unicodedata database only gives you access to the general category, so you have to rely on str.isdigit(), str.isnumeric(), unicodedata.digit(), unicodedata.numeric(), etc. to handle the additional cases.
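
To see which general categories actually contain numeric characters, one option is to scan every code point and tally the category of each character for which str.isnumeric() is true. This is only a sketch; the exact counts depend on the Unicode version bundled with your Python, so no figures are shown here.

```python
import sys
import unicodedata
from collections import Counter

# Tally the general category of every code point that str.isnumeric() accepts.
numeric_by_category = Counter(
    unicodedata.category(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isnumeric()
)

print('Unicode version:', unicodedata.unidata_version)
for category, count in sorted(numeric_by_category.items()):
    print(category, count)
```

The Lo entries come from characters such as the CJK ideograph '\u4e00' (one), which is a letter by category but still carries a numeric value.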

If you want a precise list of all numeric Unicode characters, the canonical source is the Unicode database, a series of text files that define the whole of the standard. The DerivedNumericType.txt file (v. 6.3.0) gives you a 'view' on that database specific to the numeric properties; it tells you at the top how the file is derived from other data files in the standard. Ditto for the DerivedNumericValues.txt file, which lists the exact numeric value per codepoint.

Conversant answered 1/4, 2014 at 15:19 Comment(4)
I think OP wants 0.5 and 2 for those code points, not their code point. - Tenstrike
@delnan: check, added that too. - Conversant
My question may be wrong then - I read about the property values Numeric_Type=Digit, Numeric_Type=Decimal, and Numeric_Type=Numeric, and I was wondering whether I could produce this property from a Unicode code point somehow? - Allbee
unicodedata.category('\u00DB') == 'Lu', not No (it would be true for '\u00BD'). format(ord('\u00BD'), '04x') seems unrelated to the question. - Feleciafeledy

The docs explicitly specify the relationship between the methods and the Numeric_Type property.

def is_decimal(c):
    """Whether input character is Numeric_Type=decimal."""
    return c.isdecimal()  # in Python this means General Category "Nd" (Decimal Number)

def is_digit(c):
    """Whether input character is Numeric_Type=digit."""
    return c.isdigit() and not c.isdecimal()


def is_numeric(c):
    """Whether input character is Numeric_Type=numeric."""
    return c.isnumeric() and not c.isdigit() and not c.isdecimal()

Example:

>>> for c in '\u00BD\u00B2':
...     print("{}: Numeric: {}, Digit: {}, Decimal: {}".format(
...         c, is_numeric(c), is_digit(c), is_decimal(c)))
... 
½: Numeric: True, Digit: False, Decimal: False
²: Numeric: False, Digit: True, Decimal: False

I'm not sure Decimal Number and Numeric_Type=Decimal will always be identical.
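
One way to check this for the Unicode data shipped with a particular interpreter is to compare the two definitions over every code point. A minimal sketch follows; decimal_mismatches is a name invented here, and the result only speaks for the Unicode version your Python was built with.

```python
import sys
import unicodedata

def decimal_mismatches():
    """Collect characters where category 'Nd' and str.isdecimal() disagree."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if (unicodedata.category(ch) == 'Nd') != ch.isdecimal():
            found.append(ch)
    return found

mismatches = decimal_mismatches()
print('Unicode', unicodedata.unidata_version, 'mismatches:', len(mismatches))
```

An empty list means the two notions coincide for that interpreter; it says nothing about other Unicode versions.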

Note: '\u00B2' is not decimal because superscripts are explicitly excluded by the standard, see 4.6 Numerical Value (Unicode 6.2).
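
The superscript case can be checked directly. In this sketch the ten superscript digits all report as digits with a digit value, yet none of them counts as decimal; the variable names are only illustrative.

```python
import unicodedata

# SUPERSCRIPT ZERO through SUPERSCRIPT NINE, written as escapes.
superscripts = '\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079'

for ch in superscripts:
    # digit() succeeds for each; decimal() falls back to the supplied default.
    print(unicodedata.name(ch), ch.isdigit(), ch.isdecimal(),
          unicodedata.digit(ch), unicodedata.decimal(ch, None))

digit_values = [unicodedata.digit(ch) for ch in superscripts]
```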

Feleciafeledy answered 1/4, 2014 at 17:59 Comment(4)
Neither of the two characters you give is Decimal. Can you come up with a third example? - Pulsation
@Pulsation here are all decimal numbers (in the Unicode standard used by python executable) - Feleciafeledy
I think I'm confused by how your is_digit('0') is False. - Pulsation
@Pulsation '0' has the property Numeric_Type=decimal (decimal digit). is_digit(c) returns whether Numeric_Type=digit (decimal, but in typographic context e.g., ); they are mutually exclusive. Which characters have which Numeric_Type is defined in the Unicode standard. - Feleciafeledy
