Non-ASCII Python identifiers and reflectivity [duplicate]

I learnt from PEP 3131 that non-ASCII identifiers are supported in Python, though using them is not considered best practice.

However, I get this strange behaviour, where my 𝜏 identifier (U+1D70F) seems to be automatically converted to τ (U+03C4).

class Base(object):
    def __init__(self):
        self.𝜏 = 5 # defined with U+1D70F

a = Base()
print(a.𝜏)     # 5             # (U+1D70F)
print(a.τ)     # 5 as well     # (U+03C4) ? another way to access it?
d = a.__dict__ # {'τ':  5}     # (U+03C4) ? seems converted
print(d['τ'])  # 5             # (U+03C4) ? consistent with the conversion
print(d['𝜏'])  # KeyError: '𝜏' # (U+1D70F) ?! unexpected!

Is that expected behaviour? Why does this silent conversion occur? Does it have anything to do with NFKC normalization? I thought that was only for canonically ordering Unicode character sequences...

Ortego answered 2/1, 2018 at 14:51 Comment(5)
Does defining an encoding make a difference? 03C4 is definitely the decomposition of 1D70F, and it looks from the reference like some normalization happens. – Aware
Your theory seems to be correct. It seems that the Python interpreter normalises your Unicode identifier as soon as it is assigned. If you put print(dir(a)) after a has been assigned, you can see there is no trace of the U+1D70F character in the class. Your second print statement would then work for the same reason (it gets normalised), while your dictionary access fails because dictionaries can take anything as keys, and there would be no reason to normalise or otherwise alter the key, as it is a string in quotes. – Limbic
@Aware Nope, defining # -*- coding: utf-8 -*- makes no difference. Maybe NFKC is responsible.. but I thought canonisation was just about reordering, not changing the actual character.. 8) – Ortego
@Limbic I guess you're right as well.. but it leads to a quite unexpected behaviour when it comes to indexing __dict__, don't you find? – Ortego
Not at all. As the answer explains, there is no automatic normalisation of string literals, and it would be completely inappropriate to do so anyway. – Limbic

Per the documentation on identifiers:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

You can see that U+03C4 is the appropriate result using unicodedata:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', '𝜏')
'τ'
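On the question of whether normalization only reorders characters: a small sketch comparing the four normalization forms on this character may help. The variable names are illustrative only. The canonical forms (NFC, NFD) should leave U+1D70F untouched, while the compatibility forms (NFKC, NFKD) replace it with U+03C4, because the mathematical italic letter has a compatibility (font) decomposition rather than a canonical one.

```python
import unicodedata

math_tau = "\U0001D70F"   # 𝜏, the mathematical italic small tau from the question
greek_tau = "\u03C4"      # τ, the plain Greek small letter tau

# Apply each normalization form to the single character and record the result.
results = {
    form: unicodedata.normalize(form, math_tau)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in results.items():
    # Print the code point of the (single-character) result for each form.
    print(form, "U+%04X" % ord(value))
```

Only the "K" (compatibility) forms change the character itself, which is why the identifier ends up stored under a different code point.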

However, this conversion doesn't apply to string literals, like the one you're using as a dictionary key, hence it's looking for the unconverted character in a dictionary that only contains the converted character.

self.𝜏 = 5  # implicitly converted to "self.τ = 5"
a.𝜏  # implicitly converted to "a.τ"
d['𝜏']  # not converted

You can see similar problems with e.g. string literals used with getattr:

>>> getattr(a, '𝜏')
Traceback (most recent call last):
  File "python", line 1, in <module>
AttributeError: 'Base' object has no attribute '𝜏'
>>> getattr(a, unicodedata.normalize('NFKC', '𝜏'))
5
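If dynamic attribute access with non-ASCII names is needed, one possible workaround is to normalise the name yourself before the lookup. The sketch below assumes a hypothetical helper, getattr_nfkc, which is not part of the standard library; it simply applies the same NFKC form that the parser applies to identifiers.

```python
import unicodedata


def getattr_nfkc(obj, name, *default):
    """Hypothetical helper: getattr() with the name normalised to NFKC first."""
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)


class Base:
    def __init__(self):
        self.𝜏 = 5  # written with U+1D70F, stored by the parser as U+03C4


a = Base()
print(getattr_nfkc(a, "𝜏"))              # name is normalised, so the lookup succeeds
print(getattr_nfkc(a, "missing", None))  # the optional default still works
```

The same normalisation step can be applied to a key before indexing a.__dict__ directly.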
Aware answered 2/1, 2018 at 15:10 Comment(7)
Well, that's interesting. Cheers :) I'll keep thinking that it's a weird behaviour anyway. If 𝜏 was the only character I could access on my keyboard, I couldn't use python reflective __dict__ or getattr features like anybody else.. Should I file this as a bug to python? – Ortego
@Ortego I'm not sure they'd consider it a bug, given that this is the documented behaviour. It certainly surprised me, though! And it makes dynamic attribute access (see the getattr example) a little more complex than initially expected. I guess this is why ASCII identifiers are still recommended; no more from math import pi as π for me! – Aware
I'll inform them anyway :) What's the best place to do so? – Ortego
@Ortego anything like that should go through bugs.python.org; have a look around, there may be a similar issue logged already. – Aware
Great. Here it is. Thanks again :) – Ortego
@Ortego I'd guess it'll get closed against e.g. bugs.python.org/issue13793 – Aware
Crab! Missed that one :\ You're right. – Ortego
