I’m working on a “proper” URI validator, and at this point it all comes down to hostname validation; the rest isn’t that tricky.
I’m stuck on IDN hostname labels (i.e., labels containing Unicode; any Punycode-encoded labels have already been decoded at this point).
My first idea was basically one regex for TLDs which don’t support IDNs and one for those which do. This could perhaps be based on Mozilla’s list of IDN-enabled TLDs. Respectively:
^[a-zA-Z0-9\-]+$
^[a-zA-Z0-9\-\p{L}]+$
However, this is not ideal, since every IDN registry can decide for itself which characters to allow.
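To make the two-pattern idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `IDN_ENABLED_TLDS` set holds placeholder entries rather than real registry data, and the names `is_letter` and `label_matches` are made up for this example. Python's standard `re` module does not support `\p{L}`, so the sketch checks the Unicode general category with `unicodedata` instead.

```python
import unicodedata

# Placeholder entries for illustration only; a real validator would load
# an actual list of IDN-enabled TLDs (e.g. one derived from Mozilla's list).
IDN_ENABLED_TLDS = {"de", "se", "jp"}

# Characters accepted by the ASCII-only pattern ^[a-zA-Z0-9\-]+$
ASCII_LABEL_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-"
)


def is_letter(ch: str) -> bool:
    """Rough equivalent of \\p{L}: general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")


def label_matches(label: str, tld: str) -> bool:
    """Apply the ASCII-only rule, or the ASCII-plus-letters rule for IDN TLDs.

    The label is assumed to be already decoded from Punycode.
    """
    if not label:
        return False  # the '+' quantifier requires at least one character
    allow_unicode = tld.lower() in IDN_ENABLED_TLDS
    for ch in label:
        if ch in ASCII_LABEL_CHARS:
            continue
        if allow_unicode and is_letter(ch):
            continue
        return False
    return True
```

For example, `label_matches("m\u00fcnchen", "de")` accepts the label because `de` is in the placeholder set, while the same label under `com` is rejected. The weakness described above is visible here: every IDN-enabled TLD gets the same "any letter" rule, whatever its registry actually permits.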
What I’m looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in each TLD. It’s beginning to look like I have to dig all the data out myself from Russian and Chinese registry sites (which is quite difficult).
So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)