Python: How to check if a string is a valid IRI?
A

2

14

Is there a standard function to check whether a string is a valid IRI? To check a URL, apparently I can use:

parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    '''apparently not an url'''

I tried the above with a URL containing Unicode characters:

import urlparse
url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:  
    print "not an url"
else:
    print "yes an url"

and what I get is "yes an url". Does this mean I'm good and this tests for a valid IRI? Is there another way?

Ardeliaardelis answered 24/9, 2012 at 12:31 Comment(2)
Why shouldn't you be good? Does your example violate any rule defined by the IRI standard? In other words: are you asking us if your test breaks any IRI rules? Did you perform this research yourself?Parks
@Jan-PhilipGehrcke I am asking someone who has more experience than me with IRI, if I am good with this.Ardeliaardelis
K
20

Using urlparse is not sufficient to test for a valid IRI.

Use the rfc3987 package instead:

from rfc3987 import parse

parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')
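For a yes/no answer, the call above can be wrapped in a small helper. This is only a sketch, assuming Python 3 and that the third-party rfc3987 package is installed; the name is_valid_iri is made up for illustration. It relies on parse raising ValueError when the string does not match the requested rule. Note that rule='IRI' requires a scheme, while rule='IRI_reference' would also accept relative references.

```python
def is_valid_iri(text):
    """Return True if text matches the 'IRI' rule of rfc3987, else False.

    is_valid_iri is a hypothetical helper name, not part of the package.
    """
    # Third-party dependency (pip install rfc3987). It is imported here so
    # that a missing package shows up as ImportError when the check is run.
    from rfc3987 import parse

    try:
        parse(text, rule="IRI")
    except ValueError:
        # parse() raises ValueError when the string does not match the rule.
        return False
    return True
```

Usage would then be something like is_valid_iri(some_string) inside an if statement, instead of inspecting scheme and netloc by hand.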
Knackwurst answered 24/9, 2012 at 12:46 Comment(6)
I get ImportError: No module named rfc3987, so it is not part of the standard library; it has to be installed with pip install rfc3987.Ardeliaardelis
You have to install the package he links toBoulware
Works (+1), accepted. You are right that using urlparse is not sufficient to test for a valid IRI: with the code provided above, my URL string is reported as not being a valid IRI.Ardeliaardelis
But escaped works: parse('http://fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com/%C4%83%C3%AE%C4%83%C3%AE', rule='IRI') I get: {'fragment': None, 'path': '/%C4%83%C3%AE%C4%83%C3%AE', 'scheme': 'http', 'authority': 'fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com', 'query': None}Ardeliaardelis
Notably, SSH format doesn't comply with URI or IRI: unix.stackexchange.com/q/75668/61349Ardor
I only wish that more people would find this answer when googling.Tallbot
C
1

The only character-set-sensitive code in the implementation of urlparse is requiring that the scheme should contain only ASCII letters, digits and [+-.] characters; otherwise it's completely agnostic so will work fine with non-ASCII characters.

As this is undocumented behaviour, it's your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed to break IRIs.
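A project-level test for this could look roughly like the sketch below. It assumes Python 3, where the urlparse module became urllib.parse; the test function names are made up for illustration. The first test pins down that non-ASCII characters pass through the host and path untouched, the second that a scheme containing non-ASCII letters is not recognised as a scheme at all.

```python
from urllib.parse import urlsplit


def test_urlsplit_keeps_non_ascii_host_and_path():
    parts = urlsplit("http://fdasdf.fdsfîășîs.fss/ăîăî")
    assert parts.scheme == "http"
    # Host and path come back exactly as given, with no encoding applied.
    assert parts.netloc == "fdasdf.fdsfîășîs.fss"
    assert parts.path == "/ăîăî"


def test_urlsplit_rejects_non_ascii_scheme():
    # 'ț' is not an ASCII letter, so no scheme is split off and the whole
    # string ends up in the path component.
    parts = urlsplit("hțțp://example.com/")
    assert parts.scheme == ""
    assert parts.netloc == ""
    assert parts.path == "hțțp://example.com/"
```

These are plain functions with assert statements, so a runner such as pytest would pick them up, or they can simply be called directly.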

urllib provides quoting functions to convert IRIs to and from ASCII URIs, although its documentation still doesn't mention IRIs explicitly, and the functions are broken in some cases; see the question "Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?"

Coequal answered 24/9, 2012 at 12:41 Comment(2)
urllib.quote(url) seems to escape the : colon in the http:// to http%3A//Ardeliaardelis
@EduardFlorinescu yes, by default it only works for quoting the path section of an IRI; for a full IRI you'd need to parse, quote, and reassemble the components.Coequal
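The parse, quote, and reassemble approach can be sketched as follows. This is an illustration under assumptions, not a complete implementation: it assumes Python 3 (urllib.parse instead of urlparse and urllib), the name iri_to_uri is made up, the host is converted with the standard library's idna codec, and the authority is assumed to be a plain host with an optional port (no userinfo, no IPv6 literal). "%" is kept safe so that input that is already percent-encoded is not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in each component (in addition to letters,
# digits and "_.-~", which quote() never escapes).
PATH_SAFE = "/%:@!$&'()*+,;=~"
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI, one component at a time.

    iri_to_uri is a hypothetical helper; it only handles authorities of
    the form host or host:port.
    """
    parts = urlsplit(iri)

    netloc = parts.netloc
    if parts.hostname:
        # Internationalised host names are encoded with IDNA, not with
        # percent-escapes.
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else "%s:%d" % (host, parts.port)

    # quote() encodes non-ASCII characters as percent-escaped UTF-8 bytes.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.com/ăîăî"))
# http://example.com/%C4%83%C3%AE%C4%83%C3%AE
```

Because the scheme is passed through untouched and only the remaining components are quoted, the colon after http is no longer escaped.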
