Python: How to check if a string is a valid IRI?

Asked 24/9, 2012 at 12:31 Answered 24/9, 2012 at 12:46

Is there a standard function to check whether a string is a valid IRI? To check a URL, apparently I can use:

import urlparse

parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    pass  # apparently not a URL

I tried the above with a URL containing Unicode characters:

import urlparse
url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:  
    print "not an url"
else:
    print "yes an url"

and what I get is yes an url. Does this mean I'm good and that this tests for a valid IRI? Is there another way?

Ardeliaardelis answered 24/9, 2012 at 12:31 Comment(2)

Why shouldn't you be good? Does your example violate any rule defined by the IRI standard? In other words: are you asking us if your test breaks any IRI rules? Did you perform this research yourself? – Parks 24/9, 2012 at 12:38

@Jan-PhilipGehrcke I am asking someone who has more experience than me with IRI, if I am good with this. – Ardeliaardelis 24/9, 2012 at 12:40

Using urlparse is not sufficient to test for a valid IRI.
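To illustrate why, here is a minimal sketch (Python 3 spelling, where the urlparse module lives in urllib.parse; the helper name looks_like_url is made up for this example). The scheme/netloc check only confirms that the string splits into parts. It does not validate the characters inside them, so a string with unescaped spaces passes even though a space is not allowed in a URI or an IRI.

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def looks_like_url(text):
    """Hypothetical helper: the scheme/netloc check from the question."""
    parts = urlsplit(text)
    return bool(parts.scheme and parts.netloc)


# Unescaped spaces are not allowed in a URI or an IRI, yet the check
# passes because urlsplit only splits the string, it does not validate it.
print(looks_like_url("http://exa mple.com/a b"))

# A string without a scheme and authority is rejected, which is
# all this check can really tell us.
print(looks_like_url("no-scheme"))
```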

Use the rfc3987 package instead:

from rfc3987 import parse

parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')
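A small wrapper can turn that call into a yes/no check. This is only a sketch: is_valid_iri is a name invented here, and it relies on parse raising ValueError for input that does not match the rule. The import is guarded purely so that the snippet still runs (and returns None) when the third-party package is not installed.

```python
try:
    from rfc3987 import parse  # third-party: pip install rfc3987
except ImportError:
    parse = None  # assumption for this sketch: degrade instead of crashing


def is_valid_iri(text):
    """Hypothetical helper: True/False, or None if rfc3987 is unavailable."""
    if parse is None:
        return None
    try:
        parse(text, rule="IRI")
    except ValueError:
        # parse rejects strings that do not match the requested rule
        return False
    return True


print(is_valid_iri("http://example.com/"))
print(is_valid_iri("http://exa mple.com/a b"))
```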

Knackwurst answered 24/9, 2012 at 12:46 Comment(6)

ImportError: No module named rfc3987. So it is not in the standard library; install it with pip install rfc3987 – Ardeliaardelis 24/9, 2012 at 12:52

You have to install the package he links to – Boulware 24/9, 2012 at 12:53

Works (+1), accepted. And you are right that using urlparse is not sufficient to test for a valid IRI: with the code provided above, the url string is not a valid IRI. – Ardeliaardelis 24/9, 2012 at 13:0

But the escaped version works. With parse('http://fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com/%C4%83%C3%AE%C4%83%C3%AE', rule='IRI') I get:

{'fragment': None, 'path': '/%C4%83%C3%AE%C4%83%C3%AE', 'scheme': 'http', 'authority': 'fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com', 'query': None}

– Ardeliaardelis 24/9, 2012 at 13:12

Notably, SSH format doesn't comply with URI or IRI: unix.stackexchange.com/q/75668/61349 – Ardor 9/10, 2015 at 23:9

I only wish that more people would find this answer when googling. – Tallbot 14/6, 2017 at 19:1

The only character-set-sensitive code in the implementation of urlparse is the requirement that the scheme contain only ASCII letters, digits and [+-.] characters; otherwise it is completely agnostic, so it will work fine with non-ASCII characters.

As this is undocumented behaviour, it's your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed to break IRIs.
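Such a test could look like the sketch below (Python 3 spelling, urllib.parse instead of urlparse; the function names are arbitrary). It pins down the two behaviours described here: non-ASCII characters pass through the split untouched, and a scheme containing a non-ASCII letter is not recognised as a scheme.

```python
from urllib.parse import urlsplit


def test_urlsplit_keeps_non_ascii():
    # Host and path contain non-ASCII characters and must come back unchanged.
    parts = urlsplit("http://fdasdf.fdsfîășîs.fss/ăîăî")
    assert parts.scheme == "http"
    assert parts.netloc == "fdasdf.fdsfîășîs.fss"
    assert parts.path == "/ăîăî"


def test_non_ascii_scheme_is_not_recognised():
    # The scheme is the one place where only ASCII is accepted, so
    # nothing is split off and scheme and netloc both stay empty.
    parts = urlsplit("hțțp://example.com/")
    assert parts.scheme == ""
    assert parts.netloc == ""


# Run directly; a test runner such as pytest would also collect these.
test_urlsplit_keeps_non_ascii()
test_non_ascii_scheme_is_not_recognised()
```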

urllib provides quoting functions to convert IRIs to/from ASCII URIs, although the documentation still doesn't mention IRIs explicitly, and they are broken in some cases. See: "Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?"

Coequal answered 24/9, 2012 at 12:41 Comment(2)

urllib.quote(url) seems to escape the : colon in the http:// to http%3A// – Ardeliaardelis 24/9, 2012 at 13:15

@EduardFlorinescu yes, by default it only works for quoting the path section of an IRI; for a full IRI you'd need to parse, quote, and reassemble the components. – Coequal 24/9, 2012 at 13:28
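The parse, quote, and reassemble approach can be sketched roughly as follows (Python 3, where quote and urlsplit live in urllib.parse). The name iri_to_uri and the sets of safe characters are assumptions made for this example, not something urllib defines. The host is percent-encoded to mirror the escaped form shown in the comments on the other answer; converting the host with IDNA instead is another option and is not attempted here.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone in each component. "%" is kept so that
# escapes already present are not encoded a second time.
_SAFE_NETLOC = "%:@[]"
_SAFE_PATH = "%/:@!$&'()*+,;="
_SAFE_QUERY = _SAFE_PATH + "?"


def iri_to_uri(iri):
    """Hypothetical helper: percent-encode each component separately."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        quote(parts.netloc, safe=_SAFE_NETLOC),
        quote(parts.path, safe=_SAFE_PATH),
        quote(parts.query, safe=_SAFE_QUERY),
        quote(parts.fragment, safe=_SAFE_QUERY),
    ))


# Quoting the whole string also escapes the colon after the scheme.
print(quote("http://fdasdf.fdsfîășîs.com/ăîăî"))

# Quoting component by component leaves the delimiters intact.
print(iri_to_uri("http://fdasdf.fdsfîășîs.com/ăîăî"))
```

quote encodes non-ASCII characters as UTF-8 by default, which is why no explicit encoding step appears in the sketch.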
