urllib2 read to Unicode

Asked 20/6, 2009 at 3:46 Answered 21/12, 2013 at 2:23

I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.

I have tried something like:

import urllib2

req = urllib2.urlopen('http://lenta.ru')
content = req.read()

The content is a byte stream, so I can't search it for a Unicode string.

I need a way, when I call urlopen and then read, to use the charset from the headers to decode the content and then encode it as UTF-8.

Biller answered 20/6, 2009 at 3:46 Comment(3)

The encoding is done using a function from the urllib library not from urllib2. From voidspace.org.uk/python/articles/urllib2.shtml#headers – Symbolics 20/6, 2009 at 3:55

@Symbolics this is not the encoding that Vitaly refers to; he is referring to decoding and encoding the actual request content with '[byte string]'.decode('[charset]') and u'[unicode string]'.encode('utf-8'). You are referring to encoding request parameters. – Columbium 8/5, 2012 at 13:57

related: A good way to get the charset/encoding of an HTTP response in Python – Auster 19/8, 2016 at 9:28


After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

and so:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters) -- so for example to display a part of it, if your terminal is UTF-8:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

and you can search, etc, etc.

Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker) but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter (completely unrelated to the original question) to show how, once a Unicode string IS correctly input (I'm doing it by codepoints -- goofy but not tricky;-), search is absolutely a no-brainer (and thus hopefully the original question has been thoroughly answered). Again assuming a UTF-8 terminal:

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
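For anyone reading this on Python 3, here is a minimal sketch of the same search that runs without a network connection. The byte string raw is only a stand-in for req.read(), and the windows-1251 charset is assumed to be the one reported by the header shown above. It illustrates that the search works once both sides are text, or once both sides are bytes in the same encoding:

```python
# Stand-in for req.read(); built in memory so no network access is needed.
# The escapes spell the Russian word used in the answer above.
needle = '\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
raw = ('<title>Lenta.ru: ' + needle + ': </title>').encode('windows-1251')

# Python 3 refuses to search bytes for a str instead of guessing an encoding.
try:
    needle in raw
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True

# Decode with the charset announced by the server, then search as text.
text = raw.decode('windows-1251')
found = needle in text
position = text.find(needle)

# Searching the raw bytes only works if the needle uses the page's encoding.
same_encoding_hit = needle.encode('windows-1251') in raw
wrong_encoding_hit = needle.encode('utf-8') in raw
```

Decoding once and keeping everything as text is usually simpler than encoding every search term to match each page.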

Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).

Lysin answered 20/6, 2009 at 4:17 Comment(6)

Hey Alex, thanks for the reply. But if I do: u'Главное' in ucontent it returns False. Is there a better way to do the search? – Biller 20/6, 2009 at 4:28

How are you inputting that u'...' string? Unicode I/O is tricky, as your terminal AND Python must be on identical wavelengths. Using explicit Unicode codepoints (boring but NOT tricky) works fine, let me edit my answer to show that. – Lysin 20/6, 2009 at 4:47

I am inputting it from the console. If I need to do this for a unit test, what should I set the coding: line to at the top of the file? – Biller 20/6, 2009 at 5:9

Depends entirely on how your terminal/console's encoding is set up! See python.org/dev/peps/pep-0263 -- e.g. for utf-8 use the comment # -*- coding: utf-8 -*- at file start. – Lysin 20/6, 2009 at 5:44

Using .split on the response header to extract the charset parameter is cheating. What if there is another parameter after a semicolon? – Nacelle 28/6, 2014 at 8:33

As @RolandIllig points out, parsing content-type like this is unreliable; besides, Python provides functions to do it. See this answer. – Kaete 11/7, 2015 at 2:24

To parse the Content-Type HTTP header, you could use the cgi.parse_header function:

import cgi
import urllib2

r = urllib2.urlopen('http://lenta.ru')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset', 'utf-8')
unicode_text = r.read().decode(encoding)
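To illustrate why a real header parser is safer than splitting on 'charset=', here is a small Python 3 sketch. It uses email.message.Message from the standard library rather than cgi, since the cgi module was removed in Python 3.13. The helper name charset_from_content_type and the sample header value are made up for this example:

```python
from email.message import Message


def charset_from_content_type(value, default='utf-8'):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper for illustration; falls back to `default`
    when the header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset(default)


# A made-up header with an extra parameter after the charset.
header = 'text/html; charset=windows-1251; boundary=x'

# Naive splitting keeps everything after 'charset=', including the extra parameter.
naive = header.split('charset=')[-1]

# The header parser returns only the charset value.
parsed = charset_from_content_type(header)

# Quoted and upper-case values are normalized; a missing charset yields the default.
quoted = charset_from_content_type('text/html; charset="UTF-8"')
missing = charset_from_content_type('text/html')
```

Here naive ends up as 'windows-1251; boundary=x', which is not a valid codec name, while parsed is just 'windows-1251'.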

Another way to get the charset:

>>> import urllib2
>>> r = urllib2.urlopen('http://lenta.ru')
>>> r.headers.getparam('charset')
'utf-8'

Or in Python 3:

>>> import urllib.request
>>> r = urllib.request.urlopen('http://lenta.ru')
>>> r.headers.get_content_charset()
'utf-8'

The character encoding can also be specified inside the HTML document itself, e.g., <meta charset="utf-8">.
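If the header has no charset, one possible fallback is to look for such a meta tag in the first bytes of the body. The following is a rough Python 3 sketch, not a full HTML parser: the function names sniff_meta_charset and decode_html, the regular expression, and the 1024-byte limit are all assumptions chosen for this example.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset named in a meta tag near the top of `raw`, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode('ascii').lower() if match else None


def decode_html(raw, header_charset=None, default='utf-8'):
    """Decode `raw` bytes, preferring the header charset, then a meta tag."""
    for candidate in (header_charset, sniff_meta_charset(raw)):
        if not candidate:
            continue
        try:
            return raw.decode(candidate, errors='replace')
        except LookupError:
            # Unknown codec name; try the next candidate.
            continue
    return raw.decode(default, errors='replace')


# In-memory stand-in for a response body served without a charset header.
needle = '\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1251">'
        '<title>' + needle + '</title></head></html>').encode('windows-1251')

text = decode_html(page)
```

A regular expression like this can be fooled by comments or scripts, so a proper HTML parser or a detection library is a better choice when accuracy matters.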

Auster answered 21/12, 2013 at 2:23 Comment(0)
