What does read() in urlopen('http.....').read() do? [urllib]

Asked 8/3, 2016 at 9:30 Answered 8/3, 2016 at 9:45

Hi, I'm reading "Web Scraping with Python" (2015). I saw the following two ways of opening a URL, with and without .read(). See bs1 and bs2:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2 # True


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

So is .read() redundant? Thanks

Code on p. 7 of Web Scraping with Python (uses .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code on p. 15 (without .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

Septi answered 8/3, 2016 at 9:30 Comment(3)

In addition to the answers above, I suggest you try the requests library for HTTP requests: docs.python-requests.org/en/latest. You'll have more control over the HTTP response. – Zabaglione 8/3, 2016 at 15:4

Thanks @A.Romeu, could you point me to a post with more info, please? In the next step I need to fill in a form and get the response page, for which I plan to use mechanize. – Septi 8/3, 2016 at 23:27

On the link I sent you, there is a lot of information on how to use it, under the section 'The User Guide'. You can start directly with docs.python-requests.org/en/latest/user/quickstart/… – Zabaglione 9/3, 2016 at 8:36

Quoting BS docs:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

When you call the .read() method, you use the "string" interface. When you don't, you use the "filehandle" interface.

Effectively it works the same way (although BS4 may read a file-like object lazily). In your case the whole content is read into a string object, which may consume more memory unnecessarily.

Aylmar answered 8/3, 2016 at 9:38 Comment(0)

urllib.request.urlopen returns a file-like object; its read method returns the response body of that URL as bytes.

The BeautifulSoup constructor accepts either a string or an open filehandle, so yes, read() is redundant here.
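
A minimal sketch of that file-like behaviour, assuming a data: URL as an offline stand-in for a real web page (the URL and variable names here are illustrative and not from the question). It shows that read() returns the body as bytes and that the stream can only be consumed once:

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real page so the example runs offline.
url = "data:text/html," + quote("<h1>Hello</h1>")

response = urlopen(url)
body = response.read()       # the whole response body, as bytes
leftover = response.read()   # the stream has already been consumed

print(type(body).__name__, body)
print(leftover)
```

This is also why the question opens the URL twice: once the response has been read, a second parser fed the same object would see no data.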

Spanker answered 8/3, 2016 at 9:38 Comment(0)

Without BeautifulSoup Module

.read() is not redundant when you are not using the BeautifulSoup module. Only by calling .read() do you get the HTML content; without it you just have the response object returned by urlopen().
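
As a rough illustration of the no-BeautifulSoup case, the sketch below reads and decodes the body by hand. The data: URL is an assumed offline stand-in for a real page, and falling back to UTF-8 when no charset is declared is also an assumption:

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL replaces a real page so the example runs offline.
url = "data:text/html;charset=utf-8," + quote("<p>caf\u00e9</p>")

with urlopen(url) as response:
    # The response object itself is neither str nor bytes; it only wraps the stream.
    response_is_text = isinstance(response, (str, bytes))
    # Use the charset declared in the headers, else assume UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"
    html = response.read().decode(charset)

print(response_is_text)
print(html)
```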

With BeautifulSoup Module

The BeautifulSoup constructor handles both cases: it accepts either a string or the object returned by urlopen(some-site).
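
Python classes do not overload constructors, so accepting both kinds of input is usually done with a single entry point and duck typing. The helper below is hypothetical (to_markup is not a BeautifulSoup function) and only sketches the idea of checking for a read method:

```python
import io


def to_markup(source):
    """Hypothetical helper: return markup from a string or a file-like object."""
    if hasattr(source, "read"):
        # File-like input (such as a urlopen response): read it on the caller's behalf.
        return source.read()
    # Otherwise assume the caller already passed the markup itself.
    return source


from_string = to_markup("<p>hi</p>")
from_handle = to_markup(io.StringIO("<p>hi</p>"))

print(from_string == from_handle)
```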

James answered 8/3, 2016 at 9:45 Comment(0)
