What is the practical difference between these two ways of making web connections in Python?

I have noticed there are several ways to initiate HTTP connections for web scraping. I am not sure whether some are more recent and up-to-date ways of coding, or whether they are just different modules with different advantages and disadvantages. More specifically, I am trying to understand the differences between the following two approaches. Which would you recommend?

1) Using urllib3:

from urllib3 import PoolManager
from bs4 import BeautifulSoup
http = PoolManager()
r = http.urlopen('GET', url, preload_content=False)
soup = BeautifulSoup(r, "html.parser")

2) Using requests:

import requests
from bs4 import BeautifulSoup
html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")

What sets these two options apart, besides the simple fact that they require importing different modules?

Weathertight asked 29/4, 2016 at 11:24 Comment(4)
The requests module uses urllib3 under the hood (and bundles/vendorizes it), but it provides a slightly higher-level and simpler API on top of it. (Quest)
Setting aside the fact that requests provides a higher-level API, probably with a bit less code, are there situations where it would be preferable to opt for one or the other? Or is it generally better to go entirely with requests? (Weathertight)
My recommendation is to always use requests. It just makes HTTP very pleasant to deal with, and if there's something you can't do with requests that you can with plain urllib3, I haven't come across it yet. But that's just my opinion. (Quest)
requests vendors a bunch of libraries (including urllib3, certifi, etc.) and pre-configures them for you. If you need lower-level access, or just like having control over what's going on, you can use urllib3 directly. (Avent)
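
As a quick illustration of the vendoring point, here is a minimal sketch; it assumes a requests build that exposes the bundled (or aliased) urllib3 under requests.packages, and example.com is only a placeholder URL.

import requests
import urllib3

# The urllib3 that requests rides on is reachable under requests.packages
# (a real bundled copy in older releases, an alias to the top-level package in newer ones).
print(requests.packages.urllib3.__version__)

# The same library used directly, for explicit, lower-level control:
http = urllib3.PoolManager()
resp = http.request('GET', 'https://example.com/')
print(resp.status, len(resp.data))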

Under the hood, requests uses urllib3 to do most of the HTTP heavy lifting. When used properly, the two should behave mostly the same, unless you need more advanced configuration.
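
You can peek at that relationship from a Session object. This is only an illustrative sketch: get_adapter() is public, but the adapter's poolmanager attribute is an implementation detail, and example.com is a placeholder.

import requests

session = requests.Session()
# Each Session mounts HTTPAdapter objects, and each HTTPAdapter keeps a
# urllib3 PoolManager that does the actual connection pooling.
adapter = session.get_adapter('https://example.com/')
print(type(adapter.poolmanager))  # a urllib3 PoolManager (module path may differ in vendored builds)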

Except that in your particular example, they're not the same:

In the urllib3 example, you're re-using connections, whereas in the requests example you're not. Here's how you can tell:

>>> import requests
>>> requests.packages.urllib3.add_stderr_logger()
2016-04-29 11:43:42,086 DEBUG Added a stderr logging handler to logger: requests.packages.urllib3
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,043 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,158 DEBUG "GET / HTTP/1.1" 200 None
>>> requests.get('https://www.google.com/')
2016-04-29 11:45:59,815 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:45:59,925 DEBUG "GET / HTTP/1.1" 200 None

To start re-using connections the way a urllib3 PoolManager does, you need to make a requests Session.

>>> session = requests.session()
>>> session.get('https://www.google.com/')
2016-04-29 11:46:49,649 INFO Starting new HTTPS connection (1): www.google.com
2016-04-29 11:46:49,771 DEBUG "GET / HTTP/1.1" 200 None
>>> session.get('https://www.google.com/')
2016-04-29 11:46:50,548 DEBUG "GET / HTTP/1.1" 200 None

Now it's equivalent to what you were doing with http = PoolManager(). One more note: urllib3 is a lower-level, more explicit library, so you create the pool yourself and you'll need to explicitly specify your SSL certificate location, for example. It's an extra line or two of work, but also a fair bit more control if that's what you're looking for.

All said and done, the comparison becomes:

1) Using urllib3:

import urllib3, certifi
from bs4 import BeautifulSoup
http = urllib3.PoolManager(ca_certs=certifi.where())
html = http.request('GET', url).data  # .data is the preloaded response body
soup = BeautifulSoup(html, "html5lib")

2) Using requests:

import requests
from bs4 import BeautifulSoup
session = requests.session()
html = session.get(url).content
soup = BeautifulSoup(html, "html5lib")
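
As a small follow-up, the requests variant can also manage the pooled connections with a context manager, so they are closed when you're done. A minimal sketch, with url standing in for whatever page you're scraping:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'  # placeholder

# Session supports the with statement; the underlying connection pool
# is closed automatically when the block exits.
with requests.Session() as session:
    html = session.get(url).content

soup = BeautifulSoup(html, "html5lib")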
Avent answered 29/4, 2016 at 18:50 Comment(0)
