Passing web data into Beautiful Soup - Empty list

I've rechecked my code and looked at comparable examples of opening a URL to pass web data into Beautiful Soup, but for some reason my code just doesn't return anything, although it seems to be in the correct form:

>>> from bs4 import BeautifulSoup
>>> from urllib3 import poolmanager
>>> connectBuilder = poolmanager.PoolManager()
>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> soup = BeautifulSoup(content)
>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''
>>> content.data
a stream of data follows...

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content. It makes sense that it can read the status of the response, but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup). You can see that I've tried to read a few tags and some text; get_text() returns an empty string, which is strange.

Strangely, when I access the web data via content.data, the data shows up but it's not useful since I can't use Beautiful Soup to parse it. What is my problem? Thanks.

Subgenus answered 31/7, 2014 at 19:40 Comment(8)
It clearly is getting converted to a BeautifulSoup object—otherwise, soup.title would have raised an exception rather than giving you None. A better way to tell is to print out type(soup).Necrolatry
Your code is getting nothing; try printing content.read()Our
Is there a reason you're manually constructing a pool and then calling "the lowest level call for making a request" on it?Necrolatry
@PadraicCunningham content.read() gives b''Subgenus
b for bytes and an empty stringOur
@Necrolatry I'd also looked at the module and read about the lowest level, but I didn't understand it well; I thought urlopen() being the lowest level made it the standard choice, so I chose it.Subgenus
@user3885774: Yes, urlopen is the lowest level. Unless you have some good reason, you do not want to use the lowest level. Especially if you're just learning. That's why that same documentation recommends, at least twice, that you use one of the convenience methods. While you could learn all the nitty-gritty details of how urllib3 works under the covers, wouldn't you rather first learn how to use it the easy way, and write some working code you can play with to learn further?Necrolatry
@Necrolatry Agreed, I didn't interpret/understand the module's notes well. : )Subgenus

If you just want to scrape the page, requests will get the content you need:

from bs4 import BeautifulSoup

import requests
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content)

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'
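
Newer Beautiful Soup releases warn unless you name a parser explicitly, and it can also help to fail fast on HTTP errors. Here's a minimal sketch of the same idea with those two optional additions (they're not part of the original snippet):

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
r.raise_for_status()  # raise early on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly

print(soup.title)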
Our answered 31/7, 2014 at 20:8 Comment(0)

urllib3 returns a Response object, whose .data attribute contains the preloaded body payload.

Per the top quickstart usage example here, I would do something like this:

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data)  # Note the use of the .data property
...

The rest should work as intended.

--

A little about what went wrong in your original code:

You passed the entire response object rather than the body payload. This would normally be fine, because the response object is a file-like object, except that in this case urllib3 has already consumed the entire response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer that has already been read. .data, on the other hand, gives you access to the already-read data.
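
To make that concrete, here's a small sketch mirroring the transcript in the question (the variable names are just illustrative):

import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')

response.read()  # b'' -- the body was already consumed by preloading
response.data    # the preloaded bytes, e.g. b'<!DOCTYPE html>...'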

If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)  # We can pass the original `response` object now.

Now it should work as you expected.
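
One related caveat, based on urllib3's streaming documentation rather than anything specific to this thread: with preload_content=False you're reading from the live connection, so it's good practice to release it back to the pool once you're done:

response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response)
response.release_conn()  # return the connection to the pool once the body has been read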

I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).

--

A quick note on .urlopen vs .request:

.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters to the request, but in general .request will do all the extra work for you so it's more convenient.
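
As a rough sketch of that difference (httpbin.org here is just a stand-in test URL):

import urllib3
from urllib.parse import urlencode

http = urllib3.PoolManager()

# .request encodes the query parameters for you:
r1 = http.request('GET', 'http://httpbin.org/get', fields={'q': 'soup'})

# .urlopen leaves the encoding entirely to you:
r2 = http.urlopen('GET', 'http://httpbin.org/get?' + urlencode({'q': 'soup'}))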

If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!

Artifice answered 31/7, 2014 at 21:5 Comment(2)
I really appreciate your explanations; I admit I had no idea what content preloading was in exact terms. I'm new to Python and related tools; while I knew that URL params were often needed for more precise operations, I thought that urlopen was more basic and a standard/preferred method. : )Subgenus
No worries, your experience is useful feedback for me. :)Artifice

As shown, it's clear that urlopen() returns an HTTP response which is captured by the variable content…

What you've called content isn't the content, but a file-like object that you can read the content from. BeautifulSoup is perfectly happy taking such a thing, but it's not very helpful to print it out for debugging purposes. So, let's actually read the content out of it to make this easier to debug:

>>> response = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> response
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> content = response.read()
>>> content
b''

This should make it pretty clear that BeautifulSoup is not the problem here. But continuing on:

… but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (variable soup).

Yes it does. The fact that soup.title gave you None instead of raising an AttributeError is pretty good evidence, but you can test it directly:

>>> type(soup)
bs4.BeautifulSoup

That's definitely a BeautifulSoup object.

When you pass BeautifulSoup an empty string, exactly what you get back will depend on which parser it's using under the covers; you may get a completely empty document, or an html node with an empty head, an empty body, and nothing else. Either way, when you look for a title node, there isn't one, and you get None.
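
You can check this yourself by feeding BeautifulSoup an empty string directly (a quick sketch using the stdlib html.parser):

>>> from bs4 import BeautifulSoup
>>> empty = BeautifulSoup('', 'html.parser')
>>> empty.title is None
True
>>> empty.get_text()
''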


So, how do you fix this?

As the documentation says, you're using "the lowest level call for making a request, so you’ll need to specify all the raw details." What are those raw details? Honestly, if you don't already know, you shouldn't be using this method. Teaching you how to deal with the under-the-hood details of urllib3 before you even know the basics would not be doing you a service.

In fact, you really don't need urllib3 here at all. Just use the modules that come with Python:

>>> # on Python 2.x, instead do: from urllib2 import urlopen 
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.crummy.com/software/BeautifulSoup/')
>>> soup = BeautifulSoup(r)
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'
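
If you're on a recent Beautiful Soup, the same idea works with an explicit parser and the response closed via a with block (a sketch, not part of the original transcript):

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> with urlopen('http://www.crummy.com/software/BeautifulSoup/') as r:
...     soup = BeautifulSoup(r, 'html.parser')
...
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'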
Necrolatry answered 31/7, 2014 at 19:56 Comment(6)
Thanks, but when I tried further parsing, I didn't get anything from calls like soup.find_all(True) and soup.get_text(), so I was confused.Subgenus
@user3885774: That's what my last paragraph explains: you may have an empty soup, or a soup with just an html node with an empty head and body, but it really doesn't matter; there's no useful data, so who cares exactly how that lack of useful data is represented?Necrolatry
urllib3 actually returns a file-like object, but it's consumed by default (this is not ideal, as I mentioned in my answer below, and I opened an issue). To fix that, pass preload_content=False as a request parameter.Artifice
@shazow: Or, more simply, just use r.data, which is where the preloaded content goes. Or, even more simply, don't use urllib3 if you don't need it and it's too complicated for you to find what you need in the docs…Necrolatry
@Necrolatry Or give the author of urllib3 feedback for how to make it not too complicated so that he can fix it. :) Or even more preferred, come help with improving it!Artifice
@shazow: Honestly, I've only ever really looked at urllib3 twice. Both times, I expected requests to be able to do something for me like magic, and it couldn't, so I looked under the covers, saw that urllib3 made it easy to do what I wanted, and wrote a patch to expose the behavior to requests. Both times, I didn't see anything to be unhappy about in urllib3, so I don't have any real suggestions for improving it. But I did reply to your #436.Necrolatry

My Beautiful Soup code was working in one environment (my local machine) and returning an empty list in another (an Ubuntu 14 server).

I resolved my problem by changing the installation; details in another thread:

Html parsing with Beautiful Soup returns empty list

Yeo answered 24/7, 2015 at 19:45 Comment(1)
Note that link-only answers are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference.Eolic
