What causes `None` results from BeautifulSoup functions? How can I avoid "AttributeError: 'NoneType' object has no attribute..." with BeautifulSoup?

Often when I try using BeautifulSoup to parse a web page, I get a None result from one of BeautifulSoup's functions, or else an AttributeError is raised.

Here are some self-contained examples (no Internet access is required, since the data is hard-coded), based on an example in the documentation:

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... 
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... 
... <p class="story">...</p>
... """
>>> 
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.sister)
None
>>> print(soup.find('a', class_='brother'))
None
>>> print(soup.select_one('a.brother'))
None
>>> soup.select_one('a.brother').text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

I know that None is a special value in Python and that NoneType is its type; but... now what? Why do I get these results, and how can I handle them properly?


This question is specifically about BeautifulSoup methods that look for a single result (like .find). If you get a None result from a method like .find_all that normally returns a list, that may be due to a problem with the HTML parser. See Python Beautiful Soup 'NoneType' object error for details.

Automatize answered 26/3, 2023 at 5:6

Overview

In general, there are two kinds of queries offered by BeautifulSoup: ones that look for a single, specific element (tag, attribute, text, etc.), and ones that look for every element that meets the requirements.

For the latter group - the ones like .find_all that can give multiple results - the return value will be a list. If there weren't any results, then the list is simply empty. Nice and simple.

However, for methods like .find and .select_one that can only give a single result, if nothing matches in the HTML, the result will be None. BeautifulSoup will not directly raise an exception to explain the problem. Instead, an AttributeError will commonly occur in the code that follows, which tries to use the None inappropriately (because it expected to receive something else - typically, an instance of the Tag class that BeautifulSoup defines). The error happens because None simply doesn't support the operation; it's called an AttributeError because the . syntax means accessing an attribute of whatever is on the left-hand side. [TODO: once a proper canonical exists, link to an explanation of what attributes are and what AttributeError is.]

Examples

Let's consider the non-working code examples in the question one by one:

>>> print(soup.sister)
None

This tries to look for a <sister> tag in the HTML (not a different tag that has a class, id or other such attribute equal to sister). There isn't one, so the result is None.

>>> print(soup.find('a', class_='brother'))
None

This tries to find an <a> tag that has a class attribute equal to brother, like <a href="https://example.com/bobby" class="brother">Bobby</a>. The document doesn't contain anything like that; none of the a tags have that class (they all have the sister class instead).

>>> print(soup.select_one('a.brother'))
None

This is another way to do the same thing as the previous example, with a different method. (Instead of passing a tag name and some attribute values, we pass a CSS query selector.) The result is the same.

>>> soup.select_one('a.brother').text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

Since soup.select_one('a.brother') returned None, this is the same as trying to do None.text. The error means exactly what it says: None doesn't have a text attribute to access. In fact, it doesn't have any "ordinary" attributes; the NoneType class only defines special methods like __str__ (which converts None to the string 'None', so that it looks like the text None when printed).
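
To avoid the AttributeError, one straightforward approach (a minimal sketch, reusing the soup from the question; the link name is arbitrary) is to check the result for None before using it:

>>> link = soup.select_one('a.brother')
>>> if link is not None:
...     print(link.text)
... else:
...     print('no matching tag was found')
... 
no matching tag was found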

Automatize answered 26/3, 2023 at 5:6

Common issues with real-world data

Of course, with a small example of hard-coded text, it's clear why certain calls to find etc. fail - the content simply isn't there, and that's immediately obvious just from reading a few lines of data. Any attempt to debug such code should start by carefully checking for typos:

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... 
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... 
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.find('a', class_='sistre')) # note the typo
None
>>> print(soup.find('a', class_='sister')) # corrected
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In the real world, however, web pages can easily span many kilobytes or even megabytes of text, so that kind of visual inspection isn't practical. In general, for more complex tasks, it's worth taking the time first to check if a given webpage provides an API to access data, rather than scraping it out of page content. Many websites are happy to provide the data directly, in a format that's easier to work with (because it's specifically designed to be worked with as data, rather than to fill in the blanks of a "template" web page).

As a rough overview: an API consists of endpoints - URIs that can be directly accessed in the same way as web page URLs, but the response is something other than a web page. The most common format by far is JSON, although it's possible to use any data format depending on the exact use case - for example, a table of data might be returned as CSV. To use a standard JSON endpoint, write code that figures out the exact URI to use, load it normally, read and parse the JSON response, and proceed with that data. (In some cases, an "API key" will be necessary; a few companies use these to bill for premium data access, but it's usually just so that the information requests can be tied to a specific user.)
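
As a rough sketch of what that looks like in Python (assuming the requests library; the endpoint URL and parameters below are made up for illustration):

import requests

# Hypothetical endpoint; a real API's documentation specifies the actual URI and parameters.
response = requests.get('https://api.example.com/v1/staff', params={'page': 1})
response.raise_for_status()  # raise an exception for HTTP error responses
data = response.json()       # parse the JSON body into ordinary Python lists/dicts
# From here, work with `data` directly - no HTML parsing is involved.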

Normally this is much easier than anything that could be done with BeautifulSoup, and will save on bandwidth as well. Companies that offer publicly documented APIs for their web pages want you to use them; it's generally better for everyone involved.

All of that said, here are some common reasons why the web response being parsed by BeautifulSoup either doesn't contain what it's expected to, or is otherwise not straightforward to process.

Dynamically (client-side) generated content

Keep in mind that BeautifulSoup processes static HTML, not JavaScript. It can only use data that would be seen when visiting the webpage with JavaScript disabled.

Modern webpages commonly generate a lot of the page data by running JavaScript in the client's web browser. In typical cases, this JavaScript code will make more HTTP requests to get data, format it, and effectively edit the page (alter the DOM) on the fly. BeautifulSoup cannot handle any of this. It sees the JavaScript code in the web page as just more text.

To scrape a dynamic website, consider using Selenium to emulate interacting with the web page.
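
A rough sketch of that approach (assuming Selenium and a compatible browser driver are installed; recent Selenium versions can fetch the driver automatically):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()          # starts a real (automated) browser
driver.get('https://example.com/')   # the browser runs the page's JavaScript
html = driver.page_source            # the HTML after client-side scripts have run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')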

Alternatively, investigate what happens when using the site normally. Typically, the JavaScript code on the page will make calls to API endpoints, which can be seen on the "Network" (or similarly-named) tab of a web browser's developer console. This can be a great hint for understanding the site's API, even if it isn't easy to find good documentation.

User-agent checks

Every HTTP request includes headers that provide information to the server to help the server handle the request. These include information about caches (so the server can decide whether it can use a cached version of the data), acceptable data formats (so the server can e.g. apply compression to the response to save on bandwidth), and about the client (so the server can tweak the output to look right in every web browser).

The last part is handled by the User-Agent header. However, by default, HTTP libraries (like urllib and requests) generally don't claim to be any web browser at all - which, on the server end, is a big red flag for "this user is running a program to scrape web pages, and is not actually using a web browser".

Most companies don't like that very much. They would rather have you see the actual web page (including ads). So, the server may simply send some kind of dummy page (or an HTTP error) instead. (Note: this might include a "too many requests" error, which would otherwise suggest a rate limit as described in the next section.)

To work around this, set the User-Agent header in whatever way the HTTP library in use provides.
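
For example, with requests (a minimal sketch; the User-Agent string is just an illustrative, browser-like value, not a required one):

import requests
from bs4 import BeautifulSoup

# Any realistic browser User-Agent string will do; this one is only an example.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'}
response = requests.get('https://example.com/', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')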

Rate limits

Another telltale sign of a "bot" is that the same user is requesting multiple web pages as fast as the internet connection will allow, or not even waiting for one page to finish loading before asking for another one. The server tracks who is making requests by IP (and possibly by other "fingerprinting" information) even when logins are not required, and may simply deny page content to someone who is requesting pages too quickly.

Limits like this will usually apply equally to an API (if available) - the server is protecting itself against denial of service attacks. So generally the only work-around will be to fix the code to make requests less frequently, for example by pausing the program between requests.
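
A minimal sketch of that idea (the URLs and the delay are placeholders; an appropriate delay depends on the site's policies):

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in urls:
    response = requests.get(url)
    # ... process the response ...
    time.sleep(5)  # wait a few seconds before making the next request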

See for example How to avoid HTTP error 429 (Too Many Requests) python.

Login required

This is pretty straightforward: if the content is normally only available to logged-in users, then the scraping script will have to emulate whatever login procedure the site uses.
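
A very rough sketch of a simple form-based login (the URL and form field names are invented; real sites often add CSRF tokens or use entirely different login flows):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # a Session remembers cookies between requests
# Hypothetical form-based login endpoint and field names.
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})
# Later requests through the same session are made as the logged-in user.
response = session.get('https://example.com/members-only')
soup = BeautifulSoup(response.text, 'html.parser')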

Server-side dynamic/randomized names

Keep in mind that the server decides what to send for every request. It doesn't have to be the same thing every time, and it doesn't have to correspond to any actual files in the server's permanent storage.

For example, it could include randomized class names or IDs generated on the fly, that could potentially be different every time the page is accessed. Trickier yet: because of caching, the name could appear to be consistent... until the cache expires.

If a class name or ID in the HTML source seems to have a bunch of meaningless junk characters in it, consider not relying on that name staying consistent - think of another way to identify the necessary data. Alternatively, it might be possible to figure out a tag ID dynamically, by seeing how some other tag in the HTML refers to it.
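
For example (a sketch using a made-up class name with a random-looking suffix), BeautifulSoup can match on a stable part of the name rather than the whole thing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="price-x8f3q2">$10</div>', 'html.parser')

# Match any div whose class starts with "price-", instead of relying on the
# full, possibly auto-generated name.
print(soup.find('div', class_=lambda c: c and c.startswith('price-')).text)  # $10

# The same idea with a CSS attribute selector ("class attribute begins with"):
print(soup.select_one('div[class^="price-"]').text)  # $10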

Irregularly structured data

Suppose for example that a company web site's "About" page displays contact information for several key staff members, with a <div class="staff"> tag wrapping each person's info. Some of them list an email address, and others do not; when the address isn't listed, the corresponding tag is completely absent, rather than just not having any text:

soup = BeautifulSoup("""<html>
<head><title>Company staff</title></head><body>
<div class="staff">Name: <span class="name">Alice A.</span> Email: <span class="email">[email protected]</span></div>
<div class="staff">Name: <span class="name">Bob B.</span> Email: <span class="email">[email protected]</span></div>
<div class="staff">Name: <span class="name">Cameron C.</span></div>
</body>
</html>""", 'html.parser')

Trying to iterate and print each name and email will fail, because of the missing email:

>>> for staff in soup.select('div.staff'):
...     print('Name:', staff.find('span', class_='name').text)
...     print('Email:', staff.find('span', class_='email').text)
... 
Name: Alice A.
Email: [email protected]
Name: Bob B.
Email: [email protected]
Name: Cameron C.
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

This is simply an irregularity that has to be expected and handled.
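
For example (reusing the soup above), one straightforward way to handle it is to check each lookup for None before using the result:

>>> for staff in soup.select('div.staff'):
...     name = staff.find('span', class_='name')
...     email = staff.find('span', class_='email')
...     print('Name:', name.text if name else '(unknown)')
...     print('Email:', email.text if email else '(not listed)')
... 
Name: Alice A.
Email: [email protected]
Name: Bob B.
Email: [email protected]
Name: Cameron C.
Email: (not listed)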

However, depending on the exact requirements, there may be more elegant approaches. If the goal is simply to collect all email addresses (without worrying about names), for example, we might first try code that processes the child tags with a list comprehension:

>>> [staff.find('span', class_='email').text for staff in soup.select('div.staff')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'NoneType' object has no attribute 'text'

We could work around the problem by instead getting a list of emails for each name (which will have either 0 or 1 element), and using a nested list comprehension designed for a flat result:

>>> [email.text for staff in soup.select('div.staff') for email in staff.find_all('span', class_='email')]
['[email protected]', '[email protected]']

Or we could simply use a better query:

>>> # maybe we don't need to check for the div tags at all?
>>> [email.text for email in soup.select('span.email')]
['[email protected]', '[email protected]']
>>> # Or if we do, use a fancy CSS selector:
>>> # look for the span anywhere inside the div
>>> [email.text for email in soup.select('div.staff span.email')]
['[email protected]', '[email protected]']
>>> # require the div as an immediate parent of the span
>>> [email.text for email in soup.select('div.staff > span.email')]
['[email protected]', '[email protected]']

Invalid HTML "corrected" by the browser

HTML is complicated, and real-world HTML is often riddled with typos and minor errors that browsers gloss over. Nobody would use a pedantic browser that just popped up an error message if the page source wasn't 100% perfectly standards-compliant (both to begin with, and after each JavaScript operation) - because such a huge fraction of the web would just disappear from view.

BeautifulSoup allows for this by letting the HTML parser handle it, and letting the user choose an HTML parser if others are installed besides the standard library one. Web browsers, on the other hand, have their own HTML parsers built in, which might be far more lenient, and also take much more heavyweight approaches to "correcting" errors.

In one example, the OP's browser showed a <tbody> tag inside a <table> in its "Inspect Element" view, even though that was not present in the actual page source. The HTML parser used by BeautifulSoup, on the other hand, did not; it simply accepted having <tr> tags nested directly within a <table>. Thus, the corresponding Tag object that BeautifulSoup created to represent the table reported None for its tbody attribute.

Typically, problems like this can be worked around by searching within a subsection of the soup (e.g. by using a CSS selector), rather than trying to "step into" each nested tag. This is analogous to the problem of irregularly structured data.
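
A small sketch of the situation (the table markup is made up): with html.parser, no <tbody> is inserted, so stepping into it gives None, while a CSS selector that doesn't mention it still finds the rows:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<table><tr><td>1</td></tr><tr><td>2</td></tr></table>', 'html.parser')

print(soup.table.tbody)         # None - the parsed document has no <tbody>
print(soup.select('table tr'))  # [<tr><td>1</td></tr>, <tr><td>2</td></tr>]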

Not HTML at all

Since it comes up sometimes, and is also relevant to the caveat at the top: not every web request will produce a web page. An image, for example, can't be processed with BeautifulSoup; it doesn't even represent text, let alone HTML. Less obviously, a URL that has something like /api/v1/ in the middle is most likely intended as an API endpoint, not a web page; the response will probably be JSON data, not HTML. BeautifulSoup is not an appropriate tool for parsing this data.

Modern web browsers will commonly generate a "wrapper" HTML document for such data. For example, if I view an image on Imgur, with the direct image URL (not one of Imgur's own "gallery" pages), and open my browser's web-inspector view, I'll see something like (with some placeholders substituted in):

<html>
    <head>
        <meta name="viewport" content="width=device-width; height=device-height;">
        <link rel="stylesheet" href="resource://content-accessible/ImageDocument.css">
        <link rel="stylesheet" href="resource://content-accessible/TopLevelImageDocument.css">
        <title>[image name] ([format] Image, [width]×[height] pixels) — Scaled ([scale factor])</title>
    </head>
    <body>
        <img src="[url]" alt="[url]" class="transparent shrinkToFit" width="[width]" height="[height]">
    </body>
</html>

For JSON, a much more complex wrapper is generated - which is actually part of how the browser's JSON viewer is implemented.

The important thing to note here is that BeautifulSoup will not see any such HTML when the Python code makes a web request - the request was never filtered through a web browser, and it's the local browser that creates this HTML, not the remote server.

Automatize answered 30/3, 2023 at 6:2
