Difference between lxml and html5lib in the context of BeautifulSoup

Is there a difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn to use BS4 and am using the following code:

import requests
from bs4 import BeautifulSoup

ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a', href=True):  # skip anchors that have no href
    print(item['href'])

I started out using lxml as the parser, but noticed that for some websites the for loop is never entered even though there are valid links in the page. The same page works with the html5lib parser. Are there specific types of pages that might not work with lxml?

I am on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3

EDIT: I updated to lxml 3.1.2 and still see the same issue. On a Mac running 3.0.x, though, the same page is parsed properly. The website in question is www.olivegarden.com

Saraband answered 3/9, 2013 at 0:44 Comment(1)
You can use the html5lib parser and the BeautifulSoup parser within lxml. See lxml.de/elementsoup.html and lxml.de/html5parser.html - Archil

html5lib implements the HTML parsing algorithm defined in the HTML spec, the same algorithm implemented in all major browsers. lxml uses libxml2's HTML parser, which is ultimately based on libxml2's XML parser, and its error handling for invalid HTML does not match what any other parser does.

Most web developers only test with web browsers (standards be damned), so if you want to get what the page's author intended, you'll likely need to use something like html5lib that matches current browsers.
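
To see how much the parsers disagree on a given page, one option is to feed the same markup to each of them and compare the links they recover. The following is only a minimal sketch: the helper name `links_per_parser` and the `BROKEN_HTML` sample are made up for illustration, and `html.parser` (Python's built-in parser, not mentioned above) is included just as a third point of comparison. Parsers that are not installed are reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup, invented for this sketch
# (it is not taken from the site in the question).
BROKEN_HTML = (
    "<p>Menu<a href='/menu'>See the menu"
    "<p>Find us<a href='/locations'>Locations"
)


def links_per_parser(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: list of hrefs}, or None for a parser that is not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            results[name] = None
            continue
        # href=True skips anchors without an href attribute.
        results[name] = [a["href"] for a in soup.find_all("a", href=True)]
    return results


if __name__ == "__main__":
    for name, hrefs in links_per_parser(BROKEN_HTML).items():
        print(name, hrefs)
```

If one parser returns an empty list for a real page while another does not, that is a hint the page contains markup that the parsers repair differently. Passing `ret.text` from the question in place of `BROKEN_HTML` would run the same comparison on the live page.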

Alliaceous answered 4/9, 2013 at 17:11 Comment(0)

You can remove lxml altogether:

pip uninstall lxml
Yung answered 27/11, 2023 at 20:25 Comment(2)
Please read How to Answer. In particular, please make sure that the answers you post actually address the question being asked. - Bast
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review - Costermansville
