How do I parse an HTML page with Node.js?

Asked 10/9, 2011 at 16:18 Answered 17/11, 2020 at 18:54

113

I need to parse (server side) a large number of HTML pages.
We all agree that regexp is not the way to go here.
It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM abilities JavaScript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?

Messeigneurs answered 10/9, 2011 at 16:18 Comment(0)

110

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options include:

BeautifulSoup for Python
Converting your HTML to XHTML and using XSLT
HTMLAgilityPack for .NET
CsQuery for .NET (my new favorite)
The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.

Chromosome answered 10/9, 2011 at 16:24 Comment(8)

What do you mean by good? Reliable, fast, easy? Well with these two, it is robust enough so that you can use jQuery serverside if you wanted to. – Chromosome 10/9, 2011 at 16:29

@Chromosome Reliable and easy are more important to me than whether the process ends in one hour or one day. – Messeigneurs 10/9, 2011 at 21:25

I would say that the node option is reliable and is definitely easy if you are already used to the DOM. – Chromosome 11/9, 2011 at 20:22

If you shoot for htmlparser, try going with github.com/fb55/node-htmlparser first. It seems to be a reworked version and is more actively maintained. – Seibel 16/9, 2012 at 19:31

I searched all over the internet but cannot find a good tutorial for htmlparser.. – Hygeia 17/5, 2013 at 1:29

jsdom.env is not a function – Sevenup 20/10, 2017 at 19:4

I personally feel jsdom is easier – Egoism 6/3, 2019 at 12:12

Other options, old link on answer, use this instead: pip install beautifulsoup4 – Ramekin 16/1, 2023 at 14:40

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, uses the jQuery selectors you already know.

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.

Carolanncarole answered 12/11, 2013 at 16:36 Comment(5)

But doesn't build DOM and doesn't allow XPath. jQuery syntax is surely a downside of that library. – Chronon 22/9, 2014 at 6:16

@Chronon in my experience very few applications require full DOM parsing, and building the DOM is very expensive compared to the fast "lazy" evaluation in jQuery/Cheerio. In this sense jQuery-style parsing is a benefit, but if your application requires manipulating the DOM server-side you might prefer to try jsdom. – Carolanncarole 22/9, 2014 at 12:1

jsdom is too slow for that :/ – Chronon 22/9, 2014 at 13:28

@MohamedMansour for what it's worth we're using Cheerio in production and scraping thousands of pages in a few seconds. "fast" and "slow" are all relative to your application and bandwidth of course. – Carolanncarole 4/2, 2016 at 14:10

Non-strict: +1. jQuery syntax: +1. – Disassemble 15/12, 2019 at 21:52

November 2020 Update

I searched for the top NodeJS html parser libraries.

Because my use cases didn't require a library with many features, I could focus on stability and performance.

By stability I mean that the library has been used by the community long enough for bugs to surface, that it is still maintained, and that open issues get closed.

It's hard to predict the future of an open source library, but I put together a small summary based on the top 10 libraries on Openbase.

I divided them into 2 groups according to the last commit (within each group, the order is by GitHub stars):

Last commit is in the last 6 months:

Name	Last commit	Open Issues	Github stars
jsdom	3 Months	331	14.9K
htmlparser2	8 days	2	2.7K
parse5	2 Months	2	2.5K
swagger-parser	2 Months	48	663
html-parse-stringify	4 Months	3	215
node-html-parser	7 days	15	205

Last commit is 6 months and above:

Name	Last commit	Open Issues	Github stars
cheerio	1 year	174	22.9K
koa-bodyparser	6 months	9	1.1K
sax-js	3 Years	65	941
draftjs-to-html	1 Year	27	233

I picked node-html-parser because it seems quite fast and very active at the moment.

(*) Openbase provides much more information about each library, such as the number of contributors (with 3+ commits), weekly downloads, monthly commits, version, etc.

(**) The tables above are a snapshot taken at a specific time and date. I would check the reference again, starting with the level of recent activity and then diving into the smaller details.
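For anyone who wants to redo this comparison with fresh numbers, the grouping described above can be sketched in a few lines. This is only an illustration: the field names (`ageDays`, `openIssues`, `stars`) and the helper `groupByActivity` are my own, and the day counts are rough conversions of the "Last commit" column (1 month taken as about 30 days), not exact values.

```javascript
// Snapshot rows copied from the two tables above. "ageDays" is an approximate
// conversion of the "Last commit" column (1 month ~ 30 days), an assumption
// made only so the rows can be compared numerically.
const snapshot = [
  { name: "cheerio", ageDays: 365, openIssues: 174, stars: 22900 },
  { name: "jsdom", ageDays: 90, openIssues: 331, stars: 14900 },
  { name: "htmlparser2", ageDays: 8, openIssues: 2, stars: 2700 },
  { name: "parse5", ageDays: 60, openIssues: 2, stars: 2500 },
  { name: "koa-bodyparser", ageDays: 180, openIssues: 9, stars: 1100 },
  { name: "sax-js", ageDays: 1095, openIssues: 65, stars: 941 },
  { name: "swagger-parser", ageDays: 60, openIssues: 48, stars: 663 },
  { name: "draftjs-to-html", ageDays: 365, openIssues: 27, stars: 233 },
  { name: "html-parse-stringify", ageDays: 120, openIssues: 3, stars: 215 },
  { name: "node-html-parser", ageDays: 7, openIssues: 15, stars: 205 },
];

const SIX_MONTHS_IN_DAYS = 180;

// Split the libraries into "recent" (last commit under the cutoff) and
// "older" (at or above the cutoff), then order each group by stars, highest first.
function groupByActivity(libraries, cutoffDays = SIX_MONTHS_IN_DAYS) {
  const byStarsDesc = (a, b) => b.stars - a.stars;
  return {
    // filter() returns a new array, so sort() does not mutate the input.
    recent: libraries.filter((lib) => lib.ageDays < cutoffDays).sort(byStarsDesc),
    older: libraries.filter((lib) => lib.ageDays >= cutoffDays).sort(byStarsDesc),
  };
}

const groups = groupByActivity(snapshot);
console.log("recent:", groups.recent.map((lib) => lib.name).join(", "));
console.log("older:", groups.older.map((lib) => lib.name).join(", "));
```

Swapping in current numbers and a different cutoff is a quick way to rerun the same first-pass filter before looking at the smaller details.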

Kindig answered 17/11, 2020 at 18:54 Comment(2)

I love this answer because it highlights the unavoidable process node devs have to go through to vet the large number of available nearly-duplicate modules out there. Here's a 2023 survey I did. The only constant is change. – Measures 2/3, 2023 at 19:56

@Rotem jackoby, I've added some pretty tables for you :) – Cramer 6/3 at 2:36

Use htmlparser2, it's way faster and pretty straightforward. Consult this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And the live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

Swanskin answered 28/11, 2014 at 12:4 Comment(1)

How to get the exact kind of output, that one gets in this demo? – Sevenup 20/10, 2017 at 19:4

Htmlparser2 by FB55 seems to be a good alternative.

Parsnip answered 20/4, 2013 at 18:9 Comment(4)

And what should one do with this return format? Write a bunch of for loops and tree traversals? – Chronon 22/9, 2014 at 6:18

You can register to open/close tag events, so depending on what you want, this is a really good alternative imho. – Standee 4/5, 2015 at 19:20

@Chronon There is also domutils package by the same author that works with the format returned by htmlparser2 - it has lots of methods, some of which have the same syntax as DOM methods, some are different; you won't really need to traverse the object manually. No docs there, but the source code is super clear - it all works as you would expect. – Parsnip 4/5, 2015 at 19:50

Not yet, but what stops you from extending it? It's not that difficult using the functions it already has. – Parsnip 9/5, 2015 at 9:37
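For readers wondering what "registering to open/close tag events" looks like in practice, here is a minimal sketch of the callback style. It is not htmlparser2 itself: `scanTags`, `emitText`, and `collectText` are hypothetical helpers written for illustration, and the scanner is deliberately naive (it ignores comments, script bodies, and a `>` inside attribute values). Only the handler names `onopentag`, `ontext`, and `onclosetag` are borrowed from htmlparser2's callback interface.

```javascript
// Toy event-driven tag scanner. NOT htmlparser2, only an illustration of the
// callback style. It does not handle comments, <script> bodies, or '>' inside
// attribute values, so do not use it on real-world markup.
function scanTags(html, handlers) {
  let pos = 0;
  while (pos < html.length) {
    const lt = html.indexOf("<", pos);
    if (lt === -1) {
      emitText(html.slice(pos), handlers); // trailing text after the last tag
      break;
    }
    emitText(html.slice(pos, lt), handlers);
    const gt = html.indexOf(">", lt);
    if (gt === -1) break; // unterminated tag: stop quietly
    const inner = html.slice(lt + 1, gt).trim();
    if (inner.startsWith("/")) {
      const name = inner.slice(1).trim().toLowerCase();
      if (handlers.onclosetag) handlers.onclosetag(name);
    } else {
      const selfClosing = inner.endsWith("/");
      const name = inner.split(/[\s/]/)[0].toLowerCase();
      if (handlers.onopentag) handlers.onopentag(name);
      if (selfClosing && handlers.onclosetag) handlers.onclosetag(name);
    }
    pos = gt + 1;
  }
}

// Report text only when it contains something other than whitespace.
function emitText(text, handlers) {
  if (text.trim() !== "" && handlers.ontext) handlers.ontext(text);
}

// Example consumer: collect the text found inside every <tagName> element
// by tracking nesting depth in the open/close handlers.
function collectText(html, tagName) {
  const found = [];
  let depth = 0;
  scanTags(html, {
    onopentag(name) {
      if (name === tagName) depth++;
    },
    ontext(text) {
      if (depth > 0) found.push(text.trim());
    },
    onclosetag(name) {
      if (name === tagName && depth > 0) depth--;
    },
  });
  return found;
}

const titles = collectText("<ul><li>jsdom</li><li>cheerio</li></ul><p>other</p>", "li");
console.log(titles);
```

The point is that no tree is ever built: the consumer keeps only the state it needs (here a depth counter), which is why this style suits large batches of pages. A real parser takes care of the edge cases this toy skips.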

jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.

node-soupselect is a Node.js port of soupselect, the CSS-selector add-on for Python's BeautifulSoup, and it works beautifully.

Dufrene answered 24/8, 2013 at 11:40 Comment(0)
