What technology for large scale scraping/parsing? [closed]

We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this at large scale (tens of millions of pages)?

We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.

So far we have been using (don't laugh) PHP, curl, and Simple HTML DOM Parser, but I don't think that's scalable to millions of pages, especially as PHP doesn't have proper multithreading.

We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can download millions of webpages in a reasonable amount of time. We're not really looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page on a list.

Axenic answered 29/6, 2010 at 17:50 Comment(0)

If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, this is probably the language I'd use, thanks to libs like httplib2 for making the requests and lxml for parsing the results.

If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
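For what it's worth, here is a minimal sketch of that combination (httplib2 for the requests, lxml for the parsing, a multiprocessing pool for parallelism, pymongo for storage). The URL list, the target XPath, and the Mongo database/collection names are placeholders, not anything from the question:

```python
# Rough sketch, not production code: httplib2 + lxml + multiprocessing + pymongo.
from multiprocessing import Pool

import httplib2
from lxml import html
from pymongo import MongoClient

TARGET_XPATH = "//title/text()"  # hypothetical: whatever single tag you need

def fetch_and_extract(url):
    """Download one page and pull out the target tag's text."""
    try:
        http = httplib2.Http(timeout=30)
        response, content = http.request(url, "GET")
        if response.status != 200:
            return {"url": url, "error": "HTTP %s" % response.status}
        doc = html.fromstring(content)
        values = doc.xpath(TARGET_XPATH)
        return {"url": url, "value": values[0] if values else None}
    except Exception as exc:  # one bad page shouldn't take down the whole pool
        return {"url": url, "error": str(exc)}

if __name__ == "__main__":
    urls = open("urls.txt").read().splitlines()       # your list of pages
    with Pool(processes=16) as pool:                  # tune to your hardware
        results = pool.map(fetch_and_extract, urls, chunksize=100)
    MongoClient().scraper.pages.insert_many(results)  # placeholder db/collection
```

With tens of millions of pages you would run many copies of this script across machines (or move to the Hadoop setup above), but the per-worker logic stays the same.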

UPDATE: If you don't want a MapReduce framework, and you prefer a different language, check out the ThreadPoolExecutor in Java. I would definitely use the Apache Commons HTTP client stuff, though. The stuff in the JDK proper is way less programmer-friendly.

Miff answered 29/6, 2010 at 18:0 Comment(0)

You should probably use tools used for testing web applications (WatiN or Selenium).

You can then compose your workflow, kept separate from the data, using a tool I've written.

https://github.com/leblancmeneses/RobustHaven.IntegrationTests

You shouldn't have to do any manual parsing when using WatiN or Selenium; instead you write a CSS selector (querySelector).

Using TopShelf and NServiceBus, you can scale the number of workers horizontally.

FYI: with Mono, the tools I mention can run on Linux (although your mileage may vary).

If JavaScript doesn't need to be evaluated to load data dynamically: anything that requires the whole document to be loaded into memory is going to waste time. If you know where your tag is, all you need is a SAX-style parser.
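To illustrate the point (in Python rather than .NET, purely as a sketch): an event-driven parser can pull one known tag out of the stream without ever building a DOM. The tag name below is a placeholder.

```python
from html.parser import HTMLParser  # event-driven, SAX-style: no DOM is built

class SingleTagExtractor(HTMLParser):
    """Capture the text of the first occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.capturing = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.value is None:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing and data.strip():
            self.value = data.strip()
            self.capturing = False

# Usage (tag name is illustrative):
#   extractor = SingleTagExtractor("title")
#   extractor.feed(page_html)
#   print(extractor.value)
```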

Nephelometer answered 4/5, 2012 at 4:27 Comment(1)
By the way, NServiceBus provides distribution, persistence, security, transactions, and reliability for queued work. Sample: github.com/leblancmeneses/NWebHooks - Nephelometer

I do something similar using Java with the Commons HttpClient library, although I avoid a DOM parser because I'm looking for a specific tag that can be found easily with a regex.

The slowest part of the operation is making the HTTP requests.
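The answer is about Java, but the regex idea is language-agnostic; as a rough sketch in Python (the tag and pattern are purely illustrative):

```python
import re

# Hypothetical target: the contents of a <span id="price"> tag.
TAG_PATTERN = re.compile(r'<span id="price">(.*?)</span>', re.DOTALL)

def extract_tag(page_html):
    """Return the first match of the target tag, or None if it is absent."""
    match = TAG_PATTERN.search(page_html)
    return match.group(1).strip() if match else None
```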

Towboat answered 29/6, 2010 at 17:54 Comment(0)

What about C++? There are many large-scale libraries that can help you.

Boost.Asio can handle the networking.

TinyXML can parse XML files.

I have no idea about the database side, but almost all databases have C++ interfaces, so it is not a problem.

Hornwort answered 9/5, 2012 at 9:50 Comment(0)
