How is an aggregator built? [closed]
Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or whatever). How would I do that?

Should I have a spider/crawler that crawls the web to find the information I need (and how would I tell the crawler what to crawl, since I don't want to fetch the whole web)? Then have an indexing system to index and organize the information I crawled, and also act as a search engine?
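
To make this concrete, I imagine scoping the crawler with something like the following rough Python sketch, where the domain allowlist and niche keywords are just placeholders, not a real design:

```python
from urllib.parse import urlparse

# Placeholder seed domains and niche keywords -- the real lists would
# come from whatever niche the aggregator targets.
ALLOWED_DOMAINS = {"travelblog.example.com", "flightnews.example.org"}
NICHE_KEYWORDS = {"travel", "flight", "hotel", "itinerary"}

def should_crawl(url, page_title=""):
    """Return True only for URLs on an allowed domain whose title
    (when known) mentions at least one niche keyword."""
    host = urlparse(url).netloc.lower()
    if host not in ALLOWED_DOMAINS:
        return False
    if page_title and not any(k in page_title.lower() for k in NICHE_KEYWORDS):
        return False
    return True
```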

Are systems like Nutch (lucene.apache.org/nutch) suitable for what I want? Do you recommend something else?

Or can you recommend another approach?

For example, how is Techmeme.com built? (It's an aggregator of technology news, and it's completely automated; only recently did they add some human intervention.) What would it take to build such a service?

Or how does Kayak.com aggregate its data? (It's a travel aggregation service.)

Redshank asked 29/5, 2009 at 22:36

This all depends on the aggregator you are looking for.

Types:

  • Loosely defined - Generally this requires your data source to be very flexible about determining the type of information gathered (it answers the question: is this site/information travel related? Humour? Business related?)
  • Specific - This relaxes that flexibility requirement: the data store can assume all of the data is specifically travel related, with fields for flights, hotel prices, etc.

Typically an aggregator is a system of sub-programs (a rough sketch follows the list):

  1. Grabber - searches for and grabs all of the content that needs to be summarized
  2. Summarization - typically done through queries to the DB, and can be adjusted to user preferences [through programming logic]
  3. View - formats the information the way the user wants to see it, and can respond to feedback on the user's likes or dislikes of the items suggested.
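
Here's a minimal Python sketch of those three sub-programs, assuming the sources are RSS feeds; the feed URLs and the keyword-count scoring rule are hypothetical stand-ins, and it requires the third-party feedparser package:

```python
import feedparser  # pip install feedparser

FEEDS = ["https://example.com/tech.rss", "https://example.org/news.rss"]

def grab(feeds):
    """1. Grabber: fetch and collect raw items from every source."""
    items = []
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            items.append({"title": entry.get("title", ""),
                          "link": entry.get("link", ""),
                          "summary": entry.get("summary", "")})
    return items

def summarize(items, preferred_words):
    """2. Summarization: rank items against user preferences."""
    def score(item):
        text = (item["title"] + " " + item["summary"]).lower()
        return sum(text.count(w) for w in preferred_words)
    return sorted(items, key=score, reverse=True)

def view(items, limit=10):
    """3. View: format the top items for display."""
    for item in items[:limit]:
        print(f"- {item['title']}\n  {item['link']}")

view(summarize(grab(FEEDS), {"python", "startup"}))
```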
Thermosetting answered 8/10, 2009 at 5:41

For a basic look - check out this: http://en.wikipedia.org/wiki/Aggregator

It will give you an overview of aggregators in general.

In terms of building your own aggregator: if you're looking for something out of the box that can get you the content YOU want, I'd suggest this: http://dailyme.com/

If you're looking for a codebase/architecture to BUILD your own aggregator service, I'd suggest looking at something straightforward, like Open Reddit from http://www.reddit.com/

Enterostomy answered 29/5, 2009 at 23:17

Comment from Redshank: Yes, I would want my own aggregator. Reddit is more of a Digg-style site, meaning users submit links and vote on them (Pligg and SocialWebCMS are also software that let you build something like Digg). What I want is more like Techmeme, where the news is gathered automatically and editors can rank items or feature them on the site if necessary.

You need to define what your application is going to do. Building your own web crawler is a huge task: you tend to keep adding new features as you find you need them, which only complicates your design.

Building an aggregator is much different. Whereas a crawler simply retrieves data to be processed later, an aggregator takes already-defined sets of data and puts them together. If you go the aggregator route, you will probably want to look for already-defined travel feeds, financial feeds, and so on. An aggregator is easier to build, IMO, but it's more constrained.
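
For instance, here's a minimal Python sketch of that aggregator approach: pull a handful of already-defined feeds and merge them into one deduplicated, date-sorted stream. The feed URLs are placeholders, and it assumes the third-party feedparser package:

```python
import feedparser  # pip install feedparser
from time import mktime

# Placeholder feed URLs -- in practice these are the "already defined
# sets of data" the answer describes.
TRAVEL_FEEDS = ["https://example.com/deals.rss", "https://example.net/fares.rss"]

def merged_stream(feed_urls):
    seen, items = set(), []
    for url in feed_urls:
        for e in feedparser.parse(url).entries:
            link = e.get("link")
            if not link or link in seen:   # drop duplicates across feeds
                continue
            seen.add(link)
            ts = mktime(e.published_parsed) if e.get("published_parsed") else 0.0
            items.append((ts, e.get("title", ""), link))
    return sorted(items, reverse=True)     # newest first

for ts, title, link in merged_stream(TRAVEL_FEEDS)[:20]:
    print(title, "->", link)
```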

If, instead, you want to build a crawler, you'll need to define starting pages, define ending conditions (crawl depth, time limits, etc.), and then still process the data afterwards (that is, aggregate, summarize, and so on).
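
A bare-bones sketch of such a crawler in Python (standard library only): seed pages plus two explicit ending conditions, a maximum depth and a wall-clock budget. The seed URL is a placeholder, and a real crawler would also respect robots.txt and rate limits:

```python
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_depth=2, time_budget_s=60):
    start, seen = time.monotonic(), set()
    frontier = [(url, 0) for url in seeds]     # breadth-first queue
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:   # depth-based stop
            continue
        if time.monotonic() - start > time_budget_s:  # time-based stop
            break
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        yield url, html                        # hand the page off for later processing
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend((urljoin(url, link), depth + 1) for link in parser.links)

for page_url, page_html in crawl(["https://example.com/"]):
    print("fetched", page_url, len(page_html), "bytes")
```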

Glow answered 4/8, 2010 at 0:20 Comment(0)
