How do craigslist mashups get data? [closed]
R

8

29

I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.

For example, www.housingmaps.com and the now-closed www.chicagocrime.org.

If there is a URL that can be used for reference, that would be perfect!

Recur answered 25/10, 2008 at 22:32 Comment(3)
Just wanted to add an update to this thread. It seems that in 2013, a federal judge found that circumventing an IP block (specifically one imposed by craigslist) violates the CFAA: en.wikipedia.org/wiki/… hic sunt dracones - Hurley
Sad, but true. Check out how Craigslist shut down (sort of) 3Taps: en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc. - Pictograph
Similar question from 2015: opendata.stackexchange.com/q/5883/1511 - Sarasvati
R
0

While continuing to research this area, I found an awesome site that does partly what I'm interested in:

Crazedlist

It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. The site also gives a clear example of a business need similar to mine, which is why I'm interested in this topic.

Recur answered 23/11, 2008 at 1:27 Comment(0)
T
15

For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.

For example, to extract the categories you could:

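// Scrape category data from the Craigslist home page.
// http is the custom screen scraping class mentioned above.
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";

if (!$h->fetch($url, 300)) {
  echo "<h2>There is a problem with the http request!</h2>";
  exit();
}

// We need all the category options (data looks like: <option value="ccc">community).
// Group 1 captures the abbreviation ("ccc"), group 2 the name ("community").
// [^\"]* keeps group 1 from running past the closing quote of the value attribute.
preg_match_all("/<option value=\"([^\"]*)\">([^`]*?)\n/", $h->body, $categoryTemp);

$catNames = $categoryTemp[2];

// Return the category names (the abbreviations are in $categoryTemp[1]),
// or an empty array if nothing matched.
if (count($catNames) > 0)
    return $catNames;
else
    return array();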
Tsarevna answered 28/10, 2008 at 23:42 Comment(0)
H
13

An alternative to scraping (and getting blocked), using frames, or Google search is to use a data broker or data exchange service.

3taps is a beta service which provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case of this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached so that it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.

80legs is a crawling service which provides a less real-time but potentially more comprehensive option. Their data dump-style service includes crawl packages for hundreds of sites including Amazon, Facebook, and Zillow (though not currently Craigslist, I believe). Their newer effort Datafiniti is providing a search engine over this type of data.

Hulk answered 11/12, 2011 at 16:31 Comment(0)
B
4

The problem with any scraping solution for craigslist is that they automatically block any IP address that accesses them 'too much' - which usually means more than a few hundred times a day. So as soon as your tool gained any kind of popularity, it would be shut down.

That's why the only craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or google (like allofcraigs.com).

What 3taps does is gather craigslist listings from third-party sources 'in the wild' - things like the Google and Bing caches, for example.

Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.

Blaine answered 27/1, 2010 at 7:5 Comment(6)
This still holds true. For those who don't recognize the logo, @Nathan Stretch wrote SearchTempest, the best Craigslist aggregator/search tool I'd seen. Used it to buy two cars, but didn't realize Craigslist nuked much of its efficacy a few years ago. :( - Pictograph
@Eirik hopefully not all of its efficacy! We go to significant effort tailoring our Google queries to return results that are as complete as possible. Better than any competitor, to the best of my knowledge. We also have a Direct mode that works similarly to how the iframes used to, except using separate windows since frames are no longer an option. - Blaine
@NathanStretch could you explain (or give some links to find out more) how the iframes used to work (for craigslist)? I am just curious about the technology that was used before (I know it's not working anymore). Thanks - Sidecar
@Sidecar it was pretty simple: basically we just opened each craigslist results page matching the user's search in an iframe, one for each city. We would show around 10 stacked on a page. And actually, we've come full circle somewhat and do something similar again, except that we open craigslist results pages each covering multiple cities, and just work out how to combine those regions to cover the user's search in as few pages as possible. Then instead of iframes, we simply open each link in a new tab. - Blaine
@NathanStretch And why did you use iframes in the first place? Just for convenience / UX, to put multiple results pages inside one browser window? - Sidecar
@Sidecar pretty much, yeah. Back then craigslist only allowed you to search one city at a time, so opening them all in separate tabs would mean something like 400 tabs for the entire US. It's much more reasonable now that it's ~25. - Blaine
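For anyone wondering what "combine those regions to cover the user's search in as few as possible" could look like, here is a rough Python sketch using a greedy set cover. To be clear, this is only my guess at the general technique, not SearchTempest's actual code. The region names, the city sets, and the `greedy_region_cover` function are all made up for illustration, and greedy selection is not guaranteed to find the true minimum.

```python
def greedy_region_cover(wanted_cities, regions):
    """Pick regions until every wanted city is covered (greedy set cover)."""
    remaining = set(wanted_cities)
    chosen = []
    while remaining:
        # Choose the region that covers the most still-uncovered cities.
        best = max(regions, key=lambda name: len(regions[name] & remaining))
        gained = regions[best] & remaining
        if not gained:
            raise ValueError(f"no region covers: {sorted(remaining)}")
        chosen.append(best)
        remaining -= gained
    return chosen


# Hypothetical regions, each standing for one multi-city results page.
REGIONS = {
    "pacific-nw": {"seattle", "portland", "spokane"},
    "bay-area": {"sfbay", "sacramento"},
    "west-coast": {"seattle", "portland", "sfbay", "losangeles"},
    "socal": {"losangeles", "sandiego"},
}

pages = greedy_region_cover({"seattle", "portland", "sfbay", "sandiego"}, REGIONS)
print(pages)
```

Each name in `pages` would then correspond to one link opened in a new tab.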
B
4

An alternative option would be to use YQL or Yahoo Pipes to gather the results.

Craiglook and HousingMaps are using them to gather results.

But answered 2/2, 2010 at 15:37 Comment(1)
Looks like Pipes is getting 403 Forbidden from CL these days. - Perrine
M
3

I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.

For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data, and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from Craigslist feeds, because the feeds show most of the data but not all. If your reliability doesn't need to be 100%, then RSS is the easiest way to do it.
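As a minimal sketch of the RSS approach in Python, assuming a plain RSS 2.0 layout: the sample feed, the example.org URLs, and the `parse_feed` helper below are all hypothetical, and the real Craigslist feeds may be structured differently. It parses an inline sample instead of hitting the network, so it is safe to run as-is.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be downloaded (sparingly) instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>example listings</title>
    <item>
      <title>2BR apartment near park</title>
      <link>http://example.org/apa/1.html</link>
      <description>Sunny two bedroom</description>
    </item>
    <item>
      <title>Studio downtown</title>
      <link>http://example.org/apa/2.html</link>
      <description>Small studio</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict (title, link, description) per <item> in the feed."""
    root = ET.fromstring(xml_text)
    listings = []
    for item in root.iter("item"):
        listings.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return listings


listings = parse_feed(SAMPLE_FEED)
for listing in listings:
    print(listing["title"], "->", listing["link"])
```

Because the feed only carries a summary of each listing, anything missing from `description` would still have to come from the listing page itself.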

Mellott answered 13/12, 2011 at 20:4 Comment(0)
F
2

I am guessing screen scraping.

I do not think there is a craigslist API yet, and I do not think they will release one.

So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want off a page.

If you see a link, access the page, scrape the new page, get the data, and show it or store it.

And so on.
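A rough Python sketch of the "find the links on a page" step, using the standard library's HTML parser rather than regex. The sample page, the example.org base URL, and the `LinkExtractor` class are assumptions for illustration only; real listing pages will have different markup, and the actual download step (cURL or similar) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, link_text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content standing in for a downloaded results page.
SAMPLE_PAGE = """
<html><body>
<p><a href="/apa/1.html">2BR apartment near park</a></p>
<p><a href="/apa/2.html">Studio downtown</a></p>
<p>No link here</p>
</body></html>
"""

parser = LinkExtractor("http://example.org/")
parser.feed(SAMPLE_PAGE)
links = parser.links
for url, text in links:
    print(url, text)
```

Each URL in `links` is a page you could then fetch and scrape in turn, which is the "and so on" part.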

Fredrika answered 25/10, 2008 at 22:33 Comment(3)
-1 NEVER EVER use RegEx to parse XML - Broadcloth
Parse XML in Rails with the nokogiri gem. - Brisco
@unknwntech: the xhtml on a craigslist page is extraordinarily simple. You're right that you don't use regex to parse XML, but in this case you don't have to. You're just pulling out specific items on the page, which is faster than using a full-blown XML parser. - Anzus
B
2

I just made one:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.js

That produces:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.html

It must be run in Rhino.

Bonner answered 21/1, 2010 at 23:0 Comment(0)