Suggestions for building a search engine using Django

I'm new to web crawling. I'm going to build a search engine whose crawler saves Rapidshare links together with the URL of the page where each link was found.

In other words, I'm going to build a website similar to filestube.com.

After some searching, I've found that Scrapy works with Django. I've also tried to find out about integrating Nutch with Django, but found nothing.

I hope you can give me suggestions for building this kind of website, especially the crawler.

Doubles answered 7/1, 2011 at 15:5 Comment(0)

The best-known pluggable app for that is Django-Haystack, which lets you connect to several search backends:

  • Solr / Lucene, the buzzword-compliant Apache Foundation project
  • Whoosh, a native Python search library
  • Xapian, another very good semantic search engine

Haystack gives you an API that looks like Django's own QuerySet syntax, so you can use these search engines directly even though each of them has its own API and dialect.

If you're just after scraping tools, then whichever tool you use, BeautifulSoup or Scrapy, you'll be on your own: you write Python code that parses what you want to parse and then populates your Django models.
This can even live in separate Python scripts, exposed as custom management commands (modules under an app's management/commands/ package).
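
As a rough illustration of the "parse, then populate" step, here is a minimal sketch that uses only the standard library rather than BeautifulSoup or Scrapy. The names `FoundLink`, `RapidshareLinkParser` and `extract_links` are made up for this example, and `FoundLink` is only a stand-in for whatever Django model you define.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


@dataclass(frozen=True)
class FoundLink:
    """Hypothetical stand-in for a Django model row."""
    link_url: str    # the Rapidshare link itself
    source_url: str  # the page on which the link was found


class RapidshareLinkParser(HTMLParser):
    """Collects <a href> values that point at a rapidshare.com host."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before checking the host.
        absolute = urljoin(self.source_url, href)
        host = urlparse(absolute).hostname or ""
        if host == "rapidshare.com" or host.endswith(".rapidshare.com"):
            self.found.append(FoundLink(absolute, self.source_url))


def extract_links(html, source_url):
    """Return the Rapidshare links found in one already-downloaded page."""
    parser = RapidshareLinkParser(source_url)
    parser.feed(html)
    parser.close()
    return parser.found
```

In a Django project, each `FoundLink` would presumably become a model instance saved through the ORM, and `extract_links` could be called from a management command that walks the pages you have fetched.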

If you have a lot of files to search, you will probably need an index that is rebuilt frequently and allows fast searches without hitting the Django ORM.
Using a Solr index (for example) lets you create extra fields on the fly, such as virtual fields derived from your real model's fields (e.g. splitting the author's first and last name, adding an uppercased file title field, and so on).
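
To make the idea of a separate index with derived fields concrete, here is a toy in-memory inverted index in plain Python. It is not how Solr, Whoosh or Xapian work internally, only a sketch of the principle; the record layout (`id`, `title`, `author`) and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_document(record):
    """Copy a record and add derived ("virtual") fields that live only in the index."""
    first, _, last = record["author"].partition(" ")
    return {
        **record,
        "author_first": first,
        "author_last": last,
        "title_upper": record["title"].upper(),
    }


def build_index(records):
    """Map every token of the title and author fields to the ids containing it."""
    index = defaultdict(set)
    documents = {}
    for record in records:
        doc = build_document(record)
        documents[doc["id"]] = doc
        for field in ("title", "author"):
            for token in tokenize(doc[field]):
                index[token].add(doc["id"])
    return index, documents


def search(index, query):
    """Return the sorted ids of records that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))
```

A lookup is then a handful of set intersections and never touches the database, which is the reason to keep an index around and rebuild it periodically from the models.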

Of course, if you don't need fast indexing, keyword boosting or semantic analysis, you can still do a classic full-text search over a couple of Django model fields.
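
The snippet that originally followed is missing, so what comes next is a stand-in, not the answer's own code. In Django such a search is typically an ORM filter that ORs `icontains` lookups on each field with `Q` objects; the sketch below shows roughly the equivalent query using the standard library's `sqlite3` module instead of the ORM, so that it runs on its own. The `files` table, its `title` and `description` columns, and the function names are assumptions.

```python
import sqlite3


def escape_like(term):
    """Escape LIKE wildcards so the user's text is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_files(conn, query):
    """Return (id, title) rows whose title or description contains the query.

    SQLite's LIKE is case-insensitive for ASCII characters by default.
    """
    pattern = "%" + escape_like(query) + "%"
    sql = (
        "SELECT id, title FROM files "
        "WHERE title LIKE ? ESCAPE '\\' OR description LIKE ? ESCAPE '\\' "
        "ORDER BY id"
    )
    return conn.execute(sql, (pattern, pattern)).fetchall()
```

This kind of substring match scans every row, which is fine for a small table but is exactly what a dedicated search index avoids.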

Kym answered 7/1, 2011 at 17:21 Comment(1)
BeautifulSoup is damn slow and dead :) Scrapy is better, and it uses etree. (Anthropology)

Have you checked DjangoItem? It's an experimental Scrapy feature, but it's known to work.

Bullbat answered 12/1, 2011 at 2:59 Comment(0)
