Web Crawler Engine used by Kentico 10

Asked 31/8, 2017 at 16:14 Answered 13/12, 2017 at 21:18

Is there more information available about the web crawler technology/engine used by Kentico 10, as described in the documentation page "Configuring Page Crawler Indexes"?

I'm asking because I'd like to consider it for a custom crawler project that sits outside of Kentico but remains inherently compatible with the Kentico platform.

Quaggy answered 31/8, 2017 at 16:14 Comment(0)

As far as I can tell from the Kentico 10 source code, the crawler used by Kentico SmartSearch is completely proprietary. It doesn't use any third-party library.

It downloads the page content using System.Net.HttpWebRequest. The full content is fed back into the SmartSearch indexer as a string. After that it goes through text extraction and is fed to Lucene for indexing.
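To illustrate the text-extraction step in general terms, here is a minimal Python sketch that turns a downloaded HTML string into plain text suitable for a full-text indexer. This is not Kentico's implementation (which is .NET and proprietary); the names `TextExtractor` and `extract_text` are made up for the example, and it assumes that dropping script and style content is all the cleanup you need.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

The resulting string is what you would hand to the indexer in place of the raw markup.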

It's not going to be easy to have Kentico SmartSearch use an external crawler. We usually stay away from the crawler because it is rather expensive to execute compared to the standard index that pulls data straight from the database.

Kentico supports executing some scheduled tasks in a Windows service but not the search tasks.

Note that Kentico SmartSearch doesn't actually crawl the site by discovering links. It uses the content tree to figure out what content it needs to index. If you want to index other content, for example from a system you integrate with, you need to implement a custom search service as described here.

One thing that would work is to have an external process crawl whatever content you want to index and put the raw HTML content into storage. Then write a custom SmartSearch index that pulls the data from storage for indexing within Kentico. If you're indexing content managed by Kentico, you could take that to the next level by hooking into document events. That should allow you to crawl pages only when they're updated.
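A rough Python sketch of the external half of that idea follows: a process that downloads a list of URLs and stores the raw HTML keyed by URL. Everything here is an assumption for illustration. SQLite stands in for "storage", and the table name `pages` and the functions `open_store`, `fetch_html` and `crawl_into_store` are hypothetical. The Kentico side (the custom SmartSearch index that reads from this storage) is not shown and would be written in C#.

```python
import sqlite3
import urllib.request
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the SQLite store that holds raw HTML per URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, fetched_at TEXT NOT NULL)"
    )
    return conn


def fetch_html(url, timeout=10):
    """Download a page and decode it using the declared charset (UTF-8 fallback)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl_into_store(conn, urls, fetch=fetch_html):
    """Fetch each URL and upsert its raw HTML; returns the number of pages stored."""
    stored = 0
    for url in urls:
        try:
            html = fetch(url)
        except OSError:
            # Network and HTTP errors from urllib derive from OSError; skip the page.
            continue
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html, fetched_at) VALUES (?, ?, ?)",
            (url, html, datetime.now(timezone.utc).isoformat()),
        )
        stored += 1
    conn.commit()
    return stored
```

Because the URL is the primary key, re-crawling a page overwrites its previous copy, which fits the "crawl only when updated" approach: a document event handler would only need to pass the changed page's URL to the crawler.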

Duckling answered 13/12, 2017 at 21:18 Comment(0)

-1

Kentico uses Lucene .NET. It's a great solution for stand-alone projects. I used it to power a custom web API hosted in Azure.

Mike

Backflow answered 31/8, 2017 at 21:50 Comment(1)

Kentico search is called SmartSearch and uses Lucene as the search engine but the question asks what web crawler Kentico uses. AFAIK Lucene can't crawl pages. – Quaggy 1/9, 2017 at 2:3

-1

Lucene uses Nutch (http://nutch.apache.org/), an open-source web crawler, to index web content. It's part of the overall framework that Lucene offers.

Foreign answered 1/9, 2017 at 4:33 Comment(3)

Lucene does not use Nutch but Nutch used to use Lucene. – Extraneous 1/9, 2017 at 7:1

Your link doesn't verify that. However, it does answer your question: Nutch is a web crawler used within the Nutch, Lucene and Solr landscape. I still answered your original question correctly. You're free to downvote it. groups.drupal.org/lucene-nutch-and-solr – Foreign 1/9, 2017 at 14:11

In reference to the original question, can you prove that Kentico 10 uses Nutch to crawl? – Quaggy 17/10, 2017 at 19:42
