Apache Nutch: Get outlink URL's text context - McMap


Apache Nutch: Get outlink URL's text context

Asked 9/3, 2014 at 14:47 Answered 10/3, 2014 at 10:22

Solved apache hadoop web-scraping nutch


1

7

Does anyone know an efficient way to extract the text context that wraps an outlink URL? For example, given this sample text containing an outlink:

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus the sentence before it and the sentence after it. Is there a way to do this efficiently? Are there any methods I can invoke to get something like the position of the link within the fetched content? Or is there a part of the Nutch code I can modify to do this? Thanks!

Dodge answered 9/3, 2014 at 14:47 Comment(0)


4

What you want to do is web scraping. Python and Hadoop both offer tools for that. To achieve it, you can use selectors.

Here are some examples of how to do that using Python Scrapy:

On Hadoop, the best way to go is to implement a crawler using selectors:

Cascading can be used to address the URL you specify:

Hadoop and Cascading

After having the data, you can also use R to optimize analysis:

If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at Hue Beeswax, an interactive tool that is very useful for data analysis.
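To make the idea from the question concrete, here is a minimal sketch of getting the sentence around each link, plus its neighbours. It does not use Nutch or Scrapy; it only uses Python's standard `html.parser`, and the names `LinkContextParser` and `link_contexts` as well as the example.org URLs are made up for illustration. It records the character offset of every `<a href>` in the extracted text, then splits the text into sentences with a naive regex, so abbreviations like "e.g." will confuse it.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Hypothetical helper: collects text and the offset of each <a href>."""

    def __init__(self):
        super().__init__()
        self.parts = []    # text chunks in document order
        self.length = 0    # number of characters collected so far
        self.links = []    # (href, start_offset, end_offset) per link
        self._open = None  # (href, start_offset) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, self.length)

    def handle_data(self, data):
        # Note: this also collects <script>/<style> text; filter if needed.
        self.parts.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html, window=1):
    """Return (href, context) pairs; context spans `window` sentences
    before and after the sentence in which the link starts."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)

    # Naive sentence boundaries: whitespace that follows '.', '!' or '?'.
    spans, pos = [], 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((pos, match.start()))
        pos = match.end()
    spans.append((pos, len(text)))

    results = []
    for href, start, _end in parser.links:
        # First sentence whose end lies beyond the link's start offset.
        idx = next((i for i, (_, e) in enumerate(spans) if e > start),
                   len(spans) - 1)
        lo = max(0, idx - window)
        hi = min(len(spans) - 1, idx + window)
        results.append((href, text[spans[lo][0]:spans[hi][1]]))
    return results


sample = (
    "<p>Nutch can run on a single machine, but gains a lot of its strength "
    "from running in a Hadoop cluster. You can download Nutch "
    '<a href="http://example.org/download">here</a>. For more information '
    "about Apache Nutch, please see the "
    '<a href="http://example.org/wiki">Nutch wiki</a>.</p>'
)

for href, context in link_contexts(sample):
    print(href, "->", context)
```

The same approach should carry over to a Java parse step inside a crawler: keep a running text offset while walking the DOM, note the offset at each anchor, and cut the sentence window afterwards. How to wire that into Nutch itself is not covered here.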

Urethrectomy answered 10/3, 2014 at 10:22 Comment(0)
