I have followed this article: https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr, but I want to check whether I have understood how the Nutch steps work.
1). Inject: In this step, Nutch reads the URL list from the given seed.txt, compares each URL against the patterns in regex-urlfilter.txt, and updates the crawldb with the URLs that pass the filter.
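For reference, this is how I run inject and the kind of rule I have in conf/regex-urlfilter.txt (the domain below is just an example, not my real one):

    # urls/seed.txt holds one URL per line
    bin/nutch inject crawl/crawldb urls

    # conf/regex-urlfilter.txt: only URLs matching a "+" pattern are accepted
    +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/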
2). Generate: bin/nutch generate crawl/crawldb crawl/segments. Nutch takes URLs from the crawldb and creates a fetch list of the URLs that are due to be fetched. It accepts options such as -topN, a time gap, etc., and then creates a directory named with the current timestamp under segments.
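For example, I run it like this (the -topN value is just what I chose):

    # build a fetch list of at most 1000 top-scoring URLs that are due for fetching
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    # this creates a new timestamped directory, e.g. crawl/segments/20240101123456/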
I believe there is no interaction with the internet in the first two steps; everything happens locally.
Q: Where is the fetch list kept?
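My guess (please confirm) is that it is written inside the newly created segment, in its crawl_generate subdirectory (as a Hadoop SequenceFile, as far as I can tell), because I can see it with readseg (the segment name below is just an example):

    # show counts of generated/fetched/parsed URLs for the segment
    bin/nutch readseg -list crawl/segments/20240101123456

    # dump only the generated fetch list, skipping the other segment parts
    bin/nutch readseg -dump crawl/segments/20240101123456 seg_dump \
        -nocontent -nofetch -noparse -noparsedata -noparsetext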
3). Fetch: bin/nutch fetch crawl/segments/<segmentName>
Fetch works through the fetch list and downloads the content (and URLs) from the listed URLs, keeping it somewhere.
Q: Does fetch read the whole page at the given URL (the text plus the other URLs it contains)? Q: Where does Nutch keep the fetched data?
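This is how I invoke it for one segment; my assumption is that the downloaded data is stored inside that same segment (in its content and crawl_fetch subdirectories), which I have been inspecting like this (segment name is just an example):

    # fetch everything on this segment's fetch list
    bin/nutch fetch crawl/segments/20240101123456

    # dump the raw fetched content to plain files for inspection
    bin/nutch readseg -dump crawl/segments/20240101123456 fetch_dump \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext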
4). Parse: bin/nutch parse crawl/segments/<segmentName>
It parses the fetched entries.
Q: What is meant by "parse" here? Q: Where can I find the result of this step?
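My understanding (which may be wrong) is that parsing means extracting the plain text, the metadata, and the outlinks from each fetched page, and that the results end up in the segment's parse_text, parse_data, and crawl_parse subdirectories. I looked at the extracted text like this (segment name is just an example):

    # parse the fetched content of this segment
    bin/nutch parse crawl/segments/20240101123456

    # dump only the extracted plain text (parse_text) of each page
    bin/nutch readseg -dump crawl/segments/20240101123456 parse_dump \
        -nocontent -nofetch -nogenerate -noparse -noparsedata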
5). Update: bin/nutch updatedb crawl/crawldb crawl/segments/<segmentName>
When this is complete, Nutch updates the crawldb with the results of the fetch.
Q: Does it update the crawldb with the parsed data only, or with something else as well?
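This is the command I run, and afterwards I check the crawldb with readdb; at least the status counts (db_unfetched, db_fetched, ...) change (segment name is just an example):

    # merge this segment's fetch/parse results back into the crawldb
    bin/nutch updatedb crawl/crawldb crawl/segments/20240101123456

    # show counts of URLs in the crawldb by status
    bin/nutch readdb crawl/crawldb -stats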
Please help clear up my doubts.