Nutch draft


The Crawl Database is a data store where Nutch keeps every URL it knows about, together with the metadata for that URL.

 

In Hadoop terms it is a SequenceFile (meaning all records
are stored sequentially) consisting of tuples of URL and
CrawlDatum.

 

Operations (inserts, deletes and updates) on the Crawl Database,
like those on the other Nutch data structures, are processed in
batch mode.
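For illustration, a single crawldb entry in the layout that `bin/nutch readdb <crawldb> -dump <dir>` produces might look roughly like the following (the URL and all values here are made up):

```text
http://example.com/	Version: 7
Status: 2 (db_fetched)
Fetch time: Tue May 21 13:57:00 UTC 2024
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
```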

 

 

The Link Database is a data structure (a SequenceFile of URL ->
Inlinks tuples) that contains all inverted links.

 

In the parsing phase Nutch
can extract outlinks from a document and store them in the form
source_url -> target_url, anchor_text.

 

Inject

The Inject command in Nutch has one responsibility: injecting more
URLs into the Crawl Database. Normally you should collect a set of URLs
to add and then process them in one batch, to keep the cost of a single
insert small.

 

Job 1: convert plain text into URL, CrawlDatum tuples and dedupe (MapReduce job)

Job 2: merge with the existing CrawlDB and dedupe (MapReduce job)
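The two jobs above can be sketched very loosely in plain Python (rather than MapReduce); the dicts stand in for the real SequenceFile structures, and a CrawlDatum is reduced to a status and a score:

```python
# Conceptual sketch of Inject: parse seed lines into (url, datum)
# tuples, dedupe them, then merge with the existing CrawlDB.

def inject(crawldb, seed_lines):
    # Job 1: plain text -> (url, datum) tuples; the dict dedupes URLs
    new_tuples = {}
    for line in seed_lines:
        url = line.strip()
        if url and not url.startswith("#"):
            new_tuples[url] = {"status": "db_unfetched", "score": 1.0}

    # Job 2: merge with the existing CrawlDB; an already-known URL
    # keeps its existing datum (the merge is also a dedupe step)
    merged = dict(new_tuples)
    merged.update(crawldb)  # existing entries win
    return merged

crawldb = {"http://example.com/": {"status": "db_fetched", "score": 1.2}}
seeds = ["http://example.com/", "http://example.org/", "# comment", ""]
result = inject(crawldb, seeds)
```

The already-known URL keeps its fetched status and score, while only the genuinely new URL is added as unfetched.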

Generate

The Generate command in Nutch is used to generate a list of URLs
to fetch from the Crawl Database. URLs with the highest scores are
preferred.
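The score-based selection can be sketched as follows (a simplification: the real Generator also applies URL filters, per-host limits and fetch-time checks, all omitted here):

```python
# Conceptual sketch of Generate: pick the top-N unfetched URLs from
# the CrawlDB by score to build a fetch list.

def generate(crawldb, top_n):
    candidates = [(url, datum) for url, datum in crawldb.items()
                  if datum["status"] == "db_unfetched"]
    # highest score first
    candidates.sort(key=lambda item: item[1]["score"], reverse=True)
    return [url for url, _ in candidates[:top_n]]

crawldb = {
    "http://a.example/": {"status": "db_unfetched", "score": 0.5},
    "http://b.example/": {"status": "db_unfetched", "score": 2.0},
    "http://c.example/": {"status": "db_fetched",   "score": 9.9},
}
fetch_list = generate(crawldb, top_n=1)
```

Note that the already-fetched URL is skipped even though it has the highest score.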

 

 

Fetch

The Fetcher is responsible for fetching content from URLs and writing
it to disk. It also optionally parses the content. URLs are read from
a fetch list produced by the Generate step.

 

Parse

The Parser reads the raw fetched content, parses it, and stores the results.

 

UpdateDB

 

The UpdateDB command reads the CrawlDatums from a segment (including
the newly extracted URLs) and merges them into the existing CrawlDB.
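The merge can be sketched like this (again a rough simplification of the real MapReduce job: statuses are plain strings, and score updating is ignored):

```python
# Conceptual sketch of UpdateDB: fold the CrawlDatums produced during a
# segment's fetch/parse back into the CrawlDB. Fetched pages get their
# status updated; newly discovered outlinks are added as unfetched.

def updatedb(crawldb, segment_datums):
    merged = dict(crawldb)
    for url, datum in segment_datums.items():
        if url in merged and datum["status"] == "db_unfetched":
            continue  # known URL rediscovered as an outlink: keep existing record
        merged[url] = datum
    return merged

crawldb = {"http://example.com/": {"status": "db_unfetched"}}
segment = {
    "http://example.com/":      {"status": "db_fetched"},    # was fetched
    "http://example.com/about": {"status": "db_unfetched"},  # new outlink
}
db = updatedb(crawldb, segment)
```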

 

Invert links

Inverts the link information, so that the anchor texts from other
documents that point to a document can be used together with the rest
of that document's data.
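The inversion itself is a simple flip of the outlink map, which can be sketched as:

```python
# Conceptual sketch of "invert links": outlinks recorded at parse time
# as source_url -> (target_url, anchor_text) are flipped into an
# inlinks map target_url -> [(source_url, anchor_text), ...],
# which is the shape of the LinkDB.

from collections import defaultdict

def invert_links(outlinks):
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)

outlinks = {
    "http://a.example/": [("http://b.example/", "see B")],
    "http://c.example/": [("http://b.example/", "B again")],
}
linkdb = invert_links(outlinks)
```

After inversion, all anchor texts pointing at `http://b.example/` are available in one place.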

 

 

 
