nutch draft
Source: Internet  Editor: 程序博客网  Date: 2024/05/21 13:57
The Crawl Database is a data store where Nutch keeps every URL it knows
about, together with the metadata for that URL.
In Hadoop terms it is a Sequence file (meaning all records
are stored in a sequential manner) consisting of tuples of URL and
CrawlDatum.
Operations (such as inserts, deletes and updates) on the Crawl
Database and other data are processed in batch mode. Here is an example
of the contents of crawldb:
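As an illustration only, the sketch below models the Crawl Database as a list of URL -> CrawlDatum tuples and applies a batch of operations in one pass. The `CrawlDatum` fields (`status`, `score`), the sample URLs and the `apply_batch` helper are assumptions made for this example; they are not Nutch's actual classes and not real crawldb output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlDatum:
    # Hypothetical metadata fields; the real Nutch record carries more.
    status: str = "unfetched"
    score: float = 1.0


def apply_batch(crawldb, batch):
    """Apply a list of (op, url, datum) operations in one pass.

    Returns a new list of (url, CrawlDatum) tuples sorted by URL,
    mimicking a sequentially stored file that is rewritten per batch.
    """
    records = dict(crawldb)
    for op, url, datum in batch:
        if op == "delete":
            records.pop(url, None)
        else:  # "insert" and "update" both write the record
            records[url] = datum
    return sorted(records.items())


crawldb = [
    ("http://example.com/", CrawlDatum("fetched", 1.0)),
    ("http://example.com/about", CrawlDatum()),
]
batch = [
    ("insert", "http://example.org/", CrawlDatum()),
    ("update", "http://example.com/about", CrawlDatum("fetched", 0.5)),
    ("delete", "http://example.com/", None),
]
new_crawldb = apply_batch(crawldb, batch)
for url, datum in new_crawldb:
    print(url, datum)
```

Rewriting the whole store once per batch, rather than once per operation, is the point of the batch mode described above.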
The Link Database is a data structure (Sequence file, URL ->
Inlinks) that contains all inverted links.
In the parsing phase Nutch
can extract outlinks from a document and store them in the format
source_url -> target_url, anchor_text.
Inject
The Inject command in Nutch has one responsibility: inject more
URLs into the Crawl Database. Normally you should collect a set of URLs
to add and then process them in one batch, so that the cost per inserted
URL stays small.
Job 1: Convert plain text into URL, CrawlDatum tuples and deduplicate (MapReduce task)
Job 2: Merge with the existing CrawlDB and deduplicate (MapReduce task)
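The two jobs can be sketched in plain Python, without MapReduce. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of a datum, and the rule that an already known URL keeps its existing record are all hypothetical choices for the example, not Nutch's implementation.

```python
def seeds_to_tuples(seed_text, initial_score=1.0):
    """Job 1 (sketch): plain-text lines -> deduplicated {url: datum}."""
    tuples = {}
    for line in seed_text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # setdefault keeps the first occurrence, which removes duplicates
        tuples.setdefault(url, {"status": "unfetched", "score": initial_score})
    return tuples


def merge_into_crawldb(crawldb, injected):
    """Job 2 (sketch): merge injected tuples with the existing CrawlDB.

    Assumption: a URL that is already known keeps its existing record.
    """
    merged = dict(injected)
    merged.update(crawldb)  # existing records win over injected ones
    return merged


seeds = "http://a.example/\nhttp://b.example/\nhttp://a.example/\n\n"
injected = seeds_to_tuples(seeds)
crawldb = {"http://a.example/": {"status": "fetched", "score": 2.0}}
merged = merge_into_crawldb(crawldb, injected)
print(merged)
```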
Generate
The Generate command in Nutch is used to generate a list of URLs
to fetch from the Crawl Database. URLs with the highest scores are
preferred.
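A minimal sketch of score-based selection follows. It assumes the Crawl Database is a dictionary of URL to a datum with a `score` field, and `generate_fetch_list` and `top_n` are hypothetical names. Selection here uses the score only, which is a simplification for the example.

```python
import heapq


def generate_fetch_list(crawldb, top_n):
    """Return up to top_n URLs, highest score first (sketch)."""
    best = heapq.nlargest(
        top_n, crawldb.items(), key=lambda item: item[1]["score"]
    )
    return [url for url, _ in best]


crawldb = {
    "http://a.example/": {"score": 0.2},
    "http://b.example/": {"score": 0.9},
    "http://c.example/": {"score": 0.5},
}
fetch_list = generate_fetch_list(crawldb, 2)
print(fetch_list)
```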
Fetch
The Fetcher is responsible for fetching content from URLs and writing
it to disk. It can also optionally parse the content. URLs are read from
the fetch list produced by the Generator.
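The fetch step can be approximated with the standard library. In this sketch the `fetch_all` and `http_get` names, the SHA-1 based file naming and the `(status, detail)` result format are assumptions for illustration; the real Fetcher writes into segments and handles politeness, threading and protocols that are not shown here.

```python
import hashlib
import urllib.request
from pathlib import Path


def http_get(url, timeout=10):
    """Download the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(fetch_list, out_dir, get=http_get):
    """Fetch every URL in fetch_list and write the content to out_dir.

    Returns {url: ("fetched", file_path)} or {url: ("failed", reason)}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for url in fetch_list:
        try:
            content = get(url)
        except OSError as exc:  # urllib.error.URLError subclasses OSError
            results[url] = ("failed", str(exc))
            continue
        # Hypothetical naming scheme: one file per URL, named by its hash.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".bin"
        path = out_dir / name
        path.write_bytes(content)
        results[url] = ("fetched", str(path))
    return results
```

Passing the download function as the `get` parameter keeps the loop testable without network access, for example `fetch_all(urls, "segment/content", get=my_stub)`.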
Parse
The Parser reads the raw fetched content, parses it and stores the results.
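For HTML content, a parse step of this kind could look roughly like the sketch below, which extracts the page text and the outlinks with their anchor texts. The `OutlinkParser` class and the `parse` function are hypothetical names built on Python's standard `html.parser`; Nutch itself uses its own parser plugins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collects page text and (target_url, anchor_text) outlinks."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self.text_parts = []
        self._href = None   # target of the <a> tag currently open, if any
        self._anchor = []   # text pieces seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.outlinks.append((self._href, "".join(self._anchor).strip()))
            self._href = None


def parse(base_url, raw_bytes, encoding="utf-8"):
    """Parse raw fetched bytes into text and outlinks (sketch)."""
    parser = OutlinkParser(base_url)
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())
    return {"text": text, "outlinks": parser.outlinks}


result = parse(
    "http://example.com/dir/",
    b'<p>See <a href="page.html">the page</a> now.</p>',
)
print(result)
```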
UpdateDB
The UpdateDB command reads the CrawlDatums from a segment (the extracted
URLs) and merges them into the existing CrawlDB.
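A rough sketch of such a merge is shown below. The merge rules are assumptions made for the example: a fetch result overwrites the status of its URL but keeps the score already on record, and a URL that was only discovered as a link (status `"linked"`, a hypothetical marker) is added as unfetched when it is not yet known.

```python
def update_db(crawldb, segment_datums):
    """Merge (url, datum) pairs from a segment into a copy of the CrawlDB."""
    updated = {url: dict(datum) for url, datum in crawldb.items()}
    for url, datum in segment_datums:
        if datum["status"] == "linked":
            # Newly discovered URL: add it only if it is not known yet.
            updated.setdefault(
                url, {"status": "unfetched", "score": datum["score"]}
            )
        else:
            # Fetch result: take the new status, keep the recorded score.
            old = updated.get(url, {})
            updated[url] = {
                "status": datum["status"],
                "score": old.get("score", datum["score"]),
            }
    return updated


crawldb = {"http://a.example/": {"status": "unfetched", "score": 1.0}}
segment = [
    ("http://a.example/", {"status": "fetched", "score": 0.0}),
    ("http://b.example/", {"status": "linked", "score": 0.5}),
    ("http://a.example/", {"status": "linked", "score": 0.1}),
]
updated = update_db(crawldb, segment)
print(updated)
```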
Invert links
Inverts the link information so that the anchor texts of other
documents pointing to a document can be used together with the rest of
that document's data.
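The inversion itself is a regrouping of outlink records by their target. The sketch below assumes outlinks arrive as (source_url, target_url, anchor_text) triples; `invert_links` and `anchors_for` are hypothetical helper names, and a real run would do this grouping as a MapReduce job rather than in memory.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn (source, target, anchor) triples into {target: [(source, anchor)]}."""
    inlinks = defaultdict(list)
    for source, target, anchor in outlinks:
        inlinks[target].append((source, anchor))
    return dict(inlinks)


def anchors_for(inlinks, url):
    """Anchor texts that other documents use when pointing at url."""
    return [anchor for _, anchor in inlinks.get(url, []) if anchor]


outlinks = [
    ("http://a.example/", "http://c.example/", "great crawler"),
    ("http://b.example/", "http://c.example/", "Nutch notes"),
    ("http://a.example/", "http://b.example/", ""),
]
linkdb = invert_links(outlinks)
print(anchors_for(linkdb, "http://c.example/"))
```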