Dissecting The Nutch Crawler - Introduction
English original: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy
Introduction
The open-source Nutch search engine consists, very roughly, of three components:
- the crawler, which discovers and retrieves web pages
- the WebDB, a custom database that stores known URLs and fetched page contents
- the indexer, which dissects pages and builds keyword-based indexes from them
This document describes the operation of the crawler. We begin with theory and drill down into the details needed to create a customized crawler.
Nutch is implemented in Java, so basic knowledge of the language is assumed.
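To make the three-part split above concrete, here is a minimal sketch of how a crawler, a page store, and an indexer might hand data to one another. It is not Nutch code: the class `NutchPipelineSketch`, the `WebDb` stand-in, the `crawl` and `index` methods, and the in-memory "fake web" map are all hypothetical names invented for illustration, and the real WebDB is a custom on-disk database rather than a pair of Java collections.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of the crawler / WebDB / indexer split; not the Nutch API.
public class NutchPipelineSketch {

    /** Stand-in for the WebDB: known URLs plus fetched page contents. */
    static class WebDb {
        final Set<String> knownUrls = new LinkedHashSet<>();
        final Map<String, String> contents = new LinkedHashMap<>();
    }

    /** Stand-in for the crawler: discovers URLs and retrieves pages from a fake web. */
    static void crawl(WebDb db, Map<String, String> fakeWeb, String seed) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        db.knownUrls.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String page = fakeWeb.get(url);
            if (page == null) {
                continue; // URL is known but could not be retrieved
            }
            db.contents.put(url, page);
            // Assumption: any whitespace-separated token starting with "http://" is a link.
            for (String token : page.split("\\s+")) {
                if (token.startsWith("http://") && db.knownUrls.add(token)) {
                    frontier.add(token);
                }
            }
        }
    }

    /** Stand-in for the indexer: maps each keyword to the URLs whose content contains it. */
    static Map<String, Set<String>> index(WebDb db) {
        Map<String, Set<String>> idx = new TreeMap<>();
        for (Map.Entry<String, String> e : db.contents.entrySet()) {
            for (String token : e.getValue().toLowerCase().split("\\s+")) {
                if (token.isEmpty() || token.startsWith("http://")) {
                    continue; // skip blanks and link tokens
                }
                idx.computeIfAbsent(token, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("http://a.example/", "nutch crawler http://b.example/");
        fakeWeb.put("http://b.example/", "nutch indexer http://c.example/");

        WebDb db = new WebDb();
        crawl(db, fakeWeb, "http://a.example/");
        Map<String, Set<String>> idx = index(db);

        System.out.println("Known URLs: " + db.knownUrls);
        System.out.println("Pages mentioning 'nutch': " + idx.get("nutch"));
    }
}
```

The point of the sketch is only the division of labour: the crawler writes into the store, and the indexer reads from it without needing to know how pages were fetched.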