Reading the Contents of the Webpage Table

Source: Internet  Published by: 优淘网源码  Editor: 程序博客网  Date: 2024/05/17 03:10



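Nutch stores the information it extracts from crawled pages in HBase, in a table named $crawlId_webpage by default. The cell values are stored as serialized binary data, so a plain scan in the HBase shell, or a read through the Java API, only returns raw bytes that are displayed as hex escapes.
Nutch therefore provides the readdb command, which dumps the contents of the table to a text file.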

Usage:

$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry

Example:
(1) Contents of seed.txt:
http://www.163.com

(2) Run the following command to inject the seed URL:
 bin/nutch inject seed.txt -crawlId test001

(3) Scan the table. The values are not human-readable:

hbase(main):002:0> scan 'test001_webpage'
ROW                     COLUMN+CELL
 com.163.money:http/    column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
 com.163.money:http/    column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
 com.163.money:http/    column=mk:_injmrk_, timestamp=1423550107073, value=y
 com.163.money:http/    column=mk:dist, timestamp=1423550107073, value=0
 com.163.money:http/    column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
 com.163.money:http/    column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
1 row(s) in 0.4090 seconds
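The escaped values in the scan are not random: they look like fixed-width binary numbers. The following Python sketch decodes three of them by hand. It assumes (this is an inference, not something the post states) that the values are big-endian, that f:fi is a 4-byte integer, f:ts an 8-byte integer, and s:s a 4-byte float. The helper name unescape_hbase is made up for this illustration.

```python
import struct


def unescape_hbase(shell_value: str) -> bytes:
    """Convert an HBase-shell value such as "?\\x80\\x00\\x00" into raw bytes.

    The shell prints printable ASCII bytes as themselves and every other
    byte as a \\xNN escape, so both forms have to be handled.
    """
    out = bytearray()
    i = 0
    while i < len(shell_value):
        if shell_value.startswith("\\x", i) and i + 4 <= len(shell_value):
            out.append(int(shell_value[i + 2:i + 4], 16))  # \xNN escape
            i += 4
        else:
            out.append(ord(shell_value[i]))  # printable ASCII byte
            i += 1
    return bytes(out)


# Values copied from the scan output above (raw strings keep the backslashes).
# Assumption: big-endian integers and an IEEE 754 big-endian float.
fetch_interval = int.from_bytes(unescape_hbase(r"\x00'\x8D\x00"), "big")
fetch_time = int.from_bytes(unescape_hbase(r"\x00\x00\x01Kr2\xC7\xD6"), "big")
score = struct.unpack(">f", unescape_hbase(r"?\x80\x00\x00"))[0]

print(fetch_interval, fetch_time, score)
```

Under these assumptions the three values come out as 2592000, 1423550105558 and 1.0, which agree with the fetchInterval, fetchTime and score fields that readdb reports in step (5). Decoding by hand like this is only practical for simple numeric columns; for everything else readdb is the more convenient route.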


(4) Dump the table contents to /mnt/jediael/2:
bin/nutch readdb -dump /mnt/jediael/2 -crawlId test001 -content

(5) Inspect the contents of /mnt/jediael/2:
$ ll
total 4
-rwxrwxrwx. 1 jediael jediael 344 Feb 10 14:41 part-r-00000
-rwxrwxrwx. 1 jediael jediael   0 Feb 10 14:41 _SUCCESS

$ cat part-r-00000
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1423550105558
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
metadata _csh_ :        ?锟
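If the dump needs to be processed further, the "name: value" layout is easy to load into dictionaries. The Python sketch below is only an illustration: the function parse_dump and both regular expressions are hypothetical, and it assumes that each record starts with a "<url> key: <row key>" line and that every following field sits on its own line. Multi-line fields such as the raw content written by -content are not handled.

```python
import re

# First line of a record: "<url>   key:    <reversed-url row key>"
HEADER = re.compile(r"^(\S+)\s+key:\s+(\S+)$")
# Any other line: "<field name>:   <value>" (the name may contain spaces)
FIELD = re.compile(r"^(.+?)\s*:\s+(.*)$")


def parse_dump(text: str) -> list:
    """Parse readdb -dump output into a list of dicts, one per record."""
    records = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # A header line opens a new record.
            current = {"url": header.group(1), "key": header.group(2)}
            records.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return records


# A shortened copy of the record shown above; the tab separators are an assumption.
sample = (
    "http://money.163.com/\tkey:\tcom.163.money:http/\n"
    "status:\t0 (null)\n"
    "fetchTime:\t1423550105558\n"
    "fetchInterval:\t2592000\n"
    "score:\t1.0\n"
    "marker _injmrk_ :\ty\n"
    "marker dist :\t0\n"
)

for record in parse_dump(sample):
    print(record["url"], record["fetchTime"], record["score"])
```

All values stay as strings; convert them (for example int(record["fetchTime"])) only where a numeric type is actually needed.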






