Nutch2: How a WebPage Is Written to the Database
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 23:54
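Version: Nutch 2.2.1
This article follows InjectorJob to trace the full life cycle of a WebPage: how it is defined, created, passed along, serialized, and written to the database. The important lines are excerpted from the source code and annotated with the file name and line number where they appear.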
1. Defining the schema
The schema is written directly in the source code:
//file: org/apache/nutch/storage/WebPage.java
//line: 42
public class WebPage extends PersistentBase {
  public static final Schema _SCHEMA = Schema.parse("{\"type\":\"record\",\"name\":\"WebPage\",\"namespace\":\"org.apache.nutch.storage\",\"fields\":[{\"name\":\"baseUrl\",\"type\":\"string\"},{\"name\":\"status\",\"type\":\"int\"},{\"name\":\"fetchTime\",\"type\":\"long\"},{\"name\":\"prevFetchTime\",\"type\":\"long\"},{\"name\":\"fetchInterval\",\"type\":\"int\"},{\"name\":\"retriesSinceFetch\",\"type\":\"int\"},{\"name\":\"modifiedTime\",\"type\":\"long\"},{\"name\":\"prevModifiedTime\",\"type\":\"long\"},{\"name\":\"protocolStatus\",\"type\":{\"type\":\"record\",\"name\":\"ProtocolStatus\",\"fields\":[{\"name\":\"code\",\"type\":\"int\"},{\"name\":\"args\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},{\"name\":\"lastModified\",\"type\":\"long\"}]}},{\"name\":\"content\",\"type\":\"bytes\"},{\"name\":\"contentType\",\"type\":\"string\"},{\"name\":\"prevSignature\",\"type\":\"bytes\"},{\"name\":\"signature\",\"type\":\"bytes\"},{\"name\":\"title\",\"type\":\"string\"},{\"name\":\"text\",\"type\":\"string\"},{\"name\":\"parseStatus\",\"type\":{\"type\":\"record\",\"name\":\"ParseStatus\",\"fields\":[{\"name\":\"majorCode\",\"type\":\"int\"},{\"name\":\"minorCode\",\"type\":\"int\"},{\"name\":\"args\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}},{\"name\":\"score\",\"type\":\"float\"},{\"name\":\"reprUrl\",\"type\":\"string\"},{\"name\":\"headers\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"outlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"inlinks\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"markers\",\"type\":{\"type\":\"map\",\"values\":\"string\"}},{\"name\":\"metadata\",\"type\":{\"type\":\"map\",\"values\":\"bytes\"}},{\"name\":\"batchId\",\"type\":\"string\"}]}");
  //..
  public Schema getSchema() { return _SCHEMA; }
  //...
}
This is a JSON string, and Avro is responsible for parsing it.
2. Passing the schema
This step happens during initialization, before the job is submitted.
//file: org/apache/nutch/crawl/InjectorJob.java
//InjectorJob.run(Map<String,Object>) line: 221
{
  DataStore<String, WebPage> store = StorageUtils.createWebStore(currentJob.getConfiguration(), String.class, WebPage.class);
}
The persistentClass is then passed down layer by layer.
//file: gora-core-0.2.1/org/apache/gora/store/DataStoreFactory.java
//DataStoreFactory.createDataStore(Class<D>, Class<K>, Class<T>, Configuration, String) line: 135
{
  return createDataStore(dataStoreClass, keyClass, persistent, conf, createProps(), schemaName);
}
Gora calls WebPage.getSchema() to obtain the schema.
//file: gora-core-0.2.1/org/apache/gora/store/DataStoreBase.java
//SqlStore<K,T>(DataStoreBase<K,T>).initialize(Class<K>, Class<T>, Properties) line: 81
{
  schema = this.beanFactory.getCachedPersistent().getSchema();
  fieldMap = AvroUtils.getFieldMap(schema);
}
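To make the role of the schema more concrete, here is a minimal sketch of why a store wants a field map. A record can hold its values by position, so an ordered list of field names is what lets the store pair a name with a value. The class FieldMapSketch, its two methods, and the three-field sample record are hypothetical stand-ins using only the JDK. The real code works on Avro field objects through AvroUtils.getFieldMap, whose return type is not shown in the excerpt above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a schema-derived field map; this is not Gora's API.
final class FieldMapSketch {

    // Maps each field name to its position, preserving schema order.
    static Map<String, Integer> fieldPositions(List<String> fieldNames) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i), i);
        }
        return positions;
    }

    // Looks up a value by field name in a record that stores its values by position.
    static Object valueOf(String fieldName, Map<String, Integer> positions, Object[] record) {
        Integer index = positions.get(fieldName);
        if (index == null) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return record[index];
    }

    public static void main(String[] args) {
        // The first three fields of the WebPage schema, in declaration order.
        List<String> fields = Arrays.asList("baseUrl", "status", "fetchTime");
        // Sample values are made up for illustration.
        Object[] record = {"http://example.com/", 0, 0L};
        Map<String, Integer> positions = fieldPositions(fields);
        System.out.println(valueOf("status", positions, record));
    }
}
```

Because the lookup goes through positions taken from the schema, the store never needs compile-time knowledge of the WebPage class itself, which is presumably why the schema is resolved once during initialization.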
3. Passing the data and serializing it
This step happens in the map phase.
The map method creates a WebPage (row) and, at the end, writes it to the context.
//file: org/apache/nutch/crawl/InjectorJob.java
//InjectorJob$UrlMapper.map(LongWritable, Text, Mapper<LongWritable,Text,String,WebPage>.Context) line: 191
{
  context.write(reversedUrl, row);
}
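The key written alongside the row is a reversed URL, so that pages from the same domain end up next to each other in the store. The sketch below shows one plausible way to build such a key using only the JDK. The class ReverseUrlSketch and the exact output layout (reversed host, then scheme, then optional port, then path and query) are assumptions made for illustration; the precise format produced by Nutch may differ.

```java
import java.net.URI;

// Hypothetical illustration of a reversed-URL row key; not Nutch's own code.
final class ReverseUrlSketch {

    // Turns "http://www.example.com/a" into "com.example.www:http/a".
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // Reverse the dot-separated host labels.
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Append the scheme, the port if one was given, then path and query.
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://www.example.com/index.html"));
    }
}
```

With this layout, keys for www.example.com and news.example.com both start with com.example, so a sorted store keeps them adjacent.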
Hadoop core then passes the WebPage down layer by layer.
//file: hadoop-src/org/apache/hadoop/mapred/MapTask.java
//MapTask$NewDirectOutputCollector<K,V>.write(K, V) line: 638
{
  reporter.progress();
  long bytesOutPrev = getOutputBytes(fsStats);
  out.write(key, value);
}
The out object above is of type GoraRecordWriter.
//file: gora-core-0.2.1/org/apache/gora/mapreduce/GoraRecordWriter.java
//GoraRecordWriter<K,T>.write(K, T) line: 60
{
  store.put(key, (Persistent) value);
}
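The actual type of store is SqlStore, which extends gora-core's DataStoreBase class and handles reads and writes against MySQL. K is the primary key and T is a WebPage object. The write goes into a cache first.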
//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.put(K, T) line: 616
public void put(K key, T persistent) {
  List<Field> fields = schema.getFields();
  for (int i = 0; i < fields.size(); i++) {
    Field field = fields.get(i);
    Column column = mapping.getColumn(field.name());
    insertStatement.setObject(persistent.get(i), field.schema(), column);
  }
  //jdbc already should cache the ps
  PreparedStatement insert = insertStatement.toStatement(connection);
  synchronized (writeCache) {
    writeCache.add(insert);
  }
}
toStatement() calls setField() internally. The serialization itself is implemented by Avro and is not explored further here.
//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.setField(PreparedStatement, Column, Schema, int, Object) line: 718
{
  IOUtils.serialize(os, datumWriter, schema, object);
}
4. The flush operation
//file: hadoop-src/org/apache/hadoop/mapred/MapTask.java
//MapTask.runNewMapper(JobConf, TaskSplitIndex, TaskUmbilicalProtocol, TaskReporter) line: 767
{
  output.close(mapperContext);
}
//file: gora-core-0.2.1/org/apache/gora/mapreduce/GoraRecordWriter.java
//GoraRecordWriter<K,T>.close(TaskAttemptContext) line: 55
{
  store.close();
}
Below is the flush() method that is called inside SqlStore.close():
//file: gora-sql-0.1.1-incubating/org/apache/gora/sql/store/SqlStore.java
//SqlStore<K,T>.flush() line: 342
{
  connection.commit();
}
At this point the WebPage has been written to the MySQL database (through JDBC underneath).
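The put, cache, and flush sequence traced in sections 3 and 4 can be summarized with a small in-memory model. This is only a sketch under stated assumptions: the class WriteCacheSketch and all of its members are hypothetical, strings stand in for prepared statements, and a second list stands in for the database, so no JDBC connection is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a buffered store; not Gora's SqlStore.
final class WriteCacheSketch {
    // Statements waiting to be committed (stand-in for the write cache).
    private final List<String> writeCache = new ArrayList<>();
    // Rows that have been committed (stand-in for the database table).
    private final List<String> committed = new ArrayList<>();

    // Builds a statement for the row and buffers it; nothing is committed yet.
    void put(String key, String value) {
        String statement = "INSERT " + key + " -> " + value;
        synchronized (writeCache) {
            writeCache.add(statement);
        }
    }

    // Moves every buffered statement to the committed list, then empties the buffer.
    void flush() {
        synchronized (writeCache) {
            committed.addAll(writeCache);
            writeCache.clear();
        }
    }

    // Closing the store flushes whatever is still buffered.
    void close() {
        flush();
    }

    // Number of statements still waiting in the buffer.
    int pendingCount() {
        synchronized (writeCache) {
            return writeCache.size();
        }
    }

    // Snapshot of the committed rows.
    List<String> committedRows() {
        synchronized (writeCache) {
            return new ArrayList<>(committed);
        }
    }

    public static void main(String[] args) {
        WriteCacheSketch store = new WriteCacheSketch();
        store.put("com.example.www:http/", "row-1");
        System.out.println("pending before close: " + store.pendingCount());
        store.close();
        System.out.println("committed after close: " + store.committedRows().size());
    }
}
```

The point of the model is the timing: rows handed to put() stay invisible until close() triggers flush(), which matches the order of calls in the excerpts above, where the map task closes its output only after the mapper has finished.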