Learning Crunch (Part 1)
Source: Internet  Time: 2024/06/06 11:45
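I have recently been learning Crunch.
First, the official documentation: http://crunch.apache.org/user-guide.html
I started with the Getting Started guide,
and then moved on to the user guide.
Here is a brief summary, kept as study notes.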
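- How a DoFn runs: it is first handed the TaskInputOutputContext that supplies its input, then its initialize method is called, then process is called for each input to run the logic and write results through an Emitter<T>, and finally there is a cleanup method.
- While running, a DoFn can call getConfiguration(), progress() and setStatus(String status); counters are available too, through increment(String groupName, String counterName).
- Configuring a DoFn's execution plan: the scaleFactor method returns a float. According to the user guide it is an estimate of how large the DoFn's output is relative to its input, which the planner uses when laying out the job (for example when sizing the reduce phase). I have not tested this myself yet. The configure(Configuration conf) method lets a DoFn add its own settings to the job configuration.
- Some commonly used DoFn features:
- MapFn is a DoFn that emits exactly one output for each input (a one-to-one mapping).
- Calling by on a PCollection with a MapFn (plus the PType of the key) turns the PCollection into key-value form: the MapFn generates the key, and the value is the original element of the PCollection.
- The by method above returns a PTable. A PTable has mapKeys and mapValues methods, which transform the table's keys and values respectively.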
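The call order in the first note can be tried out without a cluster. The sketch below is only a model: SketchDoFn, the nested Emitter, CountingTokenizer and run are stand-ins I made up for illustration, not Crunch's own classes. They mirror the order of calls (initialize once, process once per record, cleanup once) and a counter in the spirit of increment(groupName, counterName).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-ins for illustration only; these are NOT Crunch's classes.
public class DoFnLifecycleSketch {

    /** Stand-in for an emitter: receives whatever the function outputs. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Stand-in for a DoFn with the lifecycle hooks named in the notes. */
    abstract static class SketchDoFn<S, T> {
        final Map<String, Long> counters = new TreeMap<>();

        void initialize() {}                             // once, before the first record
        abstract void process(S input, Emitter<T> out);  // once per record
        void cleanup(Emitter<T> out) {}                  // once, after the last record

        /** Adds one to the counter identified by group and name. */
        void increment(String groupName, String counterName) {
            counters.merge(groupName + ":" + counterName, 1L, Long::sum);
        }
    }

    /** Splits lines into words and emits a summary record during cleanup. */
    static class CountingTokenizer extends SketchDoFn<String, String> {
        private int seen;

        @Override
        void initialize() {
            seen = 0;
        }

        @Override
        void process(String line, Emitter<String> out) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line splits into one empty token
                }
                seen++;
                increment("words", "emitted");
                out.emit(word);
            }
        }

        @Override
        void cleanup(Emitter<String> out) {
            out.emit("total=" + seen);
        }
    }

    /** Drives one function through initialize, process (per input), cleanup. */
    static <S, T> List<T> run(SketchDoFn<S, T> fn, List<S> inputs) {
        List<T> results = new ArrayList<>();
        Emitter<T> emitter = results::add;
        fn.initialize();
        for (S input : inputs) {
            fn.process(input, emitter);
        }
        fn.cleanup(emitter);
        return results;
    }

    public static void main(String[] args) {
        CountingTokenizer fn = new CountingTokenizer();
        System.out.println(run(fn, Arrays.asList("hello crunch", "hello hadoop")));
        System.out.println(fn.counters);
    }
}
```

In a real job the framework plays the role of run, so a DoFn subclass only overrides the hooks it needs.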
Code attached:
package com.hit.crunch;

import java.util.Set;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableSet;

public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // Input and output paths are hard-coded here instead of being read from args.
        String[] paths = new String[] { "/zhonghui/input", "/zhonghui/output" };
        Configuration conf = new Configuration();
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        // Pass the job's exit code on to the shell.
        System.exit(ToolRunner.run(conf, new WordCount(), paths));
    }

    @Override
    public int run(String[] args) throws Exception {
        String inputPath = args[0];
        String outPath = args[1];

        Pipeline pipeline = new MRPipeline(WordCount.class, getConf());
        // Read the input as one String per line.
        PCollection<String> lines = pipeline.readTextFile(inputPath);
        // Split every line into words.
        PCollection<String> words = lines.parallelDo(new Tokenizer(),
                Writables.strings());
        // Drop stop words, then count the occurrences of each remaining word.
        PCollection<String> noStopWords = words.filter(new StopWordFilter());
        PTable<String, Long> counts = noStopWords.count();
        pipeline.writeTextFile(counts, outPath);

        // done() runs the pipeline and cleans up afterwards.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
}

class StopWordFilter extends FilterFn<String> {
    // English stop words, borrowed from Lucene.
    private static final Set<String> STOP_WORDS = ImmutableSet
            .copyOf(new String[] { "a", "and", "are", "as", "at", "be", "but",
                    "by", "for", "if", "in", "into", "is", "it", "no", "not",
                    "of", "on", "or", "s", "such", "t", "that", "the", "their",
                    "then", "there", "these", "they", "this", "to", "was",
                    "will", "with" });

    @Override
    public boolean accept(String word) {
        return !STOP_WORDS.contains(word);
    }
}

class Tokenizer extends DoFn<String, String> {
    private static final Splitter SPLITTER = Splitter.onPattern("\\s+")
            .omitEmptyStrings();

    @Override
    public void process(String line, Emitter<String> emitter) {
        for (String word : SPLITTER.split(line)) {
            emitter.emit(word);
        }
    }
}
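The WordCount job above does not use by, mapKeys or mapValues, so here is a small plain-Java model of what the notes say those methods do. Everything in it (KeyedTableSketch and its three static helpers) is made up for illustration and works on ordinary lists. It is not Crunch's API, where these are methods on PCollection and PTable that take a MapFn and a PType.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of the table operations in the notes; NOT Crunch's API.
public class KeyedTableSketch {

    /** Models by(): derive a key from each element, keep the element as the value. */
    static <K, V> List<Map.Entry<K, V>> by(List<V> values, Function<V, K> keyFn) {
        List<Map.Entry<K, V>> table = new ArrayList<>();
        for (V value : values) {
            table.add(new SimpleEntry<>(keyFn.apply(value), value));
        }
        return table;
    }

    /** Models mapKeys(): transform every key, leave the values untouched. */
    static <K, K2, V> List<Map.Entry<K2, V>> mapKeys(
            List<Map.Entry<K, V>> table, Function<K, K2> fn) {
        List<Map.Entry<K2, V>> result = new ArrayList<>();
        for (Map.Entry<K, V> entry : table) {
            result.add(new SimpleEntry<>(fn.apply(entry.getKey()), entry.getValue()));
        }
        return result;
    }

    /** Models mapValues(): transform every value, leave the keys untouched. */
    static <K, V, V2> List<Map.Entry<K, V2>> mapValues(
            List<Map.Entry<K, V>> table, Function<V, V2> fn) {
        List<Map.Entry<K, V2>> result = new ArrayList<>();
        for (Map.Entry<K, V> entry : table) {
            result.add(new SimpleEntry<>(entry.getKey(), fn.apply(entry.getValue())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");

        // Key every word by its first letter; the word itself stays as the value.
        List<Map.Entry<String, String>> keyed = by(words, w -> w.substring(0, 1));
        System.out.println(keyed);

        // Replace each value with its length; keys are unchanged.
        System.out.println(mapValues(keyed, v -> v.length()));

        // Upper-case each key; values are unchanged.
        System.out.println(mapKeys(keyed, k -> k.toUpperCase()));
    }
}
```

Running main prints [a=apple, a=avocado, b=banana], then [a=5, a=7, b=6], then [A=apple, A=avocado, B=banana]. Duplicate keys are kept as separate entries, which is why the model uses a list of entries rather than a Map.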
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>crunch</groupId>
  <artifactId>crunch</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.crunch/crunch-core -->
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>0.14.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-test</artifactId>
      <version>0.14.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-hbase</artifactId>
      <version>0.14.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-spark</artifactId>
      <version>0.14.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-examples</artifactId>
      <version>0.14.0</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.6</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>