Spark Learning 03: WordCount with Sorting (Java)
Source: Internet · Editor: 程序博客网 · Date: 2024/06/07
WordCount counts how many times each word appears in a body of text and then sorts the words by count. Find an English article on the web and save it to a local file (here, e:\words.txt).
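Before looking at the Spark version, it may help to see the same count-then-sort logic in plain Java. The sketch below (illustrative only — plain java.util.stream, no Spark; the class and method names are made up for this example) mirrors what the job computes: split into words, count per word, sort by count descending, keep the top N.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PlainWordCount {

    // Count each word in the text, then return the top-n entries by descending count.
    static Map<String, Long> topWords(String text, int n) {
        return Arrays.stream(text.split("\\s+"))
                .filter(w -> !w.isEmpty())
                // group identical words and count them
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()))
                .entrySet().stream()
                // sort entries by count, largest first
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                // LinkedHashMap preserves the sorted order
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(topWords("to be or not to be", 2));
    }
}
```

Spark performs the same steps, but distributed across partitions instead of on one in-memory map.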
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.fei</groupId>
  <artifactId>word-count</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.0.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
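The pom above pins Spark 1.3.0 on Scala 2.10, which is quite dated. If you want a newer build, staying on the Spark 1.x line keeps the FlatMapFunction signature used in the code below (in 1.x its call returns an Iterable; in Spark 2.x it returns an Iterator, so the flatMap lambda would need `.iterator()` appended). One plausible combination (an assumption, not tested against this exact code) would be:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.3</version>
</dependency>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>3.8.1</version>
  <configuration>
    <source>1.8</source>
    <target>1.8</target>
  </configuration>
</plugin>
```

The newer compiler plugin mainly matters because very old plugin versions predate Java 8 and handle the 1.8 source/target values less reliably.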
WordCount.java
package com.fei;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

/**
 * Word count: count each word, sort in descending order,
 * and print the top 10 words with their counts.
 * @author Jfei
 */
public class WordCount {

    public static void main(String[] args) {
        // 1. Local mode: create the Spark configuration and context
        SparkConf conf = new SparkConf().setAppName("wordCount").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 2. Read the local file into an RDD of lines
        JavaRDD<String> linesRDD = sc.textFile("e:\\words.txt");

        // 3. Words are separated by whitespace: split each line into words.
        //    "\\s+" collapses runs of whitespace so no empty tokens are produced.
        JavaRDD<String> wordsRDD = linesRDD.flatMap(s -> Arrays.asList(s.split("\\s+")));
        // Equivalent anonymous-class version:
        /*
        JavaRDD<String> wordsRDD = linesRDD.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split("\\s+"));
            }
        });
        */

        // 4. Map each word to a key-value pair (word, 1)
        JavaPairRDD<String, Integer> wordsPairRDD =
                wordsRDD.mapToPair(s -> new Tuple2<String, Integer>(s, 1));
        // Equivalent anonymous-class version:
        /*
        JavaPairRDD<String, Integer> wordsPairRDD = wordsRDD.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        */

        // 5. Sum the counts for each word
        JavaPairRDD<String, Integer> wordsCountRDD = wordsPairRDD.reduceByKey((a, b) -> a + b);
        // Equivalent anonymous-class version:
        /*
        JavaPairRDD<String, Integer> wordsCountRDD = wordsPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        */

        // 6. sortByKey can only sort by key, so swap key and value:
        //    (word, count) -> (count, word)
        JavaPairRDD<Integer, String> wordsCountRDD2 =
                wordsCountRDD.mapToPair(s -> new Tuple2<Integer, String>(s._2, s._1));
        // Equivalent anonymous-class version:
        /*
        JavaPairRDD<Integer, String> wordsCountRDD2 = wordsCountRDD.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> t) throws Exception {
                return new Tuple2<Integer, String>(t._2, t._1);
            }
        });
        */

        // 7. Sort by key (the count) in descending order
        JavaPairRDD<Integer, String> wordsCountRDD3 = wordsCountRDD2.sortByKey(false);

        // 8. Take only the first 10 entries
        List<Tuple2<Integer, String>> result = wordsCountRDD3.take(10);

        // 9. Print each word and its count
        result.forEach(t -> System.out.println(t._2 + " " + t._1));

        sc.close();
    }
}
If your JDK is older than 1.8, adjust the <source>/<target> values in pom.xml accordingly and replace the lambda expressions with the anonymous-class versions shown in the comments.
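One detail worth flagging in step 3: the choice between splitting on "\\s" (a single whitespace character) and "\\s+" (a run of whitespace) matters. With "\\s", consecutive spaces produce empty tokens that would each be counted as a bogus "word". The snippet below (plain Java, no Spark needed) shows the difference.

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "hello   world"; // three spaces between the words

        // "\\s" splits at every single whitespace char -> empty tokens appear in between
        System.out.println(Arrays.toString(line.split("\\s")));

        // "\\s+" treats the whole whitespace run as one delimiter -> clean tokens
        System.out.println(Arrays.toString(line.split("\\s+")));
    }
}
```

In the Spark job this shows up as a high count for the empty string, which can even crowd a real word out of the top 10.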