Spark Streaming: reading data from Kafka for real-time word counting, and counting URL occurrences
1. Create a Maven project
For the creation steps, see: http://blog.csdn.net/tototuzuoquan/article/details/74571374
2. Start Kafka
A: Install the Kafka cluster: http://blog.csdn.net/tototuzuoquan/article/details/73430874
B: Create the topic, etc.: http://blog.csdn.net/tototuzuoquan/article/details/73430874
3. Write the POM file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.toto.spark</groupId>
    <artifactId>bigdata</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.6</scala.version>
        <spark.version>1.6.2</spark.version>
        <hadoop.version>2.6.4</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>cn.toto.spark.FlumeStreamingWordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
4. Write the code
package cn.toto.spark

import cn.toto.spark.streams.LoggerLevels
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by toto on 2017/7/13.
 * Reads data from Kafka and counts words.
 */
object KafkaWordCount {
  /**
   * String      : the word
   * Seq[Int]    : the word's counts in the current batch
   * Option[Int] : the accumulated historical count
   */
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    //iter.flatMap(it=>Some(it._2.sum + it._3.getOrElse(0)).map(x=>(it._1,x)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(i => (x, i)) }
  }

  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    // The args are passed from IDEA via "Program arguments", e.g.:
    //   hadoop11:2181,hadoop12:2181,hadoop13:2181 g1 wordcount 1
    // (note: the topic passed here must already exist)
    // They are received in an array:
    //   zkQuorum   : the ZooKeeper quorum of the cluster
    //   group      : the consumer group
    //   topics     : the Kafka topic(s)
    //   numThreads : the number of receiver threads
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("E:\\wordcount\\outcheckpoint")
    // "alog-2016-04-16,alog-2016-04-17,alog-2016-04-18"
    // "Array((alog-2016-04-16, 2), (alog-2016-04-17, 2), (alog-2016-04-18, 2))"
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Persist to memory and disk, serialized
    val data: ReceiverInputDStream[(String, String)] =
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK_SER)
    // Data read from Kafka comes as (key, value) pairs; _._2 is the value
    val words = data.map(_._2).flatMap(_.split(" "))
    val wordCounts = words.map((_, 1)).updateStateByKey(updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
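To make the semantics of the update function concrete, here is a minimal, Spark-free sketch that applies the same logic to hand-built batch data; the sample words and counts are made up for illustration:

object UpdateFuncDemo {
  // Same shape as the updateFunc above: (word, counts in this batch, previous total)
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.flatMap { case (word, batchCounts, prevTotal) =>
      Some(batchCounts.sum + prevTotal.getOrElse(0)).map(total => (word, total))
    }
  }

  def main(args: Array[String]): Unit = {
    // "spark" appeared twice in this batch and 3 times before; "kafka" has no history
    val input = Iterator(
      ("spark", Seq(1, 1), Some(3)),
      ("kafka", Seq(1), None)
    )
    updateFunc(input).foreach(println)
    // prints: (spark,5) and (kafka,1)
  }
}

Each key's new total is its count in the current batch plus whatever total was accumulated before; keys with no history start from zero. This is why the job needs a checkpoint directory: the historical totals must survive across batches.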
5. Configure the run parameters in IDEA:
Parameter description:

hadoop11:2181,hadoop12:2181,hadoop13:2181 g1 wordcount 1

hadoop11:2181,hadoop12:2181,hadoop13:2181 : the ZooKeeper quorum addresses
g1        : the consumer group
wordcount : the Kafka topic
1         : the number of threads (1)
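As a side note on how these arguments are consumed: inside the job, the topic list and thread count are folded into a Map that is handed to KafkaUtils.createStream. A standalone sketch using the values from above:

object TopicMapDemo {
  def main(args: Array[String]): Unit = {
    val topics = "wordcount"   // the topic argument from above
    val numThreads = "1"       // the thread-count argument from above
    // Each topic is mapped to the number of receiver threads to use for it
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    println(topicMap) // Map(wordcount -> 1)
  }
}

Passing a comma-separated list such as "topicA,topicB" would consume several topics at once, each with the same thread count.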
6. Create the Kafka topic and send messages to it
Start Kafka:
[root@hadoop1 kafka]# pwd
/home/tuzq/software/kafka/servers/kafka
[root@hadoop1 kafka]# bin/kafka-server-start.sh config/server.properties 1>/dev/null 2>&1 &
Create the topic:
[root@hadoop1 kafka]# bin/kafka-topics.sh --create --zookeeper hadoop11:2181 --replication-factor 1 --partitions 1 --topic wordcount
Created topic "wordcount".
List the topics:
bin/kafka-topics.sh --list --zookeeper hadoop11:2181
Start a producer and send some messages (my Kafka brokers run on hadoop1, hadoop2, and hadoop3):
[root@hadoop1 kafka]# bin/kafka-console-producer.sh --broker-list hadoop1:9092 --topic wordcount
No safe wading in an unknown water
Anger begins with folly,and ends in repentance
No safe wading in an unknown water
Anger begins with folly,and ends in repentance
Anger begins with folly,and ends in repentance
Run the program with spark-submit:

# Start the Spark Streaming application
bin/spark-submit --class cn.toto.spark.KafkaWordCount /root/streaming-1.0.jar hadoop11:2181 group1 wordcount 1

Note that KafkaWordCount hardcodes setMaster("local[2]"), so the job still runs in local mode even when launched this way; to run it on a cluster, remove that call from the code and pass --master to spark-submit instead.
7. View the results of the run
8. Another example: counting URL occurrences
package cn.toto.spark

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by toto on 2017/7/14.
 */
object UrlCount {
  val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(n => (x, n)) }
  }

  def main(args: Array[String]) {
    // Receive the command-line arguments
    val Array(zkQuorum, groupId, topics, numThreads, hdfs) = args
    // Create the SparkConf and set the app name
    val conf = new SparkConf().setAppName("UrlCount")
    // Create the StreamingContext
    val ssc = new StreamingContext(conf, Seconds(2))
    // Set the checkpoint directory
    ssc.checkpoint(hdfs)
    // Build the topic info
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Pull data from Kafka to create the DStream
    val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap,
      StorageLevel.MEMORY_AND_DISK).map(_._2)
    // Split each line and extract the URL the user clicked
    val urls = lines.map(x => (x.split(" ")(6), 1))
    // Count the clicks per URL
    val result = urls.updateStateByKey(updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    // Print the result to the console
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
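The extraction step x.split(" ")(6) assumes each Kafka message is an access-log-style line in which the requested URL is the seventh whitespace-separated field, as in the Apache common log format. A small standalone sketch of that extraction; the sample log line is made up for illustration and is not from the original post:

object UrlFieldDemo {
  def main(args: Array[String]): Unit = {
    // A made-up access-log line in common log format
    val line = """192.168.1.10 - - [14/Jul/2017:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 1024"""
    // Index 6 of the space-split fields is the requested URL
    val url = line.split(" ")(6)
    println(url) // prints: /index.html
  }
}

If the incoming lines use a different layout, the field index (and possibly the delimiter) must be adjusted accordingly, or the job will count the wrong field or throw an ArrayIndexOutOfBoundsException on short lines.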