Spark Components: Spark Streaming Learning 2 -- StatefulNetworkWordCount
More code is available at: https://github.com/xubo245/SparkLearning
The example is run the same way as described in: http://blog.csdn.net/xubo245/article/details/51251970
1. Understanding

Unlike NetworkWordCount, StatefulNetworkWordCount keeps per-key state, so the word counts accumulate across batches instead of covering only the current batch.

The accumulation is implemented with updateStateByKey, which is passed the newUpdateFunc function:

val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

newUpdateFunc:

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}
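The per-key logic inside newUpdateFunc can be sketched in plain Scala, with no Spark required: updateFunc sums the key's new counts from the current batch and adds them to the previously accumulated state (0 if the key has never been seen). The object name UpdateFuncSketch below is illustrative only; the function body matches the example's own updateFunc.

```scala
// Sketch of the per-key update used by updateStateByKey:
// sum this batch's counts for a key and add the previous state (0 if none).
object UpdateFuncSketch {
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  def main(args: Array[String]): Unit = {
    // a key seen 3 times in this batch, 2 times before -> 5 in total
    assert(updateFunc(Seq(1, 1, 1), Some(2)) == Some(5))
    // a key seen for the first time starts from 0
    assert(updateFunc(Seq(1), None) == Some(1))
    println("ok")
  }
}
```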
2. Running

Start nc and the example in separate terminals.

Input:
hadoop@Mcnode6:~$ nc -lk 9999
hellOworldhellowaaaaaaaaaaahellohellohellohelloaaaaaabbbbbbb
-------------------------------------------
Time: 1461662247000 ms
-------------------------------------------
(hellO,1)
(,1)
(hello,5)
(waaa,1)
(world,2)
(a,13)
(ahello,1)

-------------------------------------------
Time: 1461662250000 ms
-------------------------------------------
(b,6)
(hellO,1)
(,2)
(hello,5)
(waaa,1)
(world,2)
(a,13)
(ahello,1)
This is the cumulative WordCount at work: the second batch's output keeps the counts from the first batch and adds the newly received words (e.g. (b,6)).
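The accumulation across batches can also be sketched without Spark: fold each batch's word counts into a running map, starting from the initial state (the example's initialRDD seeds ("hello",1) and ("world",1)). The names CumulativeSketch and merge are illustrative, not part of the Spark API.

```scala
// Sketch: cumulative word counting across batches, mirroring updateStateByKey.
object CumulativeSketch {
  // Fold one batch of words into the running per-word state.
  def merge(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    val counts = batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    counts.foldLeft(state) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0) + n)
    }
  }

  def main(args: Array[String]): Unit = {
    val initial = Map("hello" -> 1, "world" -> 1) // like initialRDD
    val afterBatch1 = merge(initial, Seq("hello", "hello", "world"))
    assert(afterBatch1 == Map("hello" -> 3, "world" -> 2))
    // The second batch builds on the first batch's state, not on an empty map.
    val afterBatch2 = merge(afterBatch1, Seq("b", "b", "hello"))
    assert(afterBatch2 == Map("hello" -> 4, "world" -> 2, "b" -> 2))
    println("ok")
  }
}
```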
3. Source code:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.Streaming.learning

import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions

import scala.Option.option2Iterable

/**
 * Counts words cumulatively in UTF8 encoded, '\n' delimited text received from the network every
 * second starting with initial value of word count.
 * Usage: StatefulNetworkWordCount <hostname> <port>
 *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
 *   data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example
 *      org.apache.spark.examples.streaming.StatefulNetworkWordCount localhost 9999`
 */
object StatefulNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited test (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a Dstream made of state (which is the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println