spark-streaming-[1] Streaming Basics: NetworkWordCount


一、Programming Framework

Define the context:

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

After a context is defined, you have to do the following.

1. Define the input sources by creating input DStreams.
For example:
    val lines = ssc.socketTextStream("localhost", 9999)

2. Define the streaming computations by applying transformation and output operations to DStreams.
Basic Sources:
[1] File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as streamingContext.textFileStream(dataDirectory).
[2] Streams based on Custom Actors: DStreams can be created with data streams received through Akka actors by using streamingContext.actorStream(actorProps, actor-name).
[3] Queue of RDDs as a Stream.

3. Start receiving data and processing it using streamingContext.start().
4. Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
5. The processing can be manually stopped using streamingContext.stop().
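
Putting the five steps together, here is a minimal sketch; localhost:9999, the 1-second batch interval, and the object name StreamingSkeleton are illustrative placeholders, not values the framework requires:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSkeleton {
      def main(args: Array[String]): Unit = {
        // Define the context (2 local threads: one for the receiver, one for processing)
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSkeleton")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Step 1: define an input source by creating an input DStream
        val lines = ssc.socketTextStream("localhost", 9999)

        // Step 2: define the computation (transformations plus an output operation)
        lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

        ssc.start()            // Step 3: start receiving and processing data
        ssc.awaitTermination() // Step 4: wait until processing is stopped (manually or by an error)
        // ssc.stop()          // Step 5: stop processing manually (e.g. from another thread)
      }
    }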

二、Notes

Set local[n] so that n is greater than the number of stream sources (receivers). When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL: either means only one thread is used for running tasks locally, so if the input DStream is based on a receiver (e.g. sockets, Kafka, Flume), that single thread runs the receiver and no thread is left for processing the received data. When running locally, always use "local[n]" with n greater than the number of receivers to run.

Points to remember
Once a context has been started, no new streaming computations can be set up or added to it. Once a context has been stopped, it cannot be restarted.

Only one StreamingContext can be active in a JVM at the same time. stop() on StreamingContext also stops the SparkContext.

To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.

A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
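
A minimal sketch of this reuse; the names sc, ssc1, ssc2 and the batch intervals are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ReuseSparkContext {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("ReuseSparkContext"))

        // First streaming context, built on the existing SparkContext
        val ssc1 = new StreamingContext(sc, Seconds(1))
        // ... define sources/computations, ssc1.start(), run for a while ...

        // Stop only the StreamingContext; the SparkContext stays alive
        ssc1.stop(stopSparkContext = false)

        // The same SparkContext can back a new StreamingContext
        val ssc2 = new StreamingContext(sc, Seconds(5))
        // ... define new sources/computations, ssc2.start() ...

        sc.stop()
      }
    }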


三、Basic Sources 

[1] TCP socket
[2] File Streams: for reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.)
[3] Streams based on Custom Actors: DStreams can be created with data streams received through Akka actors by using streamingContext.actorStream(actorProps, actor-name).
[4] Queue of RDDs as a Stream
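
A minimal sketch of creating DStreams from these basic sources; the HDFS directory path and the queue element type are placeholders, and the actor-based source is omitted here:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BasicSourcesSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("BasicSourcesSketch")
        val ssc = new StreamingContext(conf, Seconds(1))

        // [1] TCP socket: one line of text per record
        val socketLines = ssc.socketTextStream("localhost", 9999)

        // [2] File stream: monitors a directory for newly created files (placeholder path)
        val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/streaming/input")

        // [4] Queue of RDDs: mainly useful for testing; push RDDs into the queue from the driver
        val rddQueue = new mutable.Queue[RDD[Int]]()
        val queuedInts = ssc.queueStream(rddQueue)

        // ... apply transformations and output operations, then ssc.start() / ssc.awaitTermination() ...
      }
    }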


Test 1: NetCat

Official example: NetworkWordCount

1. Run the program.

2. Start a netcat server: root@sparkmaster:~/streaming# nc -lk 9999

3. Type words into the netcat terminal (e.g. hello word); the per-batch word counts appear on the Spark console, as in the sample output in the trailing comments of the code below.

package com.dt.spark.main.Streaming.tcp

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._

/**
  * Created by hjw on 17/4/17.
  *
  * See sections 一 and 二 above for the programming framework and the points to remember.
  */
object NetworkWordCount {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}

//-------------------------------------------
//Time: 1492433839000 ms
//-------------------------------------------
//(word,1)
//(hello,1)
//-------------------------------------------
//Time: 1492433903000 ms
//-------------------------------------------
//(word,1)
//(hello,1)



Test 2: Socket simulator (spout-style)

Run the simulator below first, then run the NetworkWordCount program above.

The contents of NetworkWordCountData.txt are as follows (note the double spaces between words, which produce the empty-string counts in the output):
hello  world
hello  java
hello  c
hello  c++
hjw   hjw

Program arguments: ./srcFile/NetworkWordCountData.txt 9999 1000

The output is as follows:

-------------------------------------------
Time: 1493642176000 ms
-------------------------------------------
-------------------------------------------
Time: 1493642177000 ms
-------------------------------------------
-------------------------------------------
Time: 1493642178000 ms
-------------------------------------------
(,1)
(hello,1)
(world,1)
-------------------------------------------
Time: 1493642179000 ms
-------------------------------------------
(,1)
(hello,1)
(java,1)
-------------------------------------------
Time: 1493642180000 ms
-------------------------------------------
(,2)
(hjw,2)
-------------------------------------------
Time: 1493642181000 ms
-------------------------------------------
(,1)
(hello,1)
(java,1)

package com.dt.spark.main.Streaming.tcp

import java.io.PrintWriter
import java.net.ServerSocket

import scala.io.Source

/**
  * Created by hjw on 17/5/1.
  */
object StreamingSimulation {

  /*
  Returns a random index in [0, length)
   */
  def index(length: Int) = {
    import java.util.Random
    val rdm = new Random()
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }

    val filename = args(0)
    val lines = Source.fromFile(filename).getLines().toList
    val fileRow = lines.length

    // Listen on the given port and establish a connection for each incoming request
    val listener = new ServerSocket(args(1).toInt)

    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connect from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(args(2).toLong)
            // Send a randomly chosen line to the client
            val content = lines(index(fileRow))
            println(content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
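
The simulator stands in for the nc -lk 9999 server from Test 1: it listens on the given port, serves each client connection on its own thread, and writes one randomly chosen line from the input file every <millisecond> milliseconds, so the NetworkWordCount program above can consume it unchanged through socketTextStream("localhost", 9999).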


