Topic: How to Write Tests for Spark

Source: Internet | Posted by: 新疆人知乎 | Editor: 程序博客网 | Time: 2024/04/30 03:59

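Testing Spark code follows the same process as testing any ordinary program; it only needs a little extra setup. Below is a brief introduction to writing Spark tests, using ScalaTest as the testing tool:

1. Prerequisite: ScalaTest

You do not have to use ScalaTest. JUnit works too, but that would not be very Scala-like. ScalaTest has a DSL flavor that feels uncomfortable when you first meet it, but you get used to it, and it turns out to be quite acceptable.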

Add the dependency

    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_2.10</artifactId>
      <version>2.2.4</version>
      <scope>test</scope>
    </dependency>

Selecting testing styles
FlatSpec is recommended, since it feels more Scala-like. If it is a real struggle, consider the FunSuite style, which should be easier to pick up. All examples below use FlatSpec.
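For comparison, here is a minimal sketch of what the FunSuite style looks like. It assumes the ScalaTest 2.2.x dependency shown earlier; the suite name and the string operation under test are made up for illustration.

```scala
import org.scalatest.FunSuite

// Hypothetical suite: FunSuite tests are named with a plain string and
// use ordinary assert calls, which reads much like xUnit-style tests.
class SplitSuite extends FunSuite {
  test("split breaks a line on whitespace") {
    assert("a b c".split("\\s").toList === List("a", "b", "c"))
  }
}
```

The trade-off is mostly one of taste: FunSuite keeps the test body close to plain assertions, while FlatSpec reads more like a specification sentence.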
Basic test pattern

    import org.scalatest._

    class IPv4Spec extends FlatSpec with Matchers {
      "ipv4" should "retain ip part" in {
        RealtimeTracker.ipv4.findFirstIn("10.201.10.2:4531") should be (Some("10.201.10.2"))
        RealtimeTracker.ipv4.findFirstIn("10.201.10.2") should be (Some("10.201.10.2"))
      }
    }

Remember to name the file xxxxSpec; a colleague says that is the convention in functional languages...
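The spec above calls `RealtimeTracker.ipv4`, but the post never shows its definition. As a rough sketch, assuming it is simply a `scala.util.matching.Regex` that matches a dotted quad, it could look like the following. The object body is a guess for illustration, not the author's actual code.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for the RealtimeTracker object used in IPv4Spec.
object RealtimeTracker {
  // Four groups of 1-3 digits separated by dots. This only extracts the
  // dotted-quad shape; it does not check that each octet is within 0-255.
  val ipv4: Regex = """\d{1,3}(?:\.\d{1,3}){3}""".r
}
```

`findFirstIn` returns an `Option[String]` holding the first match, which is why the spec compares the result against `Some(...)`.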

Running the tests
1. IDE: Both Scala IDE (Eclipse) and IntelliJ IDEA can run tests directly in the IDE. Click an individual test to run just that test, or click the file to run every test in it.
2. Maven: this requires the scalatest plugin, configured as follows

    <!-- disable surefire -->
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.7</version>
        <configuration>
            <skipTests>true</skipTests>
        </configuration>
    </plugin>
    <!-- enable scalatest -->
    <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <version>1.0</version>
        <configuration>
            <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
            <junitxml>.</junitxml>
            <filereports>WDF TestSuite.txt</filereports>
        </configuration>
        <executions>
            <execution>
                <id>test</id>
                <goals>
                    <goal>test</goal>
                </goals>
            </execution>
        </executions>
    </plugin>

2. Writing Spark tests with ScalaTest

2.1 Constructing data

Testing MapReduce without MRUnit means building small input files by hand. Spark offers a simpler way:

  val rdd = sc.parallelize(Seq(1, 2, 3))

With this hand-built dataset, you can start writing tests against RDDs.
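The same approach extends beyond a list of integers. The sketch below builds a small dataset of records and aggregates it. The `Visit` case class and the `bytesPerIp` helper are hypothetical names introduced for illustration, and the code assumes Spark 1.3 or later, where `reduceByKey` is available on pair RDDs without an extra implicit import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Visit(ip: String, bytes: Long)

  // Builds a tiny in-memory dataset and sums the bytes per IP.
  def bytesPerIp(sc: SparkContext): Map[String, Long] = {
    val visits = sc.parallelize(
      Seq(Visit("10.0.0.1", 100L), Visit("10.0.0.2", 50L), Visit("10.0.0.1", 25L)),
      numSlices = 2 // ask for two partitions even though the data is tiny
    )
    visits.map(v => (v.ip, v.bytes)).reduceByKey(_ + _).collectAsMap().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ParallelizeSketch")
    val sc = new SparkContext(conf)
    try println(bytesPerIp(sc))
    finally sc.stop()
  }
}
```

Asking for more than one partition is a deliberate choice here: it makes the test exercise the shuffle in `reduceByKey` rather than a single-partition shortcut.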

2.2 Base test environment

    import org.apache.spark._
    import org.scalatest._

    trait SparkSpec extends BeforeAndAfterAll {
      this: Suite =>

      private val master = "local[2]"
      private val appName = this.getClass.getSimpleName

      private var _sc: SparkContext = _

      def sc: SparkContext = _sc

      val conf: SparkConf = new SparkConf()
        .setMaster(master)
        .setAppName(appName)
        .set("spark.driver.allowMultipleContexts", "true")
        .set("spark.ui.enabled", "false") // disable the web UI

      override def beforeAll(): Unit = {
        super.beforeAll()
        _sc = new SparkContext(conf)
      }

      override def afterAll(): Unit = {
        if (_sc != null) {
          _sc.stop()
          _sc = null
        }
        super.afterAll()
      }
    }

With this trait, any suite that mixes it in gets a usable sc variable (strictly speaking, a method), while the trait takes care of initialization and cleanup.

2.3 Test case

    import org.scalatest._

    class WordCountSpec extends FlatSpec with SparkSpec with Matchers {
      "words" should "be counted" in {
        val counts = sc.parallelize(Seq("a b c", "a b d"))
          .flatMap(line => line.split("\\s").map(s => (s, 1)))
          .reduceByKey(_ + _)
          .collectAsMap()

        counts should contain("a" -> 2)
        counts should contain theSameElementsAs (Map("a" -> 2, "b" -> 2, "c" -> 1, "d" -> 1))
      }
    }

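2.4 Testing other Spark components

While looking for a way to test Spark Streaming, I came across mkuthan's project, which is quite good. Testing Spark Streaming requires controlling when each sample enters the stream, and the project uses a few tricks to control time manually. Likewise, before a Spark Streaming program can be tested, it has to be decomposed with testing in mind; if everything is written in one long main function, there is no way to test it. As spark-unit-testing shows, the first thing to do is to break the program apart: step one is function-level testing of the logic that has nothing to do with Spark, step two is testing the RDD-related functions, and only step three is testing in combination with Streaming. One step at a time; it seems this really cannot be rushed...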
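The first two steps of that decomposition can be sketched as follows. This is an illustrative layout rather than code from mkuthan's project: the `WordCount` object and its method names are assumptions, and the code assumes Spark 1.3 or later. The streaming step is deliberately left out, since it depends on how the job controls time.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical job logic, split so that each layer can be tested alone.
object WordCount {
  // Step 1: plain Scala with no Spark involved; testable with ordinary assertions.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  // Step 2: an RDD-to-RDD function; testable with sc.parallelize on a local master.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(line => tokenize(line)).map(word => (word, 1)).reduceByKey(_ + _)

  // Step 3 (not shown): a streaming job would apply countWords to each batch.
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
    val sc = new SparkContext(conf)
    try countWords(sc.parallelize(Seq("a b c", "a b d"))).collect().foreach(println)
    finally sc.stop()
  }
}
```

Because `main` only wires the pieces together, almost all of the logic can be covered without starting a streaming context at all.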

0 0