Parsing a JSON string in Scala, and two ways to implement word count in Scala
Source: Internet  Editor: 程序博客网  Time: 2024/06/03 09:25
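Older Scala releases ship a built-in JSON parser, scala.util.parsing.json.JSON (from Scala 2.11 on it lives in the separate scala-parser-combinators module, where it is deprecated).
JSON.parseFull(jsonString: String) parses a JSON string and returns an Option[Any]: Some(map: Map[String, Any]) when the input is a valid JSON object, Some(list: List[Any]) when it is a valid JSON array, and None when parsing fails.
So the parse result can be handled with pattern matching: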
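- import scala.util.parsing.json.JSON
-
- val str2 = "{\"et\":\"kanqiu_client_join\",\"vtm\":1435898329434,\"body\":{\"client\":\"866963024862254\",\"client_type\":\"android\",\"room\":\"NBA_HOME\",\"gid\":\"\",\"type\":\"\",\"roomid\":\"\"},\"time\":1435898329}"
-
- val b = JSON.parseFull(str2)
- b match {
-   // Matches when str2 is valid JSON whose top level is an object
-   // (the type arguments cannot be checked at runtime because of erasure, hence @unchecked)
-   case Some(map: Map[String, Any] @unchecked) => println(map)
-   // parseFull returns None when the string is not valid JSON
-   case None => println("Parsing failed")
-   // Valid JSON with a different top level, for example an array
-   case other => println("Unknown data structure: " + other)
- }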
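Once the match succeeds, the nested fields still have to be dug out of a Map[String, Any]. The following is a minimal sketch of that step, not part of the original post. To keep it runnable without the parser dependency, `parsed` is a hand-built stand-in for the value JSON.parseFull would return for a trimmed version of the string above (objects become Map[String, Any], numbers become Double), and `bodyField` is a hypothetical helper name.

```scala
object ParsedJsonAccess {
  // Hand-built stand-in (assumption) for the result of JSON.parseFull on a JSON object:
  // objects are Map[String, Any], arrays are List[Any], numbers are Double.
  val parsed: Option[Any] = Some(Map[String, Any](
    "et" -> "kanqiu_client_join",
    "vtm" -> 1435898329434.0,
    "body" -> Map[String, Any](
      "client" -> "866963024862254",
      "client_type" -> "android",
      "room" -> "NBA_HOME"
    )
  ))

  // Hypothetical helper: look up a string field inside the nested "body" object.
  def bodyField(result: Option[Any], key: String): Option[String] =
    result match {
      case Some(top: Map[_, _]) =>
        top.asInstanceOf[Map[String, Any]].get("body") match {
          case Some(body: Map[_, _]) =>
            // Keep the value only if it really is a String
            body.asInstanceOf[Map[String, Any]].get(key).collect { case s: String => s }
          case _ => None // no "body" field, or it is not an object
        }
      case _ => None // parse failure (None) or a non-object top level such as an array
    }

  def main(args: Array[String]): Unit = {
    println(bodyField(parsed, "client_type")) // Some(android)
    println(bodyField(parsed, "missing"))     // None
  }
}
```

Returning Option[String] instead of casting directly means a missing key or an unexpected value type shows up as None rather than as a ClassCastException.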
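Word count over an in-memory list, implemented two ways:
- import scala.collection.mutable
-
- val lines = List("hello world", "hello spark")
- val wordlist = lines.flatMap(line => line.split(" ")).map(word => (word, 1))
- // Method 1: groupBy first, then map each group to its size
- wordlist.groupBy(_._1).map {
-   case (word, list) => (word, list.size)
- }.foreach(println)
-
- // Method 2: map-reduce style with aggregate, which avoids the intermediate per-word lists that groupBy builds
- // (on Scala 2.13+ aggregate is deprecated for sequential collections; foldLeft(zero)(seqop) does the same job)
- // seqop folds one (word, 1) pair into the running counts
- val seqop = (result: mutable.HashMap[String, Int], wordcount: (String, Int)) => {
-   val addOne = (wordcount._1, result.getOrElse(wordcount._1, 0) + wordcount._2)
-   result += addOne
- }
- // combop merges two partial results; counts for the same word are summed, not overwritten as with ++=
- val combop = (result1: mutable.HashMap[String, Int], result2: mutable.HashMap[String, Int]) => {
-   result2.foreach { case (word, n) => result1(word) = result1.getOrElse(word, 0) + n }
-   result1
- }
- val result = wordlist.aggregate(mutable.HashMap[String, Int]())(seqop, combop)
- println(result)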
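On a sequential collection the combine function passed to aggregate is never called, so the second method can also be written as a plain left fold. Here is a minimal sketch of that variant, not taken from the original post, using an immutable Map; `WordCountFold` and `count` are illustrative names.

```scala
object WordCountFold {
  // Count words with a single left fold over an immutable Map.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split every line into words
      .filter(_.nonEmpty)      // drop empty tokens caused by repeated spaces
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      }

  def main(args: Array[String]): Unit =
    println(count(List("hello world", "hello spark")))
}
```

The immutable Map keeps the fold free of shared mutable state. The mutable.HashMap version updates in place and may allocate less on large inputs.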
Word count over lines read from a file:
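- import scala.collection.mutable
- import scala.io.Source
-
- val source = Source.fromFile("test.txt")
- val lines = source.getLines()
- // seqop splits one line into words and folds them into the running counts
- val seqop = (result: mutable.HashMap[String, Int], line: String) => {
-   // Treat , . ( ) as separators, split on spaces, and drop empty tokens
-   val wordcount = line.replace(",", " ").replace(".", " ").replace("(", " ").replace(")", " ").split(" ").filter(_.trim.length > 0).map(word => (word, 1))
-   wordcount.foreach(wc => {
-     val addOne = (wc._1, result.getOrElse(wc._1, 0) + wc._2)
-     result += addOne
-   })
-   result
- }
- // combop merges two partial results; counts for the same word are summed, not overwritten as with ++=
- val combop = (result1: mutable.HashMap[String, Int], result2: mutable.HashMap[String, Int]) => {
-   result2.foreach { case (word, n) => result1(word) = result1.getOrElse(word, 0) + n }
-   result1
- }
-
- val test = lines.aggregate(mutable.HashMap[String, Int]())(seqop, combop)
- println(test)
- // Release the file handle once the iterator has been consumed
- source.close()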
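Note that lines is an iterator that reads the file lazily, one line at a time. Do not call toList on it, otherwise reading a large file may run out of memory.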
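The streaming approach can be packaged so that the file is always closed, even if counting fails halfway. This is a minimal sketch under stated assumptions, not the original author's code: `StreamingWordCount` and `countFile` are illustrative names, the file is assumed to be UTF-8, and the separator regex covers the same characters as the replace chain above (space, comma, period, parentheses).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

object StreamingWordCount {
  // Fold over the line iterator directly; the file is never materialized as a List.
  def countFile(path: Path): Map[String, Int] = {
    val source = Source.fromFile(path.toFile, "UTF-8")
    try {
      source.getLines().foldLeft(Map.empty[String, Int]) { (counts, line) =>
        // Split on runs of space , . ( ) and drop the empty token a leading separator leaves
        line.split("[ ,.()]+").filter(_.nonEmpty).foldLeft(counts) { (acc, word) =>
          acc.updated(word, acc.getOrElse(word, 0) + 1)
        }
      }
    } finally source.close() // always release the file handle
  }

  def main(args: Array[String]): Unit = {
    // Write a small temporary file so the example is self-contained
    val tmp = Files.createTempFile("wordcount", ".txt")
    Files.write(tmp, "hello world, hello (spark).\nhello scala".getBytes(StandardCharsets.UTF_8))
    try println(countFile(tmp))
    finally Files.delete(tmp)
  }
}
```

Only one line and the running counts are held in memory at a time, so memory use grows with the number of distinct words, not with the size of the file.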