examples / Dataset Wordcount


https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Wordcount.html


In this example, we take lines of text and split them up into words. Next, we count the number of occurrences of each word in the set using a variety of Spark APIs.
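Because the Dataset API mirrors Scala's collection operations, the same pipeline can first be sketched on a plain local collection — a standalone illustration (the sample lines are assumed to match the file written below; no Spark required):

```scala
// Standalone sketch of the wordcount logic on a local Scala collection.
val lines = Seq("Hello hello world", "Hello how are you world")

val counts = lines
  .flatMap(_.split(" "))                        // split each line on spaces
  .filter(_ != "")                              // drop empty words
  .groupBy(identity)                            // group identical words together
  .map { case (word, ws) => word -> ws.size }   // count each group

// e.g. counts("world") == 2 and counts("Hello") == 2,
// while "hello" (lower case) only appears once.
```

The Spark version below follows exactly this shape, swapping the local `Seq` for a distributed `Dataset[String]`.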




dbutils.fs.put("/home/spark/1.6/lines","""
Hello hello world
Hello how are you world
""", true)
Wrote 43 bytes.
res0: Boolean = true



import org.apache.spark.sql.functions._

// Load a text file and interpret each line as a java.lang.String
val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]
val result = ds
  .flatMap(_.split(" "))               // Split on whitespace
  .filter(_ != "")                     // Filter empty words
  .toDF()                              // Convert to DataFrame to perform aggregation / sorting
  .groupBy($"value")                   // Count number of occurrences of each word
  .agg(count("*") as "numOccurances")
  .orderBy($"numOccurances" desc)      // Show most common words first

display(result)
value numOccurances
world 2
Hello 2
are 1
hello 1
how 1
you 1
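The `orderBy($"numOccurances" desc)` step also has a local analogue: sorting (word, count) pairs by descending count. A hypothetical standalone sketch, using counts assumed to match the table above:

```scala
// Hypothetical local equivalent of the DataFrame sort:
// order (word, count) pairs by descending count.
val counts = Map(
  "world" -> 2, "Hello" -> 2, "are" -> 1,
  "hello" -> 1, "how" -> 1, "you" -> 1)

val mostCommonFirst = counts.toSeq.sortBy { case (_, n) => -n }

// The highest counts come first; the order among ties is unspecified,
// just as the DataFrame sort is only defined up to the sort key.
```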
It is also possible to perform the aggregation in pure Scala, instead of switching to DataFrames. In the following example, we perform the same wordcount, normalizing the case of each word (i.e. grouping "hello" and "Hello" together).



val wordCount = 
  ds
    .flatMap(_.split(" "))
    .filter(_ != "")
    .groupBy(_.toLowerCase()) // Instead of grouping on a column expression (i.e. $"value") we pass a lambda function
    .count()

display(wordCount.toDF())
are 1
hello 3
how 1
world 2
you 1
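Grouping by a key function like `_.toLowerCase` works the same way on local collections; a minimal sketch, with the word list assumed to match the sample file:

```scala
// Standalone sketch: case-insensitive wordcount on a local collection,
// grouping by a key function just like the Dataset groupBy above.
val words = Seq("Hello", "hello", "world", "Hello", "how", "are", "you", "world")

val lowerCounts = words
  .groupBy(_.toLowerCase)                       // key on the lower-cased word
  .map { case (word, ws) => word -> ws.size }   // size of each group

// "Hello" and "hello" now fall into one group of size 3.
```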