Analyzing the Leaked CSDN User Data with SparkSQL [top-n]

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 08:41

Description

The leaked CSDN user data has the following format:

aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
aaaaaaa # bbbbbb # xxxxxx@hotmail.com
___csdn_1aaaaaaa # bbbbbb # xxxxxx@hotmail.com

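Each line contains a username, a password, and an email address, in that order. The fields are separated by " # " (a hash sign with one space on each side).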
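The record format described above can be sketched as a small parsing function in plain Scala. This is only an illustration: `CsdnRecord` and `parseLine` are hypothetical names that do not appear in the original post, and returning `None` for lines that lack three fields is an assumption about how malformed input should be handled.

```scala
// Hypothetical record type for one line of the dump (not from the original post).
case class CsdnRecord(username: String, password: String, email: String)

// Split on the literal " # " delimiter. The limit of 3 means the line is cut
// into at most three pieces, so any further " # " stays inside the last field.
// Lines that do not yield exactly three fields are reported as None
// (an assumption; the original post does not discuss malformed lines).
def parseLine(line: String): Option[CsdnRecord] =
  line.split(" # ", 3) match {
    case Array(u, p, e) => Some(CsdnRecord(u.trim, p.trim, e.trim))
    case _              => None
  }
```

With Spark, a function like this could be applied through `flatMap` on the RDD of lines, so that malformed lines are skipped instead of causing an index error.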

Finding the top-n most commonly used passwords

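// Run in spark-shell (Spark 1.x), where sc and sqlContext are predefined.
// Needed for toDF(); recent spark-shell versions import this automatically.
import sqlContext.implicits._

case class User(username: String, password: String, email: String)

val filePath = "/data/www.csdn.net.sql"
val linesRDD = sc.textFile(filePath)
// The fields are separated by " # ", not by commas.
val partsRDD = linesRDD.map(l => l.split(" # "))
val csdnRDD = partsRDD.map(r => User(username = r(0), password = r(1), email = r(2)))
val csdnDF = csdnRDD.toDF()
csdnDF.printSchema()
csdnDF.count()
// Register the DataFrame as a temporary table so it can be queried with SQL.
csdnDF.registerTempTable("csdn")
// Count the occurrences of each password and keep the 50 most frequent.
val pwdSet = sqlContext.sql("SELECT password, COUNT(password) AS password_cnt FROM csdn GROUP BY password ORDER BY password_cnt DESC LIMIT 50")
pwdSet.map(r => "Password: " + r(0) + " Count: " + r(1)).collect().foreach(println)
// The same grouping through the DataFrame API (unsorted; show() prints the first 20 rows).
csdnDF.groupBy("password").count().show()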
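To make the logic of the GROUP BY / ORDER BY / LIMIT query easier to follow, here is a minimal sketch of the same top-n computation on an ordinary Scala collection, without Spark. `topPasswords` is a hypothetical helper and the sample passwords in the usage note are made up for illustration. Breaking ties alphabetically is an assumption; the SQL query above does not define a tie order.

```scala
// Hypothetical helper: returns the n most frequent passwords with their counts.
def topPasswords(passwords: Seq[String], n: Int): Seq[(String, Int)] =
  passwords
    .groupBy(identity)                                   // GROUP BY password
    .map { case (pwd, occurrences) => (pwd, occurrences.size) } // COUNT(password)
    .toSeq
    .sortBy { case (pwd, count) => (-count, pwd) }       // ORDER BY count DESC, ties by password
    .take(n)                                             // LIMIT n
```

For example, `topPasswords(Seq("123456", "abc", "123456", "abc", "123456", "x"), 2)` yields `Seq(("123456", 3), ("abc", 2))`. This in-memory version only suits small inputs; for the full dump the Spark query is the appropriate tool.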