Building a Word Count Application with Spark

Source: Internet. Editor: 程序博客网. Date: 2024/05/18 16:35

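About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the black magic of Machine Learning, has a strong interest in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents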

In this post we build a simple word count application.

Creating an RDD

In [1]:

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print type(wordsRDD)

<class 'pyspark.rdd.RDD'>

Pluralizing a word and testing the function

In [2]:

# One way of completing the function
def makePlural(word):
    return word + 's'

print makePlural('cat')

cats

Check whether the result is correct. If it is not, the test reports 'incorrect result: makePlural does not add an s'.

In [3]:

# Make sure to rerun any cell you change before trying the test again
from test_helper import Test
Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')

1 test passed.

Applying the makePlural() function to the RDD

In [4]:

# TODO: Replace <FILL IN> with appropriate code
pluralRDD = wordsRDD.map(makePlural)
print pluralRDD.collect()

['cats', 'elephants', 'rats', 'rats', 'cats']

In [5]:

# TEST Apply makePlural to the base RDD (1c)
Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralRDD')

1 test passed.

Pluralizing the words with a lambda function instead

In [6]:

# TODO: Replace <FILL IN> with appropriate code
pluralLambdaRDD = wordsRDD.map(lambda x: x + 's')
print pluralLambdaRDD.collect()

['cats', 'elephants', 'rats', 'rats', 'cats']

In [7]:

# TEST Pass a lambda function to map (1d)
Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
                  'incorrect values for pluralLambdaRDD (1d)')

1 test passed.

Computing the length of each word

In [8]:

# TODO: Replace <FILL IN> with appropriate code
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print pluralLengths

[4, 9, 4, 4, 4]

In [9]:

# TEST Length of each word (1e)
Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],
                  'incorrect values for pluralLengths')

1 test passed.

Next, create a pair RDD

In [10]:

# TODO: Replace <FILL IN> with appropriate code
wordPairs = wordsRDD.map(lambda x: (x, 1))
print wordPairs.collect()

[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

In [11]:

# TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
                  [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],
                  'incorrect value for wordPairs')

1 test passed.

Next we count how many times each word occurs. There are several ways to do this; the first one below uses groupByKey().

In [12]:

# TODO: Replace <FILL IN> with appropriate code
# Note that groupByKey requires no parameters
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))

rat: [1, 1]
elephant: [1]
cat: [1, 1]
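
To make the grouping step concrete, here is a minimal plain-Python sketch of what groupByKey() computes for this small data set. It does not use Spark at all; `group_by_key` is a hypothetical helper written only for illustration, and it ignores partitioning and shuffling entirely.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Hypothetical plain-Python analogue of groupByKey: collect every value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Same (word, 1) pairs as wordPairs in the notebook.
word_pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
words_grouped = group_by_key(word_pairs)

for key in sorted(words_grouped):
    print('{0}: {1}'.format(key, words_grouped[key]))
```

The sketch prints the keys in sorted order; Spark gives no such ordering guarantee, which is why the notebook's tests sort the collected result before comparing.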

In [13]:

# TEST groupByKey() approach (2a)
Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),
                  [('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],
                  'incorrect value for wordsGrouped')

1 test passed.

In [14]:

# TODO: Replace <FILL IN> with appropriate code
wordCountsGrouped = wordsGrouped.map(lambda (k, v): (k, sum(v)))
print wordCountsGrouped.collect()

[('rat', 2), ('elephant', 1), ('cat', 2)]
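
The `lambda (k, v): ...` form used in this notebook is Python 2 syntax; Python 3 removed tuple parameter unpacking. A minimal sketch of the portable equivalent, run here on a plain list rather than an RDD (the `grouped` list is sample data mirroring the output above):

```python
# Grouped pairs as they look after groupByKey() with the values turned into lists.
grouped = [('rat', [1, 1]), ('elephant', [1]), ('cat', [1, 1])]

# Python 2 only:   lambda (k, v): (k, sum(v))
# Portable form:   take the pair as one argument and index into it.
sum_values = lambda kv: (kv[0], sum(kv[1]))

word_counts_grouped = [sum_values(pair) for pair in grouped]
print(word_counts_grouped)
```

The same one-argument lambda can be passed to an RDD's map() unchanged, so it is a drop-in replacement when porting the notebook.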

In [15]:

# TEST Use groupByKey() to obtain the counts (2b)
Test.assertEquals(sorted(wordCountsGrouped.collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsGrouped')

1 test passed.

Counting the occurrences of each word with reduceByKey()

In [16]:

# TODO: Replace <FILL IN> with appropriate code
# Note that reduceByKey takes in a function that accepts two values and returns a single value
wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)
print wordCounts.collect()

[('rat', 2), ('elephant', 1), ('cat', 2)]
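
For comparison with the grouping approach, the following plain-Python sketch shows the idea behind reduceByKey(): each key keeps a single running result that is folded with the two-argument function, so no per-key list of values is ever built. `reduce_by_key` is a hypothetical helper for illustration, not part of the Spark API.

```python
def reduce_by_key(pairs, func):
    """Hypothetical plain-Python analogue of reduceByKey: fold each key's values with func."""
    reduced = {}
    for key, value in pairs:
        if key in reduced:
            # Combine the running result with the next value for this key.
            reduced[key] = func(reduced[key], value)
        else:
            # First value seen for this key becomes the initial result.
            reduced[key] = value
    return reduced

word_pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
word_counts = reduce_by_key(word_pairs, lambda x, y: x + y)
print(sorted(word_counts.items()))
```

In Spark this property lets values be combined inside each partition before any data is shuffled, which is why reduceByKey() is usually preferred over groupByKey() followed by a sum.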

In [17]:

# TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCounts')

1 test passed.

In [18]:

# Chain the two steps together
wordCountsCollected = (wordsRDD
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print wordCountsCollected

[('rat', 2), ('elephant', 1), ('cat', 2)]

In [19]:

# TEST All together
Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect value for wordCountsCollected')

1 test passed.

Counting the number of distinct words

In [20]:

uniqueWords = wordCounts.count()
print uniqueWords

In [21]:

# TEST Unique words
Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')

1 test passed.

Computing the average number of occurrences per word

In [22]:

from operator import add
totalCount = (wordCounts
              .map(lambda (k, v): v)
              .reduce(lambda x, y: x + y))
average = totalCount / float(wordCounts.count())
print totalCount
print round(average, 2)

5
1.67
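
The arithmetic behind these two numbers can be checked without Spark. This is a minimal sketch using `collections.Counter` on the same five words: the total is 5 occurrences, there are 3 distinct words, and 5 / 3 rounds to 1.67.

```python
from collections import Counter

words = ['cat', 'elephant', 'rat', 'rat', 'cat']
counts = Counter(words)

total_count = sum(counts.values())   # total number of word occurrences
unique_words = len(counts)           # number of distinct words

# float() keeps the division from truncating under Python 2,
# which is also why the notebook cell wraps the count in float().
average = total_count / float(unique_words)

print(total_count)
print(round(average, 2))
```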

In [23]:

# TEST Mean using reduce
Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')

1 test passed.

Next, let's look at a complete word count application on a text file.

First, define a word count function

In [24]:

# TODO: Replace <FILL IN> with appropriate code
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    wordCountsCollected = (wordListRDD
                           .map(lambda x: (x, 1))
                           .reduceByKey(lambda x, y: x + y))
    return wordCountsCollected

print wordCount(wordsRDD).collect()

[('rat', 2), ('elephant', 1), ('cat', 2)]

In [25]:

# TEST wordCount function (4a)
Test.assertEquals(sorted(wordCount(wordsRDD).collect()),
                  [('cat', 2), ('elephant', 1), ('rat', 2)],
                  'incorrect definition for wordCount function')

1 test passed.

Next, remove the punctuation from the text and convert all letters to lowercase

In [26]:

# TODO: Replace <FILL IN> with appropriate code
import re
import string

def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', text).lower().strip()

print removePunctuation('Hi, you!')
print removePunctuation(' No under_score!')

hi you
no underscore
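
The docstring asks that only spaces, letters, and numbers be retained, while the implementation above removes exactly the characters in `string.punctuation`. The two agree on ASCII text like the examples here, but they can differ on other input. As an illustration only, this sketch uses a whitelist pattern instead; `remove_punctuation_strict` is a hypothetical name and is not the author's function.

```python
import re

def remove_punctuation_strict(text):
    """Keep only ASCII letters, digits, and spaces, then lower-case and strip the ends."""
    return re.sub(r'[^A-Za-z0-9 ]', '', text).lower().strip()

print(remove_punctuation_strict('Hi, you!'))
print(remove_punctuation_strict(' No under_score!'))
print(remove_punctuation_strict(" The Elephant's 4 cats. "))
```

Unlike the blacklist version, this one also drops tabs and any non-ASCII characters, so which variant is appropriate depends on the text being cleaned.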

In [27]:

# TEST Capitalization and punctuation
Test.assertEquals(removePunctuation(" The Elephant's 4 cats. "),
                  'the elephants 4 cats',
                  'incorrect definition for removePunctuation function')

1 test passed.

In [28]:

import os.path
fileName = os.path.join('/Users/youwei.tan/Downloads', 'shakespeare.txt')

shakespeareRDD = (sc
                  .textFile(fileName, 8)
                  .map(removePunctuation))
print '\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda (l, num): '{0}: {1}'.format(num, l))  # to 'lineNum: line'
                .take(15))

0: the project gutenberg ebook of the complete works of william shakespeare by
1: william shakespeare
2: 
3: this ebook is for the use of anyone anywhere at no cost and with
4: almost no restrictions whatsoever  you may copy it give it away or
5: reuse it under the terms of the project gutenberg license included
6: with this ebook or online at wwwgutenbergorg
7: 
8: this is a copyrighted project gutenberg ebook details below
9: please follow the copyright guidelines in this file
10: 
11: title the complete works of william shakespeare
12: 
13: author william shakespeare
14:
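
The numbering in this output comes from zipWithIndex(), which pairs each element with its position as (element, index), the reverse of the order Python's `enumerate` uses. A minimal plain-Python sketch of the same formatting step, on three made-up sample lines:

```python
# Sample lines for illustration; the third one is empty, like several lines in the file.
lines = ['the complete works', 'william shakespeare', '']

# Mirror zipWithIndex(): (line, lineNum) rather than enumerate's (lineNum, line).
zipped = [(line, num) for num, line in enumerate(lines)]

# Mirror the map step: turn each pair into 'lineNum: line'.
formatted = ['{0}: {1}'.format(num, line) for line, num in zipped]
print('\n'.join(formatted))
```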

Before using the wordCount() function, two tasks need to be completed: 1. split each line into words on whitespace; 2. filter out the empty lines.
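
How the split is written determines whether a separate filtering step is needed. The sketch below, on three made-up sample lines, shows that `str.split()` with no arguments discards empty strings, whereas `split(' ')` keeps one for an empty line and one for each doubled space.

```python
# Sample lines: one normal, one empty, one containing a double space.
lines = ['the project gutenberg ebook',
         '',
         'almost no restrictions whatsoever  you may copy']

# split() with no arguments splits on runs of whitespace and drops empty strings.
words_default = [w for line in lines for w in line.split()]

# split(' ') splits on every single space and keeps the empty strings.
words_on_space = [w for line in lines for w in line.split(' ')]

print(len(words_default))        # 11
print(len(words_on_space))       # 13
print(words_on_space.count(''))  # 2
```

With the no-argument form, both tasks are handled in a single flatMap call, so no explicit filter is required afterwards.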

In [29]:

shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split())
shakespeareWordCount = shakespeareWordsRDD.count()
print shakespeareWordsRDD.top(5)
print shakespeareWordCount

[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds']
903705

In [30]:

# TEST Words from lines
# This test allows for leading spaces to be removed either before or after
# punctuation is removed.
Test.assertTrue(shakespeareWordCount == 903705 or shakespeareWordCount == 928908,
                'incorrect value for shakespeareWordCount')
Test.assertEquals(shakespeareWordsRDD.top(5),
                  [u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],
                  'incorrect value for shakespeareWordsRDD')

1 test passed.
1 test passed.

In [32]:

shakeWordsRDD = shakespeareWordsRDD  # already removed
shakeWordCount = shakeWordsRDD.count()
print shakeWordCount

In [33]:

# TEST Remove empty elements
Test.assertEquals(shakeWordCount, 903705, 'incorrect value for shakeWordCount')

1 test passed.

Another way to split on whitespace and filter out the empty lines

In [35]:

shakespeareRDD.map(lambda x: x.split()).filter(lambda x: len(x)>0).flatMap(lambda x:x).take(10)

Out[35]:

[u'the', u'project', u'gutenberg', u'ebook', u'of', u'the', u'complete', u'works', u'of', u'william']
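
The map / filter / flatMap chain above can be mirrored in plain Python to see what each stage contributes. This is a sketch on made-up sample lines, using `itertools.chain.from_iterable` as a stand-in for `flatMap(lambda x: x)`.

```python
from itertools import chain

lines = ['the project gutenberg ebook', '', 'william shakespeare']

split_lines = [line.split() for line in lines]                    # map: line -> list of words
non_empty = [words for words in split_lines if len(words) > 0]    # filter: drop empty lists
flattened = list(chain.from_iterable(non_empty))                  # flatMap(lambda x: x)

# Flattening an empty list contributes nothing, so skipping the filter gives the same words.
direct = list(chain.from_iterable(line.split() for line in lines))

print(flattened)
print(flattened == direct)
```

This is why the single `flatMap(lambda x: x.split())` used earlier already produces a word list with no empty elements.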

Next, count the words

In [36]:

# TODO: Replace <FILL IN> with appropriate code
top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, key=lambda (k, v): -v)
print '\n'.join(map(lambda (w, c): '{0}: {1}'.format(w, c), top15WordsAndCounts))

the: 27825
and: 26791
i: 20681
to: 19261
of: 18289
a: 14667
you: 13716
my: 12481
that: 11135
in: 11027
is: 9621
not: 8745
for: 8261
with: 8046
me: 7769
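
takeOrdered() returns the smallest elements under the given key, so negating the count turns it into a "largest counts first" query. A minimal plain-Python sketch of the same idea with `heapq.nsmallest`; the `word_counts` list is made-up sample data with distinct counts, not output from the Shakespeare file.

```python
import heapq

# Hypothetical (word, count) pairs for illustration.
word_counts = [('rat', 2), ('elephant', 1), ('cat', 3), ('dog', 5)]

# Smallest by negated count == largest by count.
top2 = heapq.nsmallest(2, word_counts, key=lambda kv: -kv[1])

print('\n'.join('{0}: {1}'.format(w, c) for w, c in top2))
```

The key function takes the pair as a single argument and indexes into it, which also works under Python 3, where the `lambda (k, v)` form used in the notebook is not accepted.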

1 0