Pig: WordCount analysis

Source: Internet  Editor: 程序博客网  Time: 2024/05/29 11:30
grunt> cat /opt/dataset/input.txt
keyword1 keyword2
keyword2 keyword4
keyword3 keyword1
keyword4 keyword4

A = LOAD '/opt/dataset/input.txt' using PigStorage('\n') as (line:chararray);
B = foreach A generate TOKENIZE((chararray)$0);
C = foreach B generate flatten($0) as word;
D = group C by word;
E = foreach D generate COUNT(C), group;

dump B;
({(keyword1),(keyword2)})
({(keyword2),(keyword4)})
({(keyword3),(keyword1)})
({(keyword4),(keyword4)})

dump C;
(keyword1)
(keyword2)
(keyword2)
(keyword4)
(keyword3)
(keyword1)
(keyword4)
(keyword4)

dump D;
(keyword1,{(keyword1),(keyword1)})
(keyword2,{(keyword2),(keyword2)})
(keyword3,{(keyword3)})
(keyword4,{(keyword4),(keyword4),(keyword4)})

dump E;
(2,keyword1)
(2,keyword2)
(1,keyword3)
(3,keyword4)

store E into './wordcount';
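To make the shape of each relation easier to follow, the session above can be mimicked in plain Python. This is only a rough model, not how Pig executes the script: bags are represented as lists, tuples as Python tuples, and whitespace splitting stands in for TOKENIZE, which is sufficient for this particular input. The variable names A to E simply mirror the relation names in the script.

```python
from collections import defaultdict

# The same four lines as /opt/dataset/input.txt in the session above.
lines = [
    "keyword1 keyword2",
    "keyword2 keyword4",
    "keyword3 keyword1",
    "keyword4 keyword4",
]

# A: one single-field tuple per input line.
A = [(line,) for line in lines]

# B: TOKENIZE turns each line into a bag (here a list) of one-word tuples.
# Assumption: splitting on whitespace is enough for this input.
B = [([(word,) for word in t[0].split()],) for t in A]

# C: FLATTEN un-nests each bag, giving one tuple per word.
C = [word_tuple for (bag,) in B for word_tuple in bag]

# D: GROUP C BY word collects all tuples that share the same word.
groups = defaultdict(list)
for word_tuple in C:
    groups[word_tuple[0]].append(word_tuple)
D = sorted(groups.items())  # sorted only to make the output order stable

# E: COUNT(C) is the size of each group's bag, paired with the group key.
E = [(len(bag), word) for word, bag in D]

for row in E:
    print(row)
```

The printed rows correspond to the tuples shown by dump E. Pig itself does not guarantee any particular output order for a GROUP unless an ORDER BY is added; the sort here is just for readability.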
TOKENIZE

Splits a string and outputs a bag of words.

Syntax

TOKENIZE(expression)

Terms

expression: An expression with data type chararray.

Usage

Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parentheses (()), and star (*).

Example

In this example the strings in each row are split.

A = LOAD 'data' AS (f1:chararray);

DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);

DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
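The separator rule described above can be approximated in a few lines of Python. This is a sketch based only on the separator list quoted in the text (space, double quote, comma, parentheses, star); the function name tokenize and the list-of-tuples return shape are illustrative choices, not part of Pig.

```python
import re

# Separator characters as listed in the TOKENIZE description:
# space, double quote, comma, parentheses, star.
_SEPARATORS = re.compile(r'[ ",()*]+')


def tokenize(text):
    """Return a 'bag' (list) of one-word tuples, skipping empty pieces."""
    return [(word,) for word in _SEPARATORS.split(text) if word]


print(tokenize("Here is the first string."))
print(tokenize('a,b (c)*"d"'))
```

Note that the period is not in the separator list, so "string." stays a single token, just as in the DUMP X output of the example.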


