使用pig的注意事项
来源:互联网 发布:php 面试经历 编辑:程序博客网 时间:2024/04/29 13:00
使用pig的注意事项
1.pig中dump和store的区别:
dump和store的作用都是将pig处理的结果进行输出,前者是打印结果,后者是将结果输出至hdfs文件系统中。
(1).Dump表示将数据打印出来,输出所有的元祖,会带上”( )”。
Eg:(zhangsan,22,男)
(2).Store表示将数据输出至hdfs,输出时不会带上”( )”。
Eg:zhangsan22,男
2.pig中组函数的使用:
在关系型数据库的标准sql中,可以不分组而直接使用组函数(比如SUM),这是因为sql引擎会把该表的所有记录当作一个大组对待。但pig中组函数在使用前必须先显示地进行分组才行。
eg:有一个test表,里面有字段name,count,现在要统计test表中的count字段的和,那么可以通过下面简单的sql来做:
select SUM(count) as sum;
如果在pig中那就得先手动的分组了,如何手动分组?只要给每个元祖加上一条内容相同的字段,然后按照该字段进行分组,然后统计分组内的count之和。过程如下:
data = load 'test.csv' as(name:chararray,count:int);
tmp01 = foreach data generate 'pigIsVeryGood' as id,name,count;
tmp02 = group tmp01 by id;
result = foreach tmp02 generate SUM($1.count) as sum;
dump result;
如果要实现输出如下形式的结果:name,count,sum,那么如何做?用sql比较简单:select name,count,sum(count)。
如果用pig,如何实现?思路是在上例的基础上在result上面再加一个字段id,内容为"pigIsVeryGood",过程如下:
data = load 'test.csv' as(name:chararray,count:int);
tmp01 = foreach data generate 'pigIsVeryGood' as id,name,count;
tmp02 = group tmp01 by id;
result = foreach tmp02 generate SUM($1.count) as sum;
dump result;
tmp03 = foreach resultgenerate 'pigIsVeryGood' as id,sum;
tmp04 = join tmp01 by id,tmp03 by id;
finalResult= foreach tmp04 generate name,count,sum;
dump finalResult;
3.Filter使用
是把符合条件的保留下,非符合条件的过滤掉
eg:info = Filter studentsInfo BY (age > 18)
把大于18岁的保留下,小于或等于18岁的过滤掉
4.引用运算符(Dereference Operators.)与解歧义运算符(Disambiguate Operator个人感觉还是叫“域运算符”吧 ::)的区别及使用说明:
(1).引用运算符(Dereference Operators.)
Dereference Operators
Description
Operator
Symbol
Notes
tuple dereference
tuple.id or tuple.(id,…)
Tuple dereferencing can be done by name (tuple.field_name) or position (mytuple.$0). If a set of fields are dereferenced (tuple.(name1, name2) or tuple.($0, $1)), the expression represents a tuple composed of the specified fields. Note that if the dot operator is applied to a bytearray, the bytearray will be assumed to be a tuple.
bag dereference
bag.id or bag.(id,…)
Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.
map dereference
map#'key'
Map dereferencing must be done by key (field_name#key or $0#key). If the pound operator is applied to a bytearray, the bytearray is assumed to be a map. If the key does not exist, the empty string is returned.
Examples
Tuple Example
Suppose we have relation A.
LOAD 'data' as (f1:int, f2:tuple(t1:int,t2:int,t3:int));DUMP A;(1,(1,2,3))(2,(4,5,6))(3,(7,8,9))(4,(1,4,7))(5,(2,5,8))
In this example dereferencing is used to retrieve two fields from tuple f2.
X = FOREACH A GENERATE f2.t1,f2.t3;DUMP X;(1,3)(4,6)(7,9)(1,7)(2,8)
Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int,f3:int);DUMP A;(1,2,3)(4,2,1)(8,3,4)(4,3,3)(7,2,5)(8,4,3)B = GROUP A BY f1;DUMP B;(1,{(1,2,3)})(4,{(4,2,1),(4,3,3)})(7,{(7,2,5)})(8,{(8,3,4),(8,4,3)})ILLUSTRATE B;etc …----------------------------------------------------------| b | group: int | a: bag({f1: int,f2: int,f3: int}) |----------------------------------------------------------
In this example dereferencing is used with relation X to project the first field (f1) of each tuple in the bag (a).
X = FOREACH B GENERATE a.f1;DUMP X;({(1)})({(4),(4)})({(7)})({(8),(8)})
Tuple/Bag Example
Suppose we have relation B, formed by grouping relation A (see the GROUP operator for information about the field names in relation B).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);DUMP A;(1,2,3)(4,2,1)(8,3,4)(4,3,3)(7,2,5)(8,4,3)B = GROUP A BY (f1,f2);DUMP B;((1,2),{(1,2,3)})((4,2),{(4,2,1)})((4,3),{(4,3,3)})((7,2),{(7,2,5)})((8,3),{(8,3,4)})((8,4),{(8,4,3)})ILLUSTRATE B;etc …-------------------------------------------------------------------------------| b | group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |-------------------------------------------------------------------------------| | (8, 3) | {(8, 3, 4), (8, 3, 4)} |-------------------------------------------------------------------------------
In this example dereferencing is used to project a field (f1) from a tuple (group) and a field (f1) from a bag (a).
X = FOREACH B GENERATE group.f1, a.f1;DUMP X;(1,{(1)})(4,{(4)})(4,{(4)})(7,{(7)})(8,{(8)})(8,{(8)})
Map Example
Suppose we have relation A.
A = LOAD 'data' AS (f1:int, f2:map[]);DUMP A;(1,[open#apache])(2,[apache#hadoop])(3,[hadoop#pig])(4,[pig#grunt])
In this example dereferencing is used to look up the value of key 'open'.
X = FOREACH A GENERATE f2#'open';DUMP X;(apache)()()()(2).解歧义运算符(Disambiguate Operator个人感觉还是叫“域运算符”吧 ::
Disambiguate Operator
Use the disambiguate operator ( :: ) to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators.
In this example, to disambiguate y, use A::y or B::y. In cases where there is no ambiguity, such as z, the :: is not necessary but is still supported.
A = load 'data1' as (x, y);B = load 'data2' as (x, y, z);C = join A by x, B by x;D = foreach C generate A::x,A::y,B::x,B::y,B::z;
未完待续.......
- 使用pig的注意事项
- pig使用的一些注意事项
- pig的简单使用
- Pig的使用
- pig的安装和使用
- pig中cogroup的使用
- Pig的简单使用介绍
- 先占坑 Pig的使用
- Pig的安装配置与基本使用
- 关于pig使用的部分记录
- Pig脚本书写时候注意事项
- 【pig】pig的注释格式
- Hadoop 2.2.0下Pig安装及Pig的基础使用
- 【Pig】ubuntu下pig-0.17.0的安装和基本安装使用
- 【hadoop pig】pig安装及使用
- pig安装与使用
- pig使用记录
- Pig 安装和使用
- offline installing deb for ubuntu and deb management
- 计算机科学研究方向介绍
- Oracle11g 备份恢复总览
- 厦门理工第五届程序设计大赛 总结稿
- VIM 奇技
- 使用pig的注意事项
- C++语言的15个晦涩特性
- js实现全选与全不选
- 为旧浏览器添加新特性
- 指针与const
- 使用FTP4J上传下载删除文件及文件夹
- VI 查找替换
- php获取文件创建时间/修改时间
- jquery例子 点击文本消失