Pig数据分析引擎 二 使用PigLatin语句分析数据

来源:互联网 发布:js判断对象{}是否为空 编辑:程序博客网 时间:2024/05/16 02:30

配置好环境

启动hadoop
--启动hadoopstart-all.sh--查看是否处于安全模式hdfs dfsadmin -safemode get--启动historyServer 记录了所有的mapreduce程序的历史信息[root@linux111 ~]# mr-jobhistory-daemon.sh start historyserver
进入pig命令提示符
[root@linux111 ~]# piggrunt>
查看准备好的数据  hadoop input目录下的 emp.csv的文件
grunt> cat /input/emp.csv7369,SMITH,CLERK,7902,1980/12/17,800,,207499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,307521,WARD,SALESMAN,7698,1981/2/22,1250,500,307566,JONES,MANAGER,7839,1981/4/2,2975,,207654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,307698,BLAKE,MANAGER,7839,1981/5/1,2850,,307782,CLARK,MANAGER,7839,1981/6/9,2450,,107788,SCOTT,ANALYST,7566,1987/4/19,3000,,207839,KING,PRESIDENT,,1981/11/17,5000,,107844,TURNER,SALESMAN,7698,1981/9/8,1500,0,307876,ADAMS,CLERK,7788,1987/5/23,1100,,207900,JAMES,CLERK,7698,1981/12/3,950,,307902,FORD,ANALYST,7566,1981/12/3,3000,,207934,MILLER,CLERK,7782,1982/1/23,1300,,10
将emp.csv的文件加载为一个emp表(bag)
emp = load '/input/emp.csv';
查看emp表的结构
grunt> describe emp;Schema for emp unknown.
如上所示 显示我们的emp表中的结构为空,因为我们直接是在加载外部csv的文件,在创建表的时候并没有指定结构将emp.csv文件加载成一个表emp1(bag),并且指定tuple的schema
grunt> emp1 = load '/input/emp.csv' as (empno,ename,job,mgr,hiredate,sal,comm,deptno);
再次查看表结构
grunt> describe emp1;emp1: {empno: bytearray,ename: bytearray,job: bytearray,mgr: bytearray,hiredate: bytearray,sal: bytearray,comm: bytearray,deptno: bytearray}
这个时候 就显示出来的emp1表的表结构,注意:如果在创建bag的时候 没有指定列的数据类型  默认的数据类型是bytearray(字节数组)查看emp1表的数据
//这里会触发一个计算,转换为mapreduce程序,在yarn容器记录中可以查看dump emp1;(7369,SMITH,CLERK,7902,1980/12/17,800,,20,,,,,,,)(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30,,,,,,,)(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30,,,,,,,)(7566,JONES,MANAGER,7839,1981/4/2,2975,,20,,,,,,,)(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30,,,,,,,)(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30,,,,,,,)(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10,,,,,,,)(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20,,,,,,,)(7839,KING,PRESIDENT,,1981/11/17,5000,,10,,,,,,,)(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30,,,,,,,)(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20,,,,,,,)(7900,JAMES,CLERK,7698,1981/12/3,950,,30,,,,,,,)(7902,FORD,ANALYST,7566,1981/12/3,3000,,20,,,,,,,)(7934,MILLER,CLERK,7782,1982/1/23,1300,,10,,,,,,,)
将emp.csv文件加载成一个表emp2(bag),并且指定tuple的schema,并指定每列的数据类型
grunt> emp2 = load '/input/emp.csv' as (empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
再次查看emp2表的结构
grunt> describe emp2;emp2: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}
然后使用dump命令 查看数据
dump emp2;(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)(,,,,,,,)
可以发现 上方emp2表中并没有数据,pig中在使用了数据类型之后,默认的分隔符是 tab 键,在tab键的情况下,数据自然查看不出指定分隔符:使用pig的内置函数 PigStorage
emp2 = load '/input/emp.csv' using PigStorage (',') as (empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
pig在遇到表相同的情况 会自动覆盖以前的表再次使用dump命令查看表内容
dump emp2;(7369,SMITH,CLERK,7902,1980/12/17,800,,20)(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)(7566,JONES,MANAGER,7839,1981/4/2,2975,,20)(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)(7839,KING,PRESIDENT,,1981/11/17,5000,,10)(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)(7900,JAMES,CLERK,7698,1981/12/3,950,,30)(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)(7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
创建部门表查看文件在本地的位置
grunt> sh ls ~/inputdata.txtdept.csvemp.csv
使用pig的  copyFromLocal 将文件从linux 上赋值到 hadoop 上
grunt> copyFromLocal /root/input/dept.csv /input
查看部门表中有哪些数据
grunt> cat /input/dept.csv10,ACCOUNTING,NEW YORK20,RESEARCH,DALLAS30,SALES,CHICAGO40,OPERATIONS,BOSTON
在pig上创建 dept bag
grunt> dept = load '/input/dept.csv' using PigStorage (',') as (deptno:int,dname:chararray,loc:chararray);

在Pig上进行查询操作

查询员工信息, 要求查询的列,员工号,员工姓名,薪水
SQL:select empno,ename,sal from emp2;grunt> emp4 = foreach emp2 generate empno,ename,sal;
查询员工信息, 按sal排序
SQL:select * from emp2 order by sal;PigLatin: emp5 = order emp2 by sal;
分组:查询每个部门工资的最大值   略微麻烦
SQL:select deptno,max(sal) from emp2 group by deptno;PigLatin:需要分为两步操作//先将emp2 表进行分组 按部门号 ----> emp61emp61 = group emp2 by deptno;//查看emp61表的结构grunt> describe emp61;emp61: {group: int,emp2: {(empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}//查看表的数据(10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),                         (7839,KING,PRESIDENT,,1981/11/17,5000,,10),                         (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})                    (20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),                         (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),                         (7369,SMITH,CLERK,7902,1980/12/17,800,,20),                         (7566,JONES,MANAGER,7839,1981/4/2,2975,,20),                         (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})                    (30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),                         (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),                         (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),                         (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),                         (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),                         (7900,JAMES,CLERK,7698,1981/12/3,950,,30)})//在emp61的表上在进行操作emp62 = foreach emp61 generate group,MAX(emp2.sal);//最后查看emp62表中最后的结果(10,5000)(20,3000)(30,2850)
查询10部门员工的信息
SQL:select * from emp2 where deptno = 10;PigLatin:emp7 = filter emp2 by deptno == 10;//注意此处为==
多表查询:查询员工信息: 员工姓名  部门名称
SQL:   select e.ename,d.dname from emp2 e,dept d where e.deptno=d.deptno;PL:    emp81 = join dept by deptno,emp2 by deptno;       emp82 = foreach emp81 generate dept::dname,emp2::ename
集合运算:并集  交集  差集查询10和20号部门的员工信息
    SQL: select * from emp2 where deptno=10         union         select * from emp2 where deptno=20;
问题:在SQL中,是否任意的集合都可以进行集合运算?(做集合运算,对集合有要求吗?)         有要求的:参与运算的各个集合必须列数相同且类型一致
    PL: emp10 = filter emp2 by deptno==10;        emp20 = filter emp2 by deptno==20;        执行集合运算: emp10_20 = union emp10,emp20;
原创粉丝点击