微博用户数据分析

来源:互联网 发布:windows回滚工具 编辑:程序博客网 时间:2024/05/24 00:36

一、数据描述

1)数据参数

用户的历史微博数据

截止到20131215

压缩后244MB,解压后878MB

2)数据类型

整个数据是json格式

 

json中字段描述:

beCommentWeiboId  是否评论

beForwardWeiboId 是否是转发微博

catchTime 抓取时间

commentCount 评论次数

content  内容

createTime 创建时间

info1 信息字段1

info2信息字段2

info3信息字段3

mlevel   no sure

musicurl 音乐链接

pic_list   照片列表(可以有多个)

praiseCount  点赞人数

reportCount  转发人数

source    数据来源

userId    用户id

videourl 视频链接

weiboId 微博id

weiboUrl      微博网址

二、实操题目

1. 组织数据(Hive

    创建Hive表weibo(json STRING),表只有一个字段,导入所有数据,并验证查询前3条数据

   1>建表(建库)

       ①创建数据库:create database weibo;

       ②切换数据库:use weibo;

       ③创建外部表:create external tableweibo(json string) row format delimited lines terminated by "\n"stored as textfile location "/exam/weibo";

2>导入数据

       ①上传数据:

       ②解压文件:unzip weibo.zip

       ③上传数据:hdfs dfs -put ~/data/619893/*/exam/weibo/

   3>验证查询前三条数据

       select json from weibo limit 3;

 

2. 统计需求(Hive

(1)统计微博总量和独立用户数

              ①确认是否有脏数据:通过结果很容易看出没有

       select get_json_object(js.json,'$.userId') from (select json from weibo)js where substr(json,1,1)="{";

       ②正常查询:

select "微博总量:",sum(user.cnt),"独立用户总数",count(user.userId)

       from(

           select jj.uid as userId ,count(*) as cnt

           from (

                selectget_json_object(substring(js.json,2),'$.userId')

               as uid

                from (

                    select json from weibo

                    )

                    as js

                ) as jj

                group by jj.uid

       ) as user;

(2)统计用户所有微博被转发的总次数,并输出TOP-3用户

       ①创建一个视图:

       create view userRecord

       as select 

       get_json_object(substring(js.json,2),'$.beCommentWeiboId') asbeCommentWeiboId ,

       get_json_object(substring(js.json,2),'$.beForwardWeiboId') asbeForwardWeiboId ,

       get_json_object(substring(js.json,2),'$.catchTime') as catchTime ,

       get_json_object(substring(js.json,2),'$.commentCount') as commentCount ,

       get_json_object(substring(js.json,2),'$.content') as content,

       get_json_object(substring(js.json,2),'$.createTime') as createTime ,

       get_json_object(substring(js.json,2),'$.info1') as info1 ,

       get_json_object(substring(js.json,2),'$.info2') as info2,

       get_json_object(substring(js.json,2),'$.info3') as info3,

       get_json_object(substring(js.json,2),'$.mlevel') as mlevel,

       get_json_object(substring(js.json,2),'$.musicurl') as musicurl,

       get_json_object(substring(js.json,2),'$.pic_list') as pic_list ,

       get_json_object(substring(js.json,2),'$.praiseCount') as praiseCount,

       get_json_object(substring(js.json,2),'$.reportCount') as reportCount,

       get_json_object(substring(js.json,2),'$.source') as source ,

       get_json_object(substring(js.json,2),'$.userId') as userId ,

       get_json_object(substring(js.json,2),'$.videourl') as videourl ,

       get_json_object(substring(js.json,2),'$.weiboId') as weiboId,

       get_json_object(substring(js.json,2),'$.weiboUrl') as weiboUrl

       from (select json from weibo) js;

       ②执行查询:

       select userId,sum(reportCount) as cnt from userRecord group by userIdorder by cnt DESC limit 3;

 

(3)统计微博被转发最多的前3位用户的id

       执行查询:

       select uu.userId

       from (

           select userId,count(*)

           as cnt

           from userRecord

           where reportCount>0

           group by userId

           order by cnt

           )

           as uu limit 3;;

 

(4)统计每个用户的发送微博总数,并存储到临时表

    创建临时表:

   create table tempory_uid_sum(

       uid string,

       total int

);

    查询并插入数据:

   insert overwrite table tempory_uid_sum select userId,sum(reportCount)from userRecord group by userId;

 

(5)统计带图片的微博数

       执行查询:

       select count(*) from userRecord where length(pic_list) >2;

 

(6)统计使用iphone发微博的独立用户数

       执行查询:

       select count(distinct(userId)) from userRecord where source="iPhone客户端";

 

(7)统计微博中用户ID与数据来源信息,将其放入视图中,然后统计视图中数据来源是“iPad客户端”的用户数目

       创建视图:

       create view view_uid_source

       as

           select userId,source

            from

           userRecord;

       执行查询:

       select count(distinct(userId)) from view_uid_source wheresource="iPad客户端";

3 特殊需求

①往hive中添加jar:add jar/home/hadoop/data/UDF_11.jar

②创建临时函数:create temporary functionAddTwo as "org.zkpk.func.Add";

③创建临时函数:create temporary function WordCount as"org.zkpk.func.QueryWord";

(1)实现Hive UDF完成下面的需求:

    将微博的点赞人数与转发人数相加求和,并将相加之和降序排列,取前10条记录

    ①代码

import org.apache.hadoop.hive.ql.exec.UDF;

    public class Add extends UDF {

        public Integer evaluate(Integerval1,Integer val2)  throws Exception{

            return val1+val2;

        }

    }

           ②查询:

    select b.*

    from (

            select createTime,userId AddTwo(cast(praiseCountas int),cast(reportCount as int))  astotal

            from userRecord

        ) as a ,

        (

            select * from userRecord

        ) as b

        where a.userId=b.userId

        and a.catchTime = b.catchTime

        order by a.total;

(2) 实现Hive UDF完成下面的需求:

    1>微博内容content中的包含某个词的个数,方法返回值是int类型的数值

    ①代码:

import org.apache.hadoop.hive.ql.exec.UDF;

public classQueryWord extends UDF {

         static int counter = 0;

         public int stringNumbers(String str,String se) {

                   if (str.indexOf(se) == -1) {

                            return 0;

                   } else if (str.indexOf(se) !=-1) {

                            counter++;

                            stringNumbers(str.substring(str.indexOf(se)+ se.length()), se);

                            return counter;

                   }

                   return 0;

         }

 

         public Integer evaluate(String val1,String val2) throws Exception {

                   int sum = stringNumbers(val1,val2);

                   Integer in = newInteger(sum);

                   return in;

         }

}

    2>使用该方法统计微博内容中出现“iphone”次数最多的用户,最终结果输出用户ID和次数

    ①查询:

selectuserId,sum(WordCount(content,"iphone")) as cnt from userRecord groupby userId order by cnt;

②查询(这个查询完全是为了方便查看结果):

    selectuserId,sum(WordCount(content,"iphone")) as cnt from userRecord groupby userId order by cnt DESC limit 5;

0 0
原创粉丝点击