Running HQL batch jobs in parallel with Python on a big data platform

This is a simple example; with a few small changes it can be adapted to similar needs.
Any base-data preparation has to be done ahead of time; this example focuses only on running the HQL batch in parallel.
1. The business-processing script
/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/business_proc.py
# -*- coding=utf-8 -*-
import os
import warnings

warnings.filterwarnings("ignore")


def remainAfterDay(premiere_date, after_days, roomid):
    # Build and run the Hive query for one (roomid, premiere_date, after_days)
    # combination; the result is redirected into a per-parameter text file.
    os.system("""source /etc/profile; \
            /usr/lib/hive-current/bin/hive -e " \
            add jar /home/hadoop/nisj/udf-jar/hadoop_udf_radixChange.jar; \
            create temporary function RadixChange as 'com.kascend.hadoop.RadixChange'; \
            with tab_uid_vist_timerange as( \
            select distinct RadixChange(lower(uid),16,10) uid from  \
            bi_all_access_log \
            where pt_day between date_add('{premiere_date}',{after_days}) and date_add('{premiere_date}',{after_days}+30)) \
            select a2.roomid,count(*) cnt \
            from tab_uid_vist_timerange a1 \
            inner join (select roomid,uid from xx_uid_list where roomid={roomid}) a2 on a1.uid=a2.uid \
            group by a2.roomid;" \
            > /home/hadoop/nisj/parallelBatchOnBigData-forHql/calcResult/remainAfterDay_{roomid}_{premiere_date}_{after_days}.txt \
            """.format(premiere_date=premiere_date, after_days=after_days, roomid=roomid))


def businessProc(premiere_date, after_days, roomid):
    remainAfterDay(premiere_date, after_days, roomid)


# Batch Test
# premiere_date = '2016-11-01'
# after_days = 7
# roomid = 54000
# businessProc(premiere_date, after_days, roomid)
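
One weakness of shelling out with os.system is that a failed Hive query goes unnoticed. Below is a minimal sketch, not part of the original scripts (run_checked is a hypothetical helper), of checking the exit status so failures show up in the batch log:

# -*- coding=utf-8 -*-
# Hedged sketch: run a shell command and report a non-zero exit status,
# so a failed Hive query in the batch does not fail silently.
import os

def run_checked(shell_cmd, tag):
    ret = os.system(shell_cmd)
    if ret != 0:
        print("[%s] hive command exited with status %s" % (tag, ret))
    return ret

# Usage idea: inside remainAfterDay, pass the same command string to
#   run_checked(cmd, "remainAfterDay_%s_%s_%s" % (roomid, premiere_date, after_days))
# instead of calling os.system(cmd) directly.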

2. Parallel scheduling
/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/BatchThread.py
# -*- coding=utf-8 -*-
import threadpool, time
import warnings
from business_proc import *

warnings.filterwarnings("ignore")

now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print "Current time:", now_time

# Each entry is (positional args for businessProc, keyword args):
# ([premiere_date, after_days, roomid], None)
parList = [
    (['2017-02-17', 7, 87503], None),
    (['2017-02-17', 15, 87503], None),
    (['2017-02-17', 30, 87503], None),
    (['2017-02-17', 60, 87503], None),
    (['2017-02-17', 90, 87503], None),
    (['2017-02-17', 120, 87503], None),
    (['2016-11-01', 7, 54000], None),
    (['2016-11-01', 15, 54000], None),
    (['2016-11-01', 30, 54000], None),
    (['2016-11-01', 60, 54000], None),
    (['2016-11-01', 90, 54000], None),
    (['2016-11-01', 120, 54000], None)]

requests = []
request_businessProc = threadpool.makeRequests(businessProc, parList)
requests.extend(request_businessProc)

main_pool = threadpool.ThreadPool(8)
[main_pool.putRequest(req) for req in requests]

if __name__ == '__main__':
    while True:
        try:
            time.sleep(30)
            main_pool.poll()
        except KeyboardInterrupt:
            print("**** Interrupted!")
            break
        except threadpool.NoResultsPending:
            break
    if main_pool.dismissedWorkers:
        print("Joining all dismissed worker threads...")
        main_pool.joinAllDismissedWorkers()

now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print "Current time:", now_time
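
The scheduler depends on the third-party threadpool module that ships alongside the scripts (see threadpool.py in the directory listing below). If that module is not available, the same fan-out can be done with the standard library; the following is an alternative sketch using multiprocessing.dummy, not the original author's approach:

# -*- coding=utf-8 -*-
# Alternative sketch: the same 8-way fan-out with the standard-library
# multiprocessing.dummy thread pool instead of the bundled threadpool module.
from multiprocessing.dummy import Pool as ThreadPool
from business_proc import businessProc

# Same (args, kwargs) tuple shape as parList in BatchThread.py (abridged here).
parList = [
    (['2017-02-17', 7, 87503], None),
    (['2016-11-01', 7, 54000], None),
]

def run_one(item):
    args, _kwargs = item
    return businessProc(*args)

if __name__ == '__main__':
    pool = ThreadPool(8)        # 8 concurrent Hive sessions, as in the original
    pool.map(run_one, parList)  # blocks until every businessProc call returns
    pool.close()
    pool.join()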

3. Notes
Viewing the result data (grep prints each matching row together with the file it came from):
find ./ -name "remainAfterDay*" -type f |xargs grep "87503"
find ~/ -name "remainAfterDay*" -type f |xargs grep "87503"  
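
Beyond grep, a short script can fold every result file into one summary. This is a sketch under the assumption that each file holds tab-separated roomid and cnt columns (the usual shape of hive -e output redirected to a file); adjust the split if the delimiter differs:

# -*- coding=utf-8 -*-
# Hedged sketch: collect all remainAfterDay_*.txt results into one summary.
# Assumes tab-separated "roomid<TAB>cnt" rows per file; adjust if needed.
import glob
import os

result_dir = '/home/hadoop/nisj/parallelBatchOnBigData-forHql/calcResult'

for path in sorted(glob.glob(os.path.join(result_dir, 'remainAfterDay_*.txt'))):
    with open(path) as f:
        for line in f:
            fields = line.strip().split('\t')
            if len(fields) == 2:
                print("%s\troomid=%s\tcnt=%s" % (os.path.basename(path), fields[0], fields[1]))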

Preparing the parallel parameters:
The parameter list can be generated in code as needed (see the sketch below) or filled in by hand; note that the project directory also contains the thread-pool module itself (/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/threadpool.py), which BatchThread.py imports.
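
A minimal sketch of generating parList in code rather than typing it out, using the same (premiere_date, roomid) pairs and after_days steps as in BatchThread.py:

# -*- coding=utf-8 -*-
# Sketch: build parList from the batch parameters instead of writing it by hand.
batch_rooms = [('2017-02-17', 87503), ('2016-11-01', 54000)]
after_days_steps = [7, 15, 30, 60, 90, 120]

parList = [([premiere_date, after_days, roomid], None)
           for premiere_date, roomid in batch_rooms
           for after_days in after_days_steps]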

Preview of the production directory:
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$ pwd
/home/hadoop/nisj/parallelBatchOnBigData-forHql
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$ ls -R /home/hadoop/nisj/parallelBatchOnBigData-forHql |more
/home/hadoop/nisj/parallelBatchOnBigData-forHql:
BatchThread.py
business_proc.py
calcResult
threadpool.py

/home/hadoop/nisj/parallelBatchOnBigData-forHql/calcResult:
remainAfterDay_54000_2016-11-01_120.txt
remainAfterDay_54000_2016-11-01_15.txt
remainAfterDay_54000_2016-11-01_30.txt
remainAfterDay_54000_2016-11-01_60.txt
remainAfterDay_54000_2016-11-01_7.txt
remainAfterDay_54000_2016-11-01_90.txt
remainAfterDay_87503_2017-02-17_120.txt
remainAfterDay_87503_2017-02-17_15.txt
remainAfterDay_87503_2017-02-17_30.txt
remainAfterDay_87503_2017-02-17_60.txt
remainAfterDay_87503_2017-02-17_7.txt
remainAfterDay_87503_2017-02-17_90.txt
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$