Running HQL Batch Jobs in Parallel with Python on a Big Data Platform
This is a simple example; with minor changes it can be adapted to similar needs.
Any base-data preparation must be done in advance; this example focuses only on running the HQL batch jobs in parallel.
1. The business-processing script
/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/business_proc.py
# -*- coding=utf-8 -*-
import os
import warnings

warnings.filterwarnings("ignore")


def remainAfterDay(premiere_date, after_days, roomid):
    # Run one retention query via the Hive CLI and redirect the result
    # to a per-parameter-set text file under calcResult/.
    os.system("""source /etc/profile; \
/usr/lib/hive-current/bin/hive -e " \
add jar /home/hadoop/nisj/udf-jar/hadoop_udf_radixChange.jar; \
create temporary function RadixChange as 'com.kascend.hadoop.RadixChange'; \
with tab_uid_vist_timerange as( \
select distinct RadixChange(lower(uid),16,10) uid from \
bi_all_access_log \
where pt_day between date_add('{premiere_date}',{after_days}) and date_add('{premiere_date}',{after_days}+30)) \
select a2.roomid,count(*) cnt \
from tab_uid_vist_timerange a1 \
inner join (select roomid,uid from xx_uid_list where roomid={roomid}) a2 on a1.uid=a2.uid \
group by a2.roomid;" \
> /home/hadoop/nisj/parallelBatchOnBigData-forHql/calcResult/remainAfterDay_{roomid}_{premiere_date}_{after_days}.txt \
""".format(premiere_date=premiere_date, after_days=after_days, roomid=roomid))


def businessProc(premiere_date, after_days, roomid):
    remainAfterDay(premiere_date, after_days, roomid)

# Batch Test
# premiere_date = '2016-11-01'
# after_days = 7
# roomid = 54000
# businessProc(premiere_date, after_days, roomid)
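Both the HQL text and the result-file path are built with str.format from the three parameters; as a minimal sketch, this is how one parameter set expands into the file name that later shows up under calcResult/:

```python
# Sketch: the result-file path template used in remainAfterDay,
# expanded for one parameter set from the batch.
result_template = ("/home/hadoop/nisj/parallelBatchOnBigData-forHql/"
                   "calcResult/remainAfterDay_{roomid}_{premiere_date}_{after_days}.txt")

# Keyword arguments may be passed in any order; only the names matter.
path = result_template.format(premiere_date='2016-11-01', after_days=7, roomid=54000)
print(path)
```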
2. The parallel scheduler
/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/BatchThread.py
# -*- coding=utf-8 -*-
import time
import warnings
import threadpool
from business_proc import *

warnings.filterwarnings("ignore")

now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print("Current time: " + now_time)

# Each entry is ([premiere_date, after_days, roomid], None),
# the argument-list form expected by threadpool.makeRequests.
parList = [
    (['2017-02-17', 7, 87503], None),
    (['2017-02-17', 15, 87503], None),
    (['2017-02-17', 30, 87503], None),
    (['2017-02-17', 60, 87503], None),
    (['2017-02-17', 90, 87503], None),
    (['2017-02-17', 120, 87503], None),
    (['2016-11-01', 7, 54000], None),
    (['2016-11-01', 15, 54000], None),
    (['2016-11-01', 30, 54000], None),
    (['2016-11-01', 60, 54000], None),
    (['2016-11-01', 90, 54000], None),
    (['2016-11-01', 120, 54000], None)]

requests = []
request_businessProc = threadpool.makeRequests(businessProc, parList)
requests.extend(request_businessProc)

# Run at most 8 Hive jobs concurrently.
main_pool = threadpool.ThreadPool(8)
for req in requests:
    main_pool.putRequest(req)

if __name__ == '__main__':
    while True:
        try:
            time.sleep(30)
            main_pool.poll()
        except KeyboardInterrupt:
            print("**** Interrupted!")
            break
        except threadpool.NoResultsPending:
            break
    if main_pool.dismissedWorkers:
        print("Joining all dismissed worker threads...")
        main_pool.joinAllDismissedWorkers()

now_time = time.strftime('%Y-%m-%d %X', time.localtime())
print("Current time: " + now_time)
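The third-party threadpool module can also be replaced with the standard library's concurrent.futures, which avoids the polling loop entirely. A minimal sketch of the same fan-out pattern, where do_work is a stand-in for businessProc (which this sketch does not import):

```python
# Sketch: the same 8-worker fan-out with the stdlib concurrent.futures
# instead of the third-party threadpool module.
from concurrent.futures import ThreadPoolExecutor, as_completed

def do_work(premiere_date, after_days, roomid):
    # Placeholder for the real Hive call in businessProc.
    return (roomid, premiere_date, after_days)

par_list = [
    ('2017-02-17', 7, 87503),
    ('2017-02-17', 15, 87503),
    ('2016-11-01', 7, 54000),
]

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(do_work, *args) for args in par_list]
    # as_completed yields futures as each job finishes, in any order.
    for fut in as_completed(futures):
        results.append(fut.result())
```

The `with` block joins all workers on exit, so no poll/sleep loop or KeyboardInterrupt handling is needed to wait for completion.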
3. Notes
Inspecting the result data (this shows both the data contents and the files they came from):
find ./ -name "remainAfterDay*" -type f |xargs grep "87503"
find ~/ -name "remainAfterDay*" -type f |xargs grep "87503"
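Because the result filenames encode the parameters, they can also be parsed back in Python instead of grepping. A minimal sketch (the helper name parse_result_name is invented for illustration):

```python
# Sketch: recover (roomid, premiere_date, after_days) from a result
# filename of the form remainAfterDay_<roomid>_<date>_<days>.txt.
import re

def parse_result_name(filename):
    m = re.match(r"remainAfterDay_(\d+)_(\d{4}-\d{2}-\d{2})_(\d+)\.txt$", filename)
    if m is None:
        return None  # not a result file
    roomid, premiere_date, after_days = m.groups()
    return int(roomid), premiere_date, int(after_days)
```

Combined with os.listdir on calcResult/, this lets a follow-up script collect all results for one roomid without shelling out to find and grep.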
Preparing the parallel parameters:
They can be generated in code as needed, or filled in by hand. The directory also holds the thread-pool module that drives the parallel runs (/Users/nisj/PycharmProjects/BiDataProc/parallelBatchOnBigData-forHql/threadpool.py).
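Since the batch above is the cross product of two (premiere_date, roomid) pairs and six retention windows, generating parList in code is straightforward. A minimal sketch with itertools.product:

```python
# Sketch: build the parList expected by threadpool.makeRequests from
# the cross product of (premiere_date, roomid) pairs and day windows.
from itertools import product

batches = [('2017-02-17', 87503), ('2016-11-01', 54000)]
day_windows = [7, 15, 30, 60, 90, 120]

par_list = [([premiere_date, after_days, roomid], None)
            for (premiere_date, roomid), after_days in product(batches, day_windows)]
```

Adding a new room or a new retention window is then a one-line change to batches or day_windows.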
Preview of the production paths:
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$ pwd
/home/hadoop/nisj/parallelBatchOnBigData-forHql
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$ ls -R /home/hadoop/nisj/parallelBatchOnBigData-forHql |more
/home/hadoop/nisj/parallelBatchOnBigData-forHql:
BatchThread.py
business_proc.py
calcResult
threadpool.py
/home/hadoop/nisj/parallelBatchOnBigData-forHql/calcResult:
remainAfterDay_54000_2016-11-01_120.txt
remainAfterDay_54000_2016-11-01_15.txt
remainAfterDay_54000_2016-11-01_30.txt
remainAfterDay_54000_2016-11-01_60.txt
remainAfterDay_54000_2016-11-01_7.txt
remainAfterDay_54000_2016-11-01_90.txt
remainAfterDay_87503_2017-02-17_120.txt
remainAfterDay_87503_2017-02-17_15.txt
remainAfterDay_87503_2017-02-17_30.txt
remainAfterDay_87503_2017-02-17_60.txt
remainAfterDay_87503_2017-02-17_7.txt
remainAfterDay_87503_2017-02-17_90.txt
[hadoop@emr-worker-9 parallelBatchOnBigData-forHql]$