Python自动化拉取Mysql数据并装载到Hive(V3.0)
来源:互联网 发布:苹果电脑激活windows 编辑:程序博客网 时间:2024/06/06 15:35
相关参考:
Python自动化拉取Mysql数据并装载到Hive(V2.0)-->http://blog.csdn.net/babyfish13/article/details/70792158
Python自动化拉取Mysql数据并装载到Hive-->http://blog.csdn.net/babyfish13/article/details/73618331
相较于之前的版本,该版本的优化主要体现在如下方面:
新增增量数据跑批功能(凡是增量跑批的数据都是采用分区表的形式进行数据的存储;新增了与增量跑批相适应的增量数据初始化,可以自定义增量跑批的表及数据增量的日期范围;可以根据需要自定义增量的时间标志)
表的配置文件,摈弃了原件以文本形式存储的方式,采用.py文件
跑批过程中的日志记录,虽暂未完善
程序的目录结构进行了优化,有public、conf、功能目录mysql2hive及可以自己定义的logs及dataTmp目录
并行调度方面斩未考虑。
1、配置文件
1.1系统级配置
/Users/nisj/PycharmProjects/mysql2Hive/conf/systemParConf.py
1.2表级配置
/Users/nisj/PycharmProjects/mysql2Hive/conf/tableParConf.py
2、公共代码部分
2.1、公共系统变量获取代码
/Users/nisj/PycharmProjects/mysql2Hive/public/getSystemPar.py
2.2、源表结构及字段、Hive建表脚本代码
/Users/nisj/PycharmProjects/mysql2Hive/public/getSrcMetadata.py
2.3、日志记录代码
/Users/nisj/PycharmProjects/mysql2Hive/public/logRecord.py
3、功能实现部分
3.1、Hive库建表
/Users/nisj/PycharmProjects/mysql2Hive/mysql2hive/hiveTabCreate.py
3.2、mysql数据向Hive传输
/Users/nisj/PycharmProjects/mysql2Hive/mysql2hive/mysqlData2Hive.py
4、总控及调度
/Users/nisj/PycharmProjects/mysql2Hive/SerialBatch.py
5、生产定时
生产上crontab执行:
说明:真用crontab调的时候,有几个路径问题要写成绝对路径。
Python自动化拉取Mysql数据并装载到Hive(V2.0)-->http://blog.csdn.net/babyfish13/article/details/70792158
Python自动化拉取Mysql数据并装载到Hive-->http://blog.csdn.net/babyfish13/article/details/73618331
相较于之前的版本,该版本的优化主要体现在如下方面:
新增增量数据跑批功能(凡是增量跑批的数据都是采用分区表的形式进行数据的存储;新增了与增量跑批相适应的增量数据初始化,可以自定义增量跑批的表及数据增量的日期范围;可以根据需要自定义增量的时间标志)
表的配置文件,摈弃了原件以文本形式存储的方式,采用.py文件
跑批过程中的日志记录,虽暂未完善
程序的目录结构进行了优化,有public、conf、功能目录mysql2hive及可以自己定义的logs及dataTmp目录
并行调度方面斩未考虑。
1、配置文件
1.1系统级配置
/Users/nisj/PycharmProjects/mysql2Hive/conf/systemParConf.py
# -*- coding=utf-8 -*-import warningswarnings.filterwarnings("ignore")# public Database configsrcMysqlConfig_Tv_server = { 'host': 'MysqlHostInnerIp', # 'host': 'MysqlHostOuterIp', 'user': 'MysqlUser', 'passwd': 'MysqlPass', 'port': 50506, 'db': 'Tv_server'}srcMysqlConfig_Tv_user = { 'host': 'MysqlHostInnerIp', # 'host': 'MysqlHostOuterIp', 'user': 'MysqlUser', 'passwd': 'MysqlPass', 'port': 50514, 'db': 'Tv_user'}srcMysqlConfig_Tv_seed = { 'host': 'MysqlHostInnerIp', # 'host': 'MysqlHostOuterIp', 'user': 'MysqlUser', 'passwd': 'MysqlPass', 'port': 50029, 'db': 'Tv_seed'}srcMysqlConfig_Tv_event = { 'host': 'MysqlHostInnerIp', # 'host': 'MysqlHostOuterIp', 'user': 'MysqlUser', 'passwd': 'MysqlPass', 'port': 50512, 'db': 'Tv_event'}srcMysqlConfig_Tv_hadoop_stat = { 'host': 'MysqlHostInnerIp', # 'host': 'MysqlHostOuterIp', 'user': 'MysqlUser', 'passwd': 'MysqlPass', 'port': 6605, 'db': 'Tv_hadoop_stat'}# target tmp data path configtmp_data_dir = '/home/hadoop/nisj/mysql2Hive/dataTmp'# log path configlog_dir = '/home/hadoop/nisj/mysql2Hive/logs'
1.2表级配置
/Users/nisj/PycharmProjects/mysql2Hive/conf/tableParConf.py
# -*- coding=utf-8 -*-from conf.systemParConf import *warnings.filterwarnings("ignore")srcTabConfig_game_zone = { 'srcMysql_config': srcMysqlConfig_jellyfish_server, 'src_tabName': 'game_zone', 'tabType': 'single', 'loadType': 'increment[substr(created_time,1,10)]'}srcTabConfig_game = { 'srcMysql_config': srcMysqlConfig_jellyfish_seed, 'src_tabName': 'game', 'tabType': 'single', 'loadType': 'whole'}srcTabConfig_match_apply = { 'srcMysql_config': srcMysqlConfig_jellyfish_event, 'src_tabName': 'bless_wall_201612', 'tabType': 'single', 'loadType': 'whole'}srcTabConfig_user_profile = { 'srcMysql_config': srcMysqlConfig_jellyfish_user, 'src_tabName': 'user_profile', 'tabType': 'submeter-256', 'loadType': 'whole'}srcTabConfig_live_history_status = { 'srcMysql_config': srcMysqlConfig_jellyfish_server, 'src_tabName': 'live_history_status', 'tabType': 'submeter-28', 'loadType': 'increment[substr(switch_time,1,10)]'}
2、公共代码部分
2.1、公共系统变量获取代码
/Users/nisj/PycharmProjects/mysql2Hive/public/getSystemPar.py
# -*- coding=utf-8 -*-import warningsimport datetimefrom conf.tableParConf import *warnings.filterwarnings("ignore")def getNowDay(): DayNow = datetime.datetime.today().strftime('%Y-%m-%d') return DayNowdef getYesterDay(): YesterDay = (datetime.datetime.today() - datetime.timedelta(1)).strftime('%Y-%m-%d') return YesterDaydef getRunDay(runDay): if runDay == '': runDay = getYesterDay() else: runDay = runDay return runDaydef dateRange(beginDate, endDate): dates = [] dt = datetime.datetime.strptime(beginDate, "%Y-%m-%d") date = beginDate[:] while date <= endDate: dates.append(date) dt = dt + datetime.timedelta(1) date = dt.strftime("%Y-%m-%d") return datesdef getSrcMysqlConfig(srcMysql_config): srcMysql_config = srcMysql_config return srcMysql_config['host'], srcMysql_config['port'], srcMysql_config['user'], srcMysql_config['passwd'], srcMysql_config['db']def getTabParName(tabParListFilter): with open("conf/tableParConf.py") as tabPar: tabParList = [] for ConfigLine in tabPar: if '#' not in ConfigLine and ConfigLine.replace('\n', '') != '' and ' = {' in ConfigLine: tabParList.append(ConfigLine.split(' = ')[0]) if tabParListFilter == ['']: tabParList = tabParList else: tabParList = tabParListFilter return tabParList# batch_test# print getRunDay(runDay='')
2.2、源表结构及字段、Hive建表脚本代码
/Users/nisj/PycharmProjects/mysql2Hive/public/getSrcMetadata.py
# -*- coding=utf-8 -*-import osimport refrom public.getSystemPar import *warnings.filterwarnings("ignore")def mysqlTabCreateScript(srcMysql_config, src_tabName, tabType, loadType): # 参数初始化赋值 host = getSrcMysqlConfig(srcMysql_config)[0] port = getSrcMysqlConfig(srcMysql_config)[1] user = getSrcMysqlConfig(srcMysql_config)[2] passwd = getSrcMysqlConfig(srcMysql_config)[3] db = getSrcMysqlConfig(srcMysql_config)[4] if tabType == 'single': srcTabName = src_tabName elif 'submeter' in tabType: srcTabName = src_tabName + "_0" srcTabStructure = os.popen("""source /etc/profile; \ /usr/bin/mysql -h{host} -P{port} -u{user} -p{passwd} -D{db} \ -N -e"set names utf8; \ select a2.column_name,case when a2.data_type like '%int' then 'bigint' else 'string' end data_type from information_schema.TABLES a1 left join information_schema.columns a2 on a1.TABLE_SCHEMA=a2.TABLE_SCHEMA and a1.TABLE_NAME=a2.TABLE_NAME where a1.TABLE_SCHEMA='{db}' and a1.table_name ='{srcTabName}' order by a2.ORDINAL_POSITION;" \ """ .format(host=host, port=port, user=user, passwd=passwd, db=db, srcTabName=srcTabName)).readlines(); srcTabCol_list = [] for stcList in srcTabStructure: stc = re.split('\t', stcList.replace('\n', '')) srcTabCol_list.append(stc) TabCreateScript = 'drop table if exists xxx_{src_tabName};\ncreate table xxx_{src_tabName}(\n'.format(src_tabName=src_tabName) colList = [] for srcColType in srcTabCol_list: TabCreateScript = TabCreateScript + '\`' + srcColType[0] + '\`' + ' ' + srcColType[1] + ',\n' colList.append(srcColType[0]) if 'whole' in loadType: TabCreateScript = TabCreateScript[:-2] + ")row format delimited fields terminated by '\t' lines terminated by '\n';" elif 'increment' in loadType: TabCreateScript = TabCreateScript[:-2] + ") partitioned by(data_day varchar(10)) row format delimited fields terminated by '\t' lines terminated by '\n';" return TabCreateScript, colList# Batch Test
2.3、日志记录代码
/Users/nisj/PycharmProjects/mysql2Hive/public/logRecord.py
import loggingimport datetimefrom conf.systemParConf import log_dirdef logRecord(message): logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(funcName)s %(levelname)s %(message)s', datefmt='%Y-%m-%d %a %H:%M:%S', filename='{log_dir}/dataEtl_{currDay}.log'.format(log_dir=log_dir, currDay=datetime.datetime.today().strftime('%Y-%m-%d')), filemode='a') logging.info('{message}'.format(message=message))
3、功能实现部分
3.1、Hive库建表
/Users/nisj/PycharmProjects/mysql2Hive/mysql2hive/hiveTabCreate.py
# -*- coding=utf-8 -*-from public.getSrcMetadata import *warnings.filterwarnings("ignore")def HiveCreateTab(srcMysql_config, src_tabName, tabType, logType): TabCreateScript = mysqlTabCreateScript(srcMysql_config, src_tabName, tabType, logType)[0] os.system("""/usr/lib/hive-current/bin/hive -e "{TabCreateScript}" """.format(TabCreateScript=TabCreateScript))# Batch Test
3.2、mysql数据向Hive传输
/Users/nisj/PycharmProjects/mysql2Hive/mysql2hive/mysqlData2Hive.py
# -*- coding=utf-8 -*-from public.getSrcMetadata import *from public.getSystemPar import *warnings.filterwarnings("ignore")def mysqlDataDownload(srcMysql_config, src_tabName, tabType, loadType, runDay): # 参数初始化赋值 host = getSrcMysqlConfig(srcMysql_config)[0] port = getSrcMysqlConfig(srcMysql_config)[1] user = getSrcMysqlConfig(srcMysql_config)[2] passwd = getSrcMysqlConfig(srcMysql_config)[3] db = getSrcMysqlConfig(srcMysql_config)[4] if os.path.exists('{tmp_data_dir}/'.format(tmp_data_dir=tmp_data_dir)) == False: os.system('mkdir -p {tmp_data_dir}/'.format(tmp_data_dir=tmp_data_dir)) colList = mysqlTabCreateScript(srcMysql_config, src_tabName, tabType, loadType)[1] allColChars = '' for colName in colList: allColChars = allColChars + 'replace(replace(replace(replace(\`' + colName + '\`,\',\',\'[comma]\'),\'\\n\',\'[newline-n]\'),\'\\r\',\'[newline-r]\'),\'\\t\',\'[tab]\')' + ',' if tabType == 'single': submeter_cnt = 1 elif 'submeter' in tabType: submeter_cnt = int(str(tabType).replace('submeter-', '')) if 'whole' in loadType: sql_where = '' elif 'increment' in loadType: sql_where = 'where ' + str(loadType.replace('increment[', '')).replace(']', '') + ' = \'{runDay}\''.format(runDay=runDay) os.system("rm -rf {tmp_data_dir}/xxx_{src_tabName}.txt".format(src_tabName=src_tabName, tmp_data_dir=tmp_data_dir)) for submeterPlus in range(0, submeter_cnt, 1): if submeter_cnt == 1: submeterPlus = '' else: submeterPlus = "_" + str(submeterPlus) os.system("""source /etc/profile; \ /usr/bin/mysql -h{host} -P{port} -u{user} -p{passwd} -D{db} \ -N -e"set names utf8; \ select {allColChars} from {db}.{src_tabName}{submeterPlus} {sql_where};" \ >>{tmp_data_dir}/xxx_{src_tabName}.txt \ """.format(host=host, port=port, user=user, passwd=passwd, db=db, src_tabName=src_tabName, tmp_data_dir=tmp_data_dir, allColChars=allColChars[:-1], submeterPlus=submeterPlus, sql_where=sql_where))def DataUploadHive(src_tabName, loadType, runDay): partitionProc_sql = """alter table xxx_{src_tabName} drop if exists partition (data_day = '{runDay}');alter table xxx_{src_tabName} add partition (data_day = '{runDay}');""".format(src_tabName=src_tabName, runDay=runDay) if 'whole' in loadType: os.system("""/usr/lib/hive-current/bin/hive -e "load data local inpath '{tmp_data_dir}/xxx_{src_tabName}.txt' overwrite into table xxx_{src_tabName};" """.format(src_tabName=src_tabName, tmp_data_dir=tmp_data_dir)) elif 'increment' in loadType: os.system("""/usr/lib/hive-current/bin/hive -e "{partitionProc_sql}load data local inpath '{tmp_data_dir}/xxx_{src_tabName}.txt' overwrite into table xxx_{src_tabName} partition (data_day = '{runDay}');" """.format(src_tabName=src_tabName, tmp_data_dir=tmp_data_dir, runDay=runDay, partitionProc_sql=partitionProc_sql)) os.system("rm -rf {tmp_data_dir}/xxx_{src_tabName}.txt ".format(src_tabName=src_tabName, tmp_data_dir=tmp_data_dir))def MysqlData2hive(srcMysql_config, src_tabName, tabType, loadType, runDay): mysqlDataDownload(srcMysql_config, src_tabName, tabType, loadType, runDay) DataUploadHive(src_tabName, loadType, runDay)# Batch Test
4、总控及调度
/Users/nisj/PycharmProjects/mysql2Hive/SerialBatch.py
# -*- coding=utf-8 -*-from mysql2hive.mysqlData2Hive import *from mysql2hive.hiveTabCreate import *from public.logRecord import logRecordwarnings.filterwarnings("ignore")def Mysql2Hive(srcMysql_config, src_tabName, tabType, loadType, isIncrementInt, runDay): if 'whole' in loadType: HiveCreateTab(srcMysql_config, src_tabName, tabType, loadType) elif 'increment' in loadType: if isIncrementInt == 'True': HiveCreateTab(srcMysql_config, src_tabName, tabType, loadType) MysqlData2hive(srcMysql_config, src_tabName, tabType, loadType, runDay)def mysql2HiveSerialBatch(isIncrementInt, runDay, tabParListFilter): for tabParConfigName in getTabParName(tabParListFilter): tabParConfig = eval(tabParConfigName) logRecord(message='Table {src_tabName}({tabType}) data {loadType} load from mysql to hive start...'.format(src_tabName=tabParConfig['src_tabName'], tabType=tabParConfig['tabType'], loadType=tabParConfig['loadType'])) Mysql2Hive(tabParConfig['srcMysql_config'], tabParConfig['src_tabName'], tabParConfig['tabType'], tabParConfig['loadType'], isIncrementInt, runDay) logRecord(message='Table {src_tabName}({tabType}) data {loadType} load from mysql to hive finished!'.format(src_tabName=tabParConfig['src_tabName'], tabType=tabParConfig['tabType'], loadType=tabParConfig['loadType']))def mysql2HiveSerialCtl(isIncrementInt, runDayList, tabParListFilter): for runDay in runDayList: mysql2HiveSerialBatch(isIncrementInt=isIncrementInt, runDay=getRunDay(runDay), tabParListFilter=tabParListFilter)# 当且仅当【isIncrementInt='True'】时,进行增量传输数据的初始化;其他任何情况都不进行增量传输数据初始化# 当【runDayList = ['']】时,进行正常数据跑批;其他则是list日期内的数据跑批,如runDayList = ['2017-06-18','2017-06-19','2017-06-20']# 当【tabParListFilter = ['']】时,则不进行表配置的过滤,tabParCong中所有未被注释的表配置都会进行数据的跑批;过滤如:tabParListFilter = ['srcTabConfig_game_zone','srcTabConfig_game']# 日常正常跑批示例:tabParListFilter = ['']isIncrementInt = 'F'runDayList = ['']mysql2HiveSerialCtl(isIncrementInt, runDayList, tabParListFilter)# 若干表的补数# tabParListFilter = ['srcTabConfig_live_history_status']# isIncrementInt = 'F'# runDayList = ['2017-06-21','2017-06-22','2017-06-23']# mysql2HiveSerialCtl(isIncrementInt, runDayList, tabParListFilter)
5、生产定时
生产上crontab执行:
[hadoop@emr-worker-9 mysql2Hive]$ crontab -l48 10 * * * python /home/hadoop/nisj/mysql2Hive/SerialBatch.py >> /home/hadoop/nisj/mysql2Hive/logs/SerialBatch.log 2>&1
说明:真用crontab调的时候,有几个路径问题要写成绝对路径。
附:程序路径
阅读全文
0 0
- Python自动化拉取Mysql数据并装载到Hive(V3.0)
- Python自动化拉取Mysql数据并装载到Hive
- Python自动化拉取Mysql数据并装载到Hive
- Python自动化拉取Mysql数据并装载到Hive(V2.0)
- Python自动化拉取Mysql数据并装载到Oracle
- Python将Mysql分表数据按小时增量装载到Hive示例
- 使用Python脚本从Hive中取数据计算后加载到Mysql示例
- 用Python将json数据装载到mysql
- Hive搭建并将元数据存储到MySQL
- Hive装载数据命令
- Python从阿里云Oss拉数据写入Hive表并进行相关处理
- Python爬取数据并写入MySQL数据库
- hive——拉取数据的一个实例
- shell拉取MYSQL数据库数据
- python 远程连接MySQL数据库 拉取数据存至本地文件
- hive 表装载数据详解
- hive数据去重,并取指定的一条数据
- #python学习笔记#使用python爬取网站数据并保存到数据库
- 6.26~6.29模拟总结
- Yii2的高级应用程序模板中设置隐藏 index.php 步骤
- 关于使用Java实现的简单网络爬虫Demo
- 使用@ViewChild @ViewChildren(ngAfterViewInit), @ViewChild@ViewChildren(ngAfterContentInit)
- java使用new Date()和System.currentTimeMillis()获取当前时间戳
- Python自动化拉取Mysql数据并装载到Hive(V3.0)
- MCC(移动国家码)和 MNC(移动网络码)
- Lucene
- 剑指offer面试题17-:合并两个排序链表
- 计算机基础导论
- 查看等待类型
- Android开发环境搭建-Ubuntu系统
- Java NIO Selector
- MapReduce1.0和MapReduce2.0