Python进行数据的多表去重示例

来源:互联网 发布:淘宝千牛发货流程 编辑:程序博客网 时间:2024/04/29 11:14
1、Python集合及其交并差操作
python的set和其他语言类似, 是一个无序不重复元素集, 基本功能包括关系测试和消除重复元素. 集合对象还支持union(联合), intersection(交), difference(差)和sysmmetric difference(对称差集)等数学运算.  
sets 支持 x in set, len(set),和 for x in set。作为一个无序的集合,sets不记录元素位置或者插入点。因此,sets不支持 indexing, slicing, 或其它类序列(sequence-like)的操作。
集合支持一系列标准操作,包括并集、交集、差集和对称差集,例如:
a = t | s          # t 和 s的并集b = t & s          # t 和 s的交集c = t – s          # 求差集(项在t中,但不在s中)d = t ^ s          # 对称差集(项在t或s中,但不会同时出现在二者中)

2、应用到数据去重上的一个示例
除了去重之外,还可以进行集合的取交、并、差集等操作,本例仅是将数据去重取并集。

/Users/nisj/PycharmProjects/EsDataProc/mysql_submeter_distinct.py

# -*- coding=utf-8 -*-import MySQLdbimport warningsimport datetimeimport timewarnings.filterwarnings("ignore")db_config_src1 = {    'host': 'src-ip',    'user': 'src-user',    'passwd': 'src-passwd',    'port': src-port,    'db': 'jellyfish_server'}db_config_src2 = {    'host': 'src-ip',    'user': 'src-user',    'passwd': 'src-passwd',    'port': src-port,    'db': 'jellyfish_stat'}db_config_tgt = {    'host': '172.16.96.34',    'user': 'root',    'passwd': '123',    'port': 3306,    'db': 'timeline'}def getDB(host, user, passwd, port, db):    try:        conn = MySQLdb.connect(host=host, user=user, passwd=passwd, port=port ,db=db)        conn.autocommit(True)        curr = conn.cursor()        curr.execute("SET NAMES utf8");        curr.execute("USE %s" % db);        return conn, curr    except MySQLdb.Error, e:        print "Mysql Error %d: %s" % (e.args[0], e.args[1])        return None, Noneconn1, curr1 = getDB(host=db_config_src1['host'], user=db_config_src1['user'], passwd=db_config_src1['passwd'],port=db_config_src1['port'],db=db_config_src1['db'])conn2, curr2 = getDB(host=db_config_src2['host'], user=db_config_src2['user'], passwd=db_config_src2['passwd'],port=db_config_src2['port'],db=db_config_src2['db'])today = datetime.date.today()yesterday = today - datetime.timedelta(days=1)tomorrow = today + datetime.timedelta(days=1)now_time = time.strftime('%Y-%m-%d %X', time.localtime())print "当前时间是:",now_timesql_text = "select table_name from information_schema.tables where table_schema='jellyfish_server' and table_name like 'live_point_9%';"curr1.execute(sql_text)tab_list = curr1.fetchall()live_point_datas_union = set()for tab_name in tab_list:    tab_name = tab_name[0]    sql_text = "select live_status_id,room_id from jellyfish_server.%s ;" % (tab_name)    curr1.execute(sql_text)    live_point_datas = curr1.fetchall()    live_point_datas = set(live_point_datas)    live_point_datas_union = live_point_datas_union | live_point_datasi = 0for x in live_point_datas_union:    i += 1    print i,x[0],x[1]print "总的不重复RoomId数是:",inow_time = time.strftime('%Y-%m-%d %X', time.localtime())print "当前时间是:",now_timecurr1.close()conn1.close()curr2.close()conn2.close()

0 0
原创粉丝点击