Inserting data on the hundred-million-row scale into MySQL with Python on Linux

Source: Internet | Published by: 汽车模具编程 | Editor: 程序博客网 | Time: 2024/05/21 11:12

Installing MySQLdb

MySQLdb has to be installed first. The install command is

yum install MySQL-python

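After installation, enter the following on the command line

$ python            # start the interpreter; what follows is Python code
>>> import MySQLdb
>>> # if the prompt comes back without an error, the installation succeeded
>>> exit()          # quit Python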
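The same check can also be scripted instead of typed at the prompt. This is only a minimal sketch using the standard library; the helper name `is_importable` is made up for illustration.

```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# "MySQLdb" is the module name provided by the MySQL-python package
print("MySQLdb available: %s" % is_importable("MySQLdb"))
```

If this prints `False`, the package is probably not installed for the Python interpreter that ran the script.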

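Inserting hundreds of millions of rows

The author knows only two ways to write the SQL (experts may know more): (1) insert one row per statement, (2) insert multiple rows per statement.
I had read in a book that inserting multiple rows per statement speeds up the SQL, so the experiment below uses approach (2) to insert 10 million rows.
First, create a table. Any table will do: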

CREATE TABLE `good` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `price` double DEFAULT NULL,
  `color` varchar(255) DEFAULT NULL,
  `goodNum` int(11) DEFAULT NULL,
  `brandName` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);
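To make the two approaches concrete, here is a small sketch that only builds the SQL text for each, without touching a database. The function names and the sample rows are made up for illustration, and only the `name` and `price` columns of `good` are used to keep it short.

```python
def single_row_statements(rows):
    """Approach (1): one INSERT per row, so N statements for N rows."""
    template = "insert into good(name,price) values('%s',%d);"
    return [template % (name, price) for name, price in rows]


def multi_row_statement(rows):
    """Approach (2): one INSERT carrying every row, so 1 statement for N rows."""
    values = ",".join("('%s',%d)" % (name, price) for name, price in rows)
    return "insert into good(name,price) values" + values + ";"


rows = [("abc", 5200), ("def", 5300), ("ghi", 5400)]  # made-up sample data
for statement in single_row_statements(rows):
    print(statement)
print(multi_row_statement(rows))
```

With approach (2) the server parses one statement and the client makes one round trip per batch, which is the usual explanation for the speed-up. Building SQL with string formatting like this is only reasonable for generated, trusted values.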

The code used on Linux is as follows

import time
import random as rd
import MySQLdb as md


def test():
    con = md.connect(host="localhost", user="root", passwd="admin123", db="test")
    cursor = con.cursor()
    tm1 = time.time()
    oriName = "sujaloiushtegsk"  # letters used to build random strings
    oriPrice = 5000
    oriPid = 1831098
    # 10,000 batches x 10,000 rows per batch = 100 million rows in total
    for i in range(10000):
        tm = time.time()
        rows = []
        for j in range(10000):
            N1 = rd.randint(1, 14)
            N2 = rd.randint(1, 14)
            N3 = rd.randint(1, 14)
            PP = rd.randint(200, 1500)
            ppid = rd.randint(1, 10000)
            name = oriName[N1] + oriName[N2] + oriName[N3]
            brandName = oriName[N3] + oriName[N1]
            color = oriName[N1] + oriName[N3]
            goodNum = oriPid + ppid
            price = oriPrice + PP
            rows.append("('%s',%d,'%s',%d,'%s')" % (name, price, color, goodNum, brandName))
        # one multi-row INSERT per batch, committed once per batch
        sql = "insert into good(name,price,color,goodNum,brandName) values" + ",".join(rows) + ";"
        n = cursor.execute(sql)  # n is the number of affected rows
        con.commit()
        a = time.time()
        print("the" + str(i + 1) + "'s time is :" + str(a - tm))
    tm2 = time.time()
    print(str(tm2 - tm1))
    con.close()


if __name__ == "__main__":
    test()
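The script above builds each statement by string concatenation. That works for generated data but is unsafe for untrusted input. One possible alternative, sketched here and not the author's method, is to pass the values as parameters through the DB-API `executemany` call. The names `INSERT_SQL`, `make_rows` and `insert_batch` are hypothetical helpers, and the random rows only approximate the ones generated above. As far as I know, MySQLdb sends an `INSERT ... VALUES` given to `executemany` as a multi-row statement, but that is worth verifying on your version.

```python
import random as rd

# %s is the parameter placeholder used by MySQLdb
INSERT_SQL = ("insert into good(name,price,color,goodNum,brandName) "
              "values (%s,%s,%s,%s,%s)")


def make_rows(count, letters="sujaloiushtegsk"):
    """Generate `count` random rows shaped like the ones in the script."""
    rows = []
    for _ in range(count):
        a, b, c = (rd.choice(letters) for _ in range(3))
        rows.append((a + b + c,                       # name
                     5000 + rd.randint(200, 1500),    # price
                     a + c,                           # color
                     1831098 + rd.randint(1, 10000),  # goodNum
                     c + a))                          # brandName
    return rows


def insert_batch(con, rows):
    """Insert one batch with parameter binding and commit once."""
    cursor = con.cursor()
    cursor.executemany(INSERT_SQL, rows)
    con.commit()


# Usage (needs a reachable MySQL server and the `good` table):
#   import MySQLdb as md
#   con = md.connect(host="localhost", user="root", passwd="admin123", db="test")
#   insert_batch(con, make_rows(10000))
#   con.close()
```

Whether this is faster or slower than the hand-built statement was not measured here; the main gain is that the driver handles quoting.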

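Change the table name and similar details to suit your own setup. The author's coding skills are limited, so please bear with the code.
By my measurements, a 20-million-row data set took roughly 400 s, and 100 million rows took 1860 s = 31 minutes. Because the inserted rows are fairly simple and have only a few columns, this is still quite fast.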
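For comparison, the timings above can be turned into average throughput. A small sketch of the arithmetic, where the function name `rows_per_second` is made up for illustration:

```python
def rows_per_second(rows, seconds):
    """Average insert throughput for one run."""
    return rows / float(seconds)


runs = [(20000000, 400), (100000000, 1860)]  # (rows, seconds) as reported
for rows, seconds in runs:
    print("%d rows in %d s: %.0f rows/s, %.1f min"
          % (rows, seconds, rows_per_second(rows, seconds), seconds / 60.0))
```

Both runs work out to roughly 50,000 to 54,000 rows per second, so the rate stayed about the same as the row count grew from 20 million to 100 million.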
If you have other suggestions for improvement, please share them.

0 0