Inserting One Million Rows into a Database


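This is a deeper look into one of my database homework assignments.

First, the requirements of the assignment:

Create a table named test with four columns.

The first column is id: primary key, auto-increment.

The second column is col1: randomly one of Mike, Bob, Jack, Alice, Cathy, Ann, Betty, Cindy, Mary, Jane.

The third column is col2: a random 5-letter string whose letters are restricted to a-e.

The fourth column is col3: a random integer between 1 and 20.

Following the table requirements from step one, insert 1,000,000 records and record the execution time.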
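The assignment only fixes the column names and value ranges, not the DDL. As a minimal sketch, the table could be created with a statement like the one below. The column types (INT, VARCHAR(10), CHAR(5), TINYINT) and the helper class name TestTableDdl are my assumptions, not part of the assignment.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class TestTableDdl {
    // Hypothetical DDL: the column types are assumptions chosen to fit the
    // value ranges described in the assignment (MySQL syntax).
    static String createTableSql() {
        return "CREATE TABLE IF NOT EXISTS test ("
                + "id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, "
                + "col1 VARCHAR(10) NOT NULL, "
                + "col2 CHAR(5) NOT NULL, "
                + "col3 TINYINT NOT NULL)";
    }

    // Executes the DDL on an already opened connection.
    static void createTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(createTableSql());
        }
    }
}
```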

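Preprocessing the value ranges of the data to insert

(1) For col1, create an array of the allowed values:

private static String[] col1Values = {"Mike", "Bob", "Jack", "Alice", "Cathy", "Ann", "Betty", "Cindy", "Mary", "Jane"};

To pick a random value, just use col1Values[(int)(Math.random()*10)].

(2) For col2, build the array of allowed values recursively (5 letters with 5 choices each gives 5^5 = 3125 strings):

private static String[] col2Values = new String[3125];
private static int point; // next free slot in col2Values

static {
    point = 0;
    initCol2Value(5, new StringBuffer(""));
}

// Appends one of 'a'..'e' per level; when n reaches 0 the prefix is a complete 5-letter string.
private static void initCol2Value(int n, StringBuffer str) {
    if (n == 0) {
        col2Values[point++] = new String(str);
        return;
    }
    for (int i = 0; i < 5; i++) {
        StringBuffer strTemp = new StringBuffer(str);
        initCol2Value(n - 1, strTemp.append((char) ('a' + i)));
    }
}

To pick a random value, just use col2Values[(int)(Math.random()*3125)].

(3) For col3, a random value is simply (int)(Math.random()*20)+1.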
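The recursive fill enumerates the strings in the same order as counting in base 5 (digits 0-4 mapped to a-e, most significant digit first), so an entry can also be computed on demand instead of being precomputed. The following is a small sketch of that alternative; the class and method names are hypothetical.

```java
class Col2Index {
    // Treats index (0..3124) as a 5-digit base-5 number and maps each digit
    // 0..4 to 'a'..'e'. The result matches the order in which the recursive
    // fill stores the strings: 0 -> "aaaaa", 1 -> "aaaab", ..., 3124 -> "eeeee".
    static String col2FromIndex(int index) {
        if (index < 0 || index >= 3125) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        char[] letters = new char[5];
        for (int pos = 4; pos >= 0; pos--) {
            letters[pos] = (char) ('a' + index % 5); // least significant digit goes last
            index /= 5;
        }
        return new String(letters);
    }
}
```

A random col2 value would then be Col2Index.col2FromIndex((int) (Math.random() * 3125)), at the cost of building a new string per row.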

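Inserting a large volume of data

(1) The first method that comes to mind is of course the traditional row-by-row insert: obtain a Statement from the Connection, call its execute method to run an SQL statement that inserts one row, and repeat this 1,000,000 times. The per-statement overhead is far too high, though; I estimated it would take about an hour.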
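For reference, the row-by-row approach described above could look roughly like the sketch below. I did not keep the original code for this step, so the class name, method names, and parameters are hypothetical. String concatenation is acceptable here only because the values come from fixed alphabetic arrays; it is not safe for untrusted input.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class RowByRowInsert {
    // Builds the SQL text for a single row.
    static String insertSql(String col1, String col2, int col3) {
        return "INSERT INTO test (col1,col2,col3) VALUES ('"
                + col1 + "','" + col2 + "'," + col3 + ")";
    }

    // Sends one INSERT statement per row, which is what makes this approach slow.
    static void insertRows(Connection conn, String[] col1Values,
                           String[] col2Values, int rows) throws SQLException {
        try (Statement st = conn.createStatement()) {
            for (int i = 0; i < rows; i++) {
                String col1 = col1Values[(int) (Math.random() * col1Values.length)];
                String col2 = col2Values[(int) (Math.random() * col2Values.length)];
                int col3 = (int) (Math.random() * 20) + 1;
                st.execute(insertSql(col1, col2, col3));
            }
        }
    }
}
```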

(2) Next I thought of using a prepared statement, which improved efficiency considerably. Here is the core of that code.

public static void insertData() {
    try {
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        // Run all inserts in one transaction instead of committing per row
        conn.setAutoCommit(false);
        PreparedStatement pst = conn.prepareStatement("INSERT INTO test(col1,col2,col3) values(?,?,?)");
        for (int i = 1; i <= 1000000; i++) {
            pst.setString(1, col1Values[(int) (Math.random() * 10)]);
            pst.setString(2, col2Values[(int) (Math.random() * 3125)]);
            pst.setInt(3, (int) (Math.random() * 20) + 1);
            pst.execute();
        }
        conn.commit();
        pst.close();
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}

The test results:

start insert data
end insert data
insert time: 110.215 s

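(3) I was still not satisfied with this result, so I started exploring.

(a) I found a method online: use the addBatch() and executeBatch() methods of PreparedStatement. With batching, 1,000 or even 10,000 SQL inserts can be submitted at once as a single transaction, and the author had measured under 10 s on an Oracle database. I tried it as well, but it still took about 107 s, which left me puzzled.

(b) Then I came across another article explaining why batching brings no speedup with the MySQL JDBC driver: by default the driver does not optimize addBatch()/executeBatch() and still sends the statements one at a time (Connector/J only rewrites a batch into multi-row inserts when rewriteBatchedStatements=true is set in the connection URL). The Oracle driver, by contrast, does support batch optimization, and according to that article it can insert 1,000,000 records in about 360 ms.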

URL: http://elf8848.iteye.com/blog/770032
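A batched insert along the lines of (a) can be sketched as follows. This is not the code I measured: the class name, method names, and batchSize parameter are hypothetical. The rewriteBatchedStatements property is a MySQL Connector/J connection option; I have not timed this variant, so no speedup is claimed here.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchInsertSketch {
    // Appends the Connector/J option that lets the driver rewrite a batch
    // of single-row INSERTs into multi-row INSERT statements.
    static String withBatchRewrite(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "rewriteBatchedStatements=true";
    }

    // Queues rows with addBatch() and flushes every batchSize rows.
    static void insertBatched(Connection conn, String[] col1Values, String[] col2Values,
                              int totalRows, int batchSize) throws SQLException {
        conn.setAutoCommit(false); // one transaction for the whole load
        try (PreparedStatement pst = conn.prepareStatement(
                "INSERT INTO test (col1,col2,col3) VALUES (?,?,?)")) {
            for (int i = 1; i <= totalRows; i++) {
                pst.setString(1, col1Values[(int) (Math.random() * col1Values.length)]);
                pst.setString(2, col2Values[(int) (Math.random() * col2Values.length)]);
                pst.setInt(3, (int) (Math.random() * 20) + 1);
                pst.addBatch();
                // Flush a full batch, and the final partial batch if any
                if (i % batchSize == 0 || i == totalRows) {
                    pst.executeBatch();
                }
            }
        }
        conn.commit();
    }
}
```

For example, BatchInsertSketch.withBatchRewrite("jdbc:mysql://localhost:3306/db_test") yields a URL ending in ?rewriteBatchedStatements=true, which would be passed to DriverManager.getConnection.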

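(c) Later I read a blog post by our class monitor Ge, who inserted 1,000,000 rows into SQLite from Python in only 4 seconds. The reason is that all 1,000,000 insert statements were placed in a single transaction, which greatly reduces the time spent beginning and committing transactions, and that is exactly the part that consumes most of the time.

URL: http://aegiryy.net/?p=380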

(d) This inspired me, and I also learned that in MySQL a single INSERT statement can insert multiple rows. So I tried building a fairly large SQL statement with a StringBuffer, each statement inserting 10,000 rows (with 100,000 or 1,000,000 rows per statement the heap memory limit is exceeded), so that 100 iterations complete the insert. Here is the core code of this method:

public static void insertData() {
    try {
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        Statement st = conn.createStatement();
        // 100 statements, each inserting 10,000 rows
        for (int i = 0; i < 100; i++) {
            StringBuffer sqlBuffer = new StringBuffer("insert into test (col1,col2,col3) values");
            // First row: no leading comma
            sqlBuffer.append(" (\"" + col1Values[(int) (Math.random() * 10)] + "\",\""
                    + col2Values[(int) (Math.random() * 3125)] + "\","
                    + ((int) (Math.random() * 20) + 1) + ")");
            // Remaining 9,999 rows
            for (int j = 2; j <= 10000; j++) {
                sqlBuffer.append(" ,(\"" + col1Values[(int) (Math.random() * 10)] + "\",\""
                        + col2Values[(int) (Math.random() * 3125)] + "\","
                        + ((int) (Math.random() * 20) + 1) + ")");
            }
            sqlBuffer.append(";");
            String sql = new String(sqlBuffer);
            st.execute(sql);
        }
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}

The test results:

start insert data
end insert data
insert time: 15.083 s

(e) Finally, I thought of refining this method further by using a prepared statement, which improves both readability and efficiency, although the efficiency gain is small. Here is the core code of this method:

public static void insertData() {
    try {
        conn.setAutoCommit(false);
        // Build one INSERT with 10,000 placeholder groups: (?,?,?),(?,?,?),...
        StringBuffer sqlBuffer = new StringBuffer("insert into test (col1,col2,col3) values");
        sqlBuffer.append("(?,?,?)");
        for (int j = 2; j <= 10000; j++) {
            sqlBuffer.append(",(?,?,?)");
        }
        sqlBuffer.append(";");
        String sql = new String(sqlBuffer);
        PreparedStatement pst = conn.prepareStatement(sql);
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        // Execute the prepared statement 100 times, 10,000 rows each
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 10000; j++) {
                // Row j uses parameters 3j+1 .. 3j+3 (JDBC parameters are 1-based)
                pst.setString(3 * j + 1, col1Values[(int) (Math.random() * 10)]);
                pst.setString(3 * j + 2, col2Values[(int) (Math.random() * 3125)]);
                pst.setInt(3 * j + 3, (int) (Math.random() * 20) + 1);
            }
            pst.execute();
        }
        conn.commit();
        pst.close();
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}

The test results:

start insert data
end insert data
insert time: 14.47 s
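To put the three measurements side by side, the timings can be converted to rows per second. The sketch below just does that arithmetic on the numbers reported above; the class and method names are hypothetical. It works out to roughly 9,100 rows/s for the single-row prepared statement versus roughly 69,000 rows/s for the multi-row prepared statement, a speedup of about 7.6x.

```java
class InsertThroughput {
    // Rows inserted per second for a measured run.
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    // How many times faster the improved run is than the baseline.
    static double speedup(double baselineSeconds, double improvedSeconds) {
        return baselineSeconds / improvedSeconds;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        // Timings taken from the three test results in this post
        double[] seconds = {110.215, 15.083, 14.47};
        String[] labels = {"prepared, single-row", "concatenated multi-row", "prepared multi-row"};
        for (int i = 0; i < seconds.length; i++) {
            System.out.printf("%s: %.0f rows/s, speedup %.2fx%n",
                    labels[i], rowsPerSecond(rows, seconds[i]), speedup(seconds[0], seconds[i]));
        }
    }
}
```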

Finally, here is the complete code of the final solution:

package godfrey.nju;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TestDB2 {
    private static String dbClassName = "com.mysql.jdbc.Driver";
    private static String dbUrl = "jdbc:mysql://localhost:3306/db_test";
    private static String dbUser = "root";
    private static String dbPwd = "123";
    private static Connection conn = null;
    private static String[] col1Values = { "Mike", "Bob", "Jack", "Alice",
            "Cathy", "Ann", "Betty", "Cindy", "Mary", "Jane" };
    private static String[] col2Values = new String[3125];
    private static int point;

    public static void main(String args[]) {
        insertData();
        //query1();
        //clearData();
    }

    public static void insertData() {
        try {
            conn.setAutoCommit(false);
            // Build one INSERT with 10,000 placeholder groups
            StringBuffer sqlBuffer = new StringBuffer("insert into test (col1,col2,col3) values");
            sqlBuffer.append("(?,?,?)");
            for (int j = 2; j <= 10000; j++) {
                sqlBuffer.append(",(?,?,?)");
            }
            sqlBuffer.append(";");
            String sql = new String(sqlBuffer);
            PreparedStatement pst = conn.prepareStatement(sql);
            System.out.println("start insert data");
            long beginTime = System.currentTimeMillis();
            // 100 executions of 10,000 rows each
            for (int i = 0; i < 100; i++) {
                for (int j = 0; j < 10000; j++) {
                    pst.setString(3 * j + 1, col1Values[(int) (Math.random() * 10)]);
                    pst.setString(3 * j + 2, col2Values[(int) (Math.random() * 3125)]);
                    pst.setInt(3 * j + 3, (int) (Math.random() * 20) + 1);
                }
                pst.execute();
            }
            conn.commit();
            pst.close();
            long endTime = System.currentTimeMillis();
            System.out.println("end insert data");
            System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
            System.out.println();
        } catch (SQLException ce) {
            System.out.println(ce);
        }
    }

    public static void query1() {
        try {
            System.out.println("start query1: 'select count(*) from test group by col1 order by count(*);'");
            long beginTime = System.currentTimeMillis();
            Statement st = conn.createStatement();
            String sql = "select count(*) from test group by col1 order by count(*);";
            ResultSet rs = st.executeQuery(sql);
            long endTime = System.currentTimeMillis();
            System.out.println("result:");
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
            System.out.println("query1 time: " + (double) (endTime - beginTime) / 1000 + " s");
            st.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void clearData() {
        try {
            System.out.println("start delete all data");
            long beginTime = System.currentTimeMillis();
            Statement st = conn.createStatement();
            String sql = "delete from test";
            st.execute(sql);
            st.close();
            conn.close();
            long endTime = System.currentTimeMillis();
            System.out.println("end delete all data");
            System.out.println("delete time: " + (double) (endTime - beginTime) / 1000 + " s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Load the driver, open the connection, and precompute the col2 values
    static {
        try {
            Class.forName(dbClassName).newInstance();
            conn = DriverManager.getConnection(dbUrl, dbUser, dbPwd);
        } catch (Exception e) {
            e.printStackTrace();
        }
        point = 0;
        initCol2Value(5, new StringBuffer(""));
    }

    private static void initCol2Value(int n, StringBuffer str) {
        if (n == 0) {
            col2Values[point++] = new String(str);
            return;
        }
        for (int i = 0; i < 5; i++) {
            StringBuffer strTemp = new StringBuffer(str);
            initCol2Value(n - 1, strTemp.append((char) ('a' + i)));
        }
    }
}
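The listing hardcodes 100 statements of 10,000 rows each, which only works when the total is an exact multiple of the chunk size. As a possible generalization, the sketch below builds the placeholder SQL for any row count and plans the chunk sizes including a final partial chunk. The class and method names are hypothetical and this is not part of the measured code.

```java
import java.util.Arrays;
import java.util.Collections;

class MultiRowInsertPlan {
    // Builds "INSERT ... VALUES (?,?,?),(?,?,?),..." with one group per row.
    static String buildInsertSql(int rows) {
        if (rows < 1) {
            throw new IllegalArgumentException("rows must be >= 1");
        }
        return "INSERT INTO test (col1,col2,col3) VALUES "
                + String.join(",", Collections.nCopies(rows, "(?,?,?)"));
    }

    // Splits totalRows into chunks of chunkSize; the last chunk holds the remainder.
    static int[] chunkSizes(int totalRows, int chunkSize) {
        if (totalRows < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("invalid totalRows or chunkSize");
        }
        int fullChunks = totalRows / chunkSize;
        int remainder = totalRows % chunkSize;
        int[] sizes = new int[fullChunks + (remainder > 0 ? 1 : 0)];
        Arrays.fill(sizes, chunkSize);
        if (remainder > 0) {
            sizes[sizes.length - 1] = remainder;
        }
        return sizes;
    }
}
```

With this plan, a caller would prepare one statement for the full chunk size and, only if a remainder exists, a second statement for the smaller final chunk.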