How to process a large data file and load its data

Source: Internet | Published: sap2000 知乎 | Editor: 程序博客网 | Time: 2024/05/21 19:32

I like to use Java for all kinds of small tasks; it makes my work more efficient.

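The customer provided a .sql file on their cluster. It is a little over 2 GB, too large to open in vim or similar editors; I could only page through part of it with less, and the Chinese text was garbled (at that point I did not know the file's encoding). Changing vim's character-set settings fixes the garbling.
I downloaded the file to my machine and tried Notepad++, which reported "File is too big to be opened by Notepad++". MySQL Workbench froze when opening it.
I then used a file splitter to cut the file into 15 equal parts of 150 MB each.
less shows the CREATE TABLE statement, which is written for Oracle; I rewrote it for PostgreSQL and created the table. I then opened 1.zg, deleted everything except the insert into statements, and ran the SQL in Navicat for PostgreSQL, which failed with a character-set error.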
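When the encoding of a file is unknown, one way to narrow it down is to decode a sample of its bytes strictly and see which charset accepts them. The following is a minimal sketch, not code from the original project: the class name CharsetProbe and the method decodesCleanly are hypothetical, and it assumes the JDK in use ships the GBK charset (the encoding this file later turned out to have).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper for illustration; not part of the original project. */
final class CharsetProbe {

    /**
     * Returns true if the whole sample decodes without errors in the named charset.
     * A strict decoder reports malformed input instead of silently replacing it.
     */
    static boolean decodesCleanly(byte[] sample, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A sample that fails as UTF-8 but passes as GBK is a strong hint, not proof. The sample should end on a line break, because a multi-byte character cut in half at the end of the sample is also reported as malformed.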

A sample of the data:

insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172970051', '8986010712769238551', 2400, '123456', 0, 7, '皇岗集散中心（动）08.01.31－30', 2500, 2501, 57440, '1012', null, null, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(动)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);

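Note that the file contains calls such as to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'). I could not find this function in MySQL, but PostgreSQL has it, so there is no need to spend much effort cutting the function out.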
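For a target database that lacks to_date, one alternative would be to rewrite each call into a plain timestamp literal before loading. The sketch below is an illustration only and was not needed here, since PostgreSQL accepts the function. The class ToDateRewriter is hypothetical, and the pattern assumes every call uses exactly the 'dd-mm-yyyy hh24:mi:ss' format seen in the sample.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the original project. */
final class ToDateRewriter {

    // Captures day, month, year and time from to_date('dd-mm-yyyy hh:mm:ss', 'dd-mm-yyyy hh24:mi:ss').
    private static final Pattern TO_DATE = Pattern.compile(
            "to_date\\('(\\d{2})-(\\d{2})-(\\d{4}) (\\d{2}:\\d{2}:\\d{2})',\\s*'dd-mm-yyyy hh24:mi:ss'\\)");

    /** Replaces every matching to_date(...) call with a 'yyyy-mm-dd hh:mm:ss' literal. */
    static String rewrite(String sql) {
        return TO_DATE.matcher(sql).replaceAll("'$3-$2-$1 $4'");
    }
}
```

A text-level rewrite like this would also touch a to_date call that happens to appear inside a string value such as REMARK, so it is only safe when the data is known not to contain one.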
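In practice I wanted to merge each file into statements of the form insert into table values(...),values(...)....
My tests showed that this strategy does not work when function calls are embedded in the values, so I am not posting that code.
Also, as is well known, a file exported with the Oracle tools contains lines like these: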

commit;
prompt 10000 records committed...
commit;
prompt 20000 records committed...
...

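Remember to filter these lines out. Splitting by size also does not respect statement boundaries, so a part may begin or end in the middle of a statement; incomplete SQL can therefore occur only at the head and the tail of a part. The cases are: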

1.zg tail: insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM,

2.zg head: SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)
values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(动)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);

1.zg tail: values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散

2.zg head: 中心(动)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);

1.zg tail: insert into TB_IMSI_PARAM (NUM, IMSI, SIMNO, COST, PWD, MONTHFEE, EXPDATE, REMARK, CALLFEE, CALLEDFEE, GROUP_ID, GROUP_SHORT_NUM, SHORTNUM_FEE, CONF_FEE, IN_DATE, CARD_TYPE, MOD_DATE, IS_LONG_CARD, SMS_MOFEE, PUKCODE, ACTIVECODE, WIFI_MOFEE, AGENTCODE, SALE_PRICE, UPDATE_TIME, PRODUCT_ID)

2.zg head: values ('GD005', '460018172966280', '8986010712769234780', 2400, '123456', 0, 7, '集散中心(动)08.1.9-30', 2500, 2501, 57420, '0034', 0, 0, null, 30, to_date('27-06-2008 10:38:50', 'dd-mm-yyyy hh24:mi:ss'), null, 2013, null, null, 27, 'AYaGD005', null, null, 32);
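In each of these cases the tail of one part followed directly by the head of the next part gives back the complete statement. A minimal sketch of that idea follows; the class ChunkStitcher and its methods are hypothetical and are not the approach used in this post, which collects the fragments and repairs them by hand. The sketch joins raw bytes before decoding, on the assumption that a size-based split can also land in the middle of a two-byte GBK character.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not part of the original project. */
final class ChunkStitcher {

    /** Joins the raw tail of one part with the raw head of the next, then decodes once. */
    static String stitch(byte[] tailBytes, byte[] headBytes, Charset charset) {
        byte[] joined = new byte[tailBytes.length + headBytes.length];
        System.arraycopy(tailBytes, 0, joined, 0, tailBytes.length);
        System.arraycopy(headBytes, 0, joined, tailBytes.length, headBytes.length);
        return new String(joined, charset);
    }

    /** Rough completeness check: an insert header, a values clause and a closing ");". */
    static boolean isCompleteStatement(String sql) {
        String t = sql.trim();
        return t.startsWith("insert into ") && t.contains("values (") && t.endsWith(");");
    }
}
```

Decoding each fragment separately and concatenating the strings would work for cuts that fall between characters, but not for a cut inside a character, which is why the bytes are joined first.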

The code is as follows:

package com.sibat.uhuibao;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Collections;
import java.util.List;

import com.sibat.uhuibao.util.FilesUtil;

/**
 * Filters the split files: complete lines go to one .sql file per part,
 * fragments from the part boundaries go to error.sql.
 *
 * @author nanphonfy
 */
public class BigSQLFinal {

    private static boolean isInsertHeader(String line) {
        return line.contains("insert into TB_IMSI_PARAM (") && line.contains("PRODUCT_ID");
    }

    private static boolean isValuesLine(String line) {
        return line.contains("values (") && line.contains(");");
    }

    private static void writeLine(BufferedWriter writer, String line) throws IOException {
        writer.write(line);
        writer.newLine();
    }

    public static void main(String[] args) throws IOException {
        String readFile = "C:\\Users\\sibat\\Desktop\\1";    // directory holding the 15 split files
        String writeFile = "C:\\Users\\sibat\\Desktop\\2\\"; // directory for the processed files
        String errorFile = "C:\\Users\\sibat\\Desktop\\3\\"; // directory for the rejected lines (error.sql)

        List<String> readPath = new FilesUtil().getFiles(readFile);
        Collections.sort(readPath);
        for (String p : readPath) {
            System.out.println(p);
        }

        // try-with-resources closes (and flushes) every stream, including error.sql.
        try (BufferedWriter ebw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(errorFile + "error.sql"), "UTF-8"))) {
            for (String path : readPath) {
                String[] arr = path.split("\\\\"); // the last element is the file name
                String name = arr[arr.length - 1].replace(".zg", "") + ".sql";
                long start = System.currentTimeMillis();

                // The customer's file is GBK and the database is UTF-8, so transcode while copying.
                try (BufferedReader br = new BufferedReader(
                             new InputStreamReader(new FileInputStream(path), "GBK"));
                     BufferedWriter bw = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(writeFile + name), "UTF-8"))) {
                    int num = 0;                     // non-empty lines handled so far
                    boolean firstLineBroken = false; // true if the first line was a fragment
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (line.isEmpty()) {
                            continue;
                        }
                        if (num == 0) {
                            if (isValuesLine(line)) {
                                // values line whose insert header ended the previous part
                                writeLine(ebw, line);
                            } else if (isInsertHeader(line)) {
                                writeLine(bw, line);
                            } else {
                                // fragment; the second line may then be an orphaned values line
                                writeLine(ebw, line);
                                firstLineBroken = true;
                            }
                        } else if (num == 1 && firstLineBroken) {
                            writeLine(isInsertHeader(line) ? bw : ebw, line);
                        } else if (isInsertHeader(line) || isValuesLine(line)) {
                            writeLine(bw, line);
                        } else if (line.contains("commit;") || line.contains("records committed...")) {
                            continue; // export noise from the Oracle tool
                        } else {
                            writeLine(ebw, line);
                        }
                        num++;
                    }
                }
                ebw.write("=====" + path + "=====\n\n");
                long end = System.currentTimeMillis();
                System.out.println(name + " took " + (end - start) + " ms\n");
            }
        }
    }
}

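After processing, check the head and tail of each file with gvim. Once they are correct, the SQL can be run. Navicat for PostgreSQL does not support running SQL files in bulk: each file has to finish before the next one can be started, which is inefficient. The following code runs them in bulk: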

package com.sibat.uhuibao;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.Statement;
import java.util.LinkedList;
import java.util.List;

import com.sibat.uhuibao.util.DBUtil;
import com.sibat.uhuibao.util.FilesUtil;

/**
 * Reads SQL script files and executes them.
 *
 * @author nanphonfy
 */
public class SqlFileExecutor {

    /**
     * Executes a SQL script file on the given connection.
     *
     * @param conn    database connection
     * @param sqlFile path of the SQL script file
     * @throws Exception if reading or executing the file fails
     */
    public void execute(Connection conn, String sqlFile) throws Exception {
        // Every statement of the file is held in this list, so a large file can exhaust the heap.
        List<String> sqlList = new LinkedList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(sqlFile), "UTF-8"))) {
            String header = null; // "insert into ..." line waiting for its "values ..." line
            String line;
            while ((line = br.readLine()) != null) {
                if (header == null) {
                    header = line;
                } else {
                    sqlList.add(header.concat(line));
                    header = null;
                }
            }
        }
        try (Statement stmt = conn.createStatement()) {
            for (String sql : sqlList) {
                stmt.addBatch(sql);
            }
            stmt.executeBatch();
        }
        System.out.println(sqlFile + " executed successfully");
    }

    public static void main(String[] args) throws Exception {
        SqlFileExecutor executor = new SqlFileExecutor();
        String readFile = "C:\\Users\\sibat\\Desktop\\整理：final"; // files verified to run cleanly
        List<String> readPath = new FilesUtil().getFiles(readFile);
        Connection conn = DBUtil.getConnection();
        try {
            for (String path : readPath) {
                executor.execute(conn, path);
            }
        } finally {
            DBUtil.releaseConnection(conn);
        }
    }
}

FilesUtil.java

package com.sibat.uhuibao.util;

import java.io.File;
import java.util.LinkedList;
import java.util.List;

/**
 * Lists all files in a directory tree.
 *
 * @author nanphonfy
 * @time 2016-08-23 15:34:18
 */
public class FilesUtil {
    private List<String> absolutePaths = new LinkedList<>();

    /**
     * Recursively collects the absolute paths of all files under the given path.
     */
    public List<String> getFiles(String filePath) {
        File root = new File(filePath);
        File[] files = root.listFiles();
        if (files == null) { // not a directory, or it could not be read
            return absolutePaths;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                getFiles(file.getAbsolutePath());
            } else if (!file.getAbsolutePath().contains("_SUCCESS")) {
                absolutePaths.add(file.getAbsolutePath());
            }
        }
        return absolutePaths;
    }
}
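The same directory walk can also be written with java.nio.file.Files.walk, which handles the recursion itself. This is a sketch of an equivalent, assuming Java 8 or later; the class NioFileLister is hypothetical and is not used by the programs in this post. Unlike the version above it returns the paths already sorted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the original project. */
final class NioFileLister {

    /** Returns the sorted absolute paths of all regular files under root, skipping _SUCCESS markers. */
    static List<String> listFiles(String root) throws IOException {
        // The stream holds open directory handles, so it is closed with try-with-resources.
        try (Stream<Path> paths = Files.walk(Paths.get(root))) {
            return paths.filter(Files::isRegularFile)
                    .map(p -> p.toAbsolutePath().toString())
                    .filter(p -> !p.contains("_SUCCESS"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```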

DBUtil.java

package com.sibat.uhuibao.util;

import java.sql.Connection;
import java.sql.SQLException;

import javax.sql.DataSource;

import com.mchange.v2.c3p0.ComboPooledDataSource;

/**
 * @author nanphonfy
 */
public class DBUtil {
    // One data source is enough, hence static.
    private static DataSource dataSource = null;

    static {
        // The data source must be created only once.
        dataSource = new ComboPooledDataSource("XXX");
    }

    /**
     * Returns a connection from the data source.
     *
     * @return a pooled connection
     * @throws SQLException if no connection can be obtained
     */
    public static Connection getConnection() throws SQLException {
        return dataSource.getConnection();
    }

    /**
     * Releases the connection.
     *
     * @param connection the connection to close; may be null
     */
    public static void releaseConnection(Connection connection) {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

c3p0-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<c3p0-config>
    <named-config name="XXX">
        <property name="user">XXX</property>
        <property name="password">XXX</property>
        <property name="driverClass">org.postgresql.Driver</property>
        <property name="jdbcUrl">jdbc:postgresql://localhost:5432/database_name</property>
        <property name="acquireIncrement">5</property>
        <property name="initialPoolSize">10</property>
        <property name="minPoolSize">10</property>
        <property name="maxPoolSize">50</property>
        <property name="checkoutTimeout">0</property>
        <property name="maxStatements">20</property>
        <property name="maxStatementsPerConnection">5</property>
        <!-- Test all idle connections in the pool every 60 seconds. Default: 0 -->
        <property name="idleConnectionTestPeriod">60</property>
    </named-config>
</c3p0-config>

Because there are so many statements, the heap can run out.
The error looks like this: Error java.lang.OutOfMemoryError: GC overhead limit exceeded
The fix is as follows:

Just increase the heap size a little by setting this option in
Run → Run Configurations → Arguments → VM arguments
-Xms3072M -Xmx4096M
Xms - for minimum limit
Xmx - for maximum limit

With this setting, the 15 files can be executed in 3 runs.
Finally, execute the error.sql file.
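Besides raising the heap limit, memory use can be bounded by sending the statements to the database in fixed-size chunks instead of collecting a whole file in a list first. The sketch below is one possible variant, not the code that was actually run: ChunkedSqlRunner, BatchSink and jdbcSink are hypothetical names, and the chunk size (for example 1000) is an assumption to tune. It keeps the same pairing rule as SqlFileExecutor, where each insert line is joined with the values line that follows it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the original project. */
final class ChunkedSqlRunner {

    /** Receives one chunk of complete statements. */
    interface BatchSink {
        void accept(List<String> statements) throws SQLException;
    }

    /**
     * Joins every two consecutive lines into one statement and hands the statements
     * to the sink chunkSize at a time. Returns the number of statements emitted.
     * A trailing unpaired line is dropped.
     */
    static int run(BufferedReader reader, int chunkSize, BatchSink sink)
            throws IOException, SQLException {
        List<String> chunk = new ArrayList<>(chunkSize);
        String header = null;
        int total = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (header == null) {
                header = line;
                continue;
            }
            chunk.add(header + " " + line);
            header = null;
            total++;
            if (chunk.size() == chunkSize) {
                sink.accept(chunk);
                chunk = new ArrayList<>(chunkSize); // at most one chunk is held in memory
            }
        }
        if (!chunk.isEmpty()) {
            sink.accept(chunk);
        }
        return total;
    }

    /** A sink that executes each chunk as one JDBC batch on the given connection. */
    static BatchSink jdbcSink(Connection conn) {
        return statements -> {
            try (Statement stmt = conn.createStatement()) {
                for (String sql : statements) {
                    stmt.addBatch(sql);
                }
                stmt.executeBatch();
            }
        };
    }
}
```

A call would look like ChunkedSqlRunner.run(reader, 1000, ChunkedSqlRunner.jdbcSink(conn)), with reader opened on one of the processed UTF-8 files. Keeping the sink separate from the reading loop also makes the pairing logic testable without a database.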

Then I exported the table from PostgreSQL as a SQL file. The data now looks like this:

INSERT INTO "public"."tb_imsi_param" VALUES ('1', null, '460018172943802', '8986010512769124302', '1000', '123456', '0', '10', null, '2500', '2501', null, null, '0', '0', null, '30', '2008-06-27', '0', '2013', null, null, '27', 'AYaOTHERS', null, null, '32');
INSERT INTO "public"."tb_imsi_param" VALUES ('2', 'GD005', '460018172966280', '8986010712769234780', '2400', '123456', '0', '7', '集散中心(动)08.1.9-30', '2500', '2501', '57420', '0034', '0', '0', null, '30', '2008-06-27', null, '2013', null, null, '27', 'AYaGD005', null, null, '32');

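The PostgreSQL export no longer contains the to_date function. I then wrote one more program that keeps only the data, and handed the result to the data analysts to analyze with Apache Pig. (It is very simple, so I am not posting it.)
Sample: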

'8', 'GD005', '460018172969895', '8986010712769238395', '2400', '123456', '0', '7', '皇岗集散中心（动）08.01.31－30', '2500', '2501', '57440', '1036', null, null, null, '30', '2008-06-27', null, '2013', null, null, '27', 'AYaGD005', null, null, '32'
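The program that produced this sample is not included in the post. A minimal sketch of what such a step could look like follows, assuming every exported line has exactly the INSERT INTO "public"."tb_imsi_param" VALUES (...); shape shown earlier. The class InsertStripper is hypothetical.

```java
/** Hypothetical helper for illustration; not the program used in the post. */
final class InsertStripper {

    private static final String PREFIX = "INSERT INTO \"public\".\"tb_imsi_param\" VALUES (";
    private static final String SUFFIX = ");";

    /**
     * Returns the comma-separated values between the parentheses,
     * or null if the line is not an INSERT of the expected form.
     */
    static String valuesOnly(String line) {
        String t = line.trim();
        if (!t.startsWith(PREFIX) || !t.endsWith(SUFFIX)) {
            return null;
        }
        return t.substring(PREFIX.length(), t.length() - SUFFIX.length());
    }
}
```

Applied line by line with a BufferedReader and a BufferedWriter, this keeps memory use flat regardless of the file size.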

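Summary:

Processing the large file imsi_param2016.sql:
The file is a little over 2 GB and holds the data of one table from the customer's Oracle database. It cannot be opened with vim or similar editors on Linux; less shows only a small part of it.
① Download it to the local machine.
② Tools such as Notepad++ and MySQL Workbench cannot open a file this large.
③ Running the .sql in PostgreSQL fails on import because the character-set encodings do not match.
④ Use the "橘子分割" file splitter to cut the file into 15 equal parts of 150 MB each, then process the parts with a Java program that filters out the complete SQL statements. Convert the Oracle table definition to PostgreSQL (because MySQL has no to_date function).
⑤ Transcode each of the 15 SQL files to UTF-8.
⑥ Have the program extract the incomplete SQL from each file and collect it in a single error.sql, then fix its format by hand.
⑦ Running 15 SQL files one by one is tedious, so write a program that runs them all automatically in one go.
⑧ After the processing above, the Oracle data is converted to PostgreSQL with a 0% error rate.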
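A variation on step ④ that would avoid the broken statements at the part boundaries is to split on line boundaries rather than at a fixed byte offset. The sketch below is an assumption-laden illustration, not what was done here: LineAwareSplitter is a hypothetical class, it assumes every statement starts with a line beginning with "insert into ", and it measures part size in characters rather than bytes. It reads the source line by line, so the 2 GB file never has to fit in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of the original project. */
final class LineAwareSplitter {

    /**
     * Copies the input into numbered files 1.sql, 2.sql, ... in outDir (UTF-8).
     * A new part is started only in front of an "insert into " line, and only once
     * the current part holds at least maxCharsPerPart characters.
     * Returns the number of parts written.
     */
    static int split(BufferedReader reader, Path outDir, long maxCharsPerPart) throws IOException {
        Files.createDirectories(outDir);
        int part = 0;
        long written = 0;
        BufferedWriter out = null;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                boolean startsStatement = line.startsWith("insert into ");
                if (out == null || (startsStatement && written >= maxCharsPerPart)) {
                    if (out != null) {
                        out.close();
                    }
                    part++;
                    out = Files.newBufferedWriter(outDir.resolve(part + ".sql"), StandardCharsets.UTF_8);
                    written = 0;
                }
                out.write(line);
                out.newLine();
                written += line.length() + 1;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return part;
    }
}
```

If the reader is opened with the GBK charset, the same pass also performs the transcoding of step ⑤.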

Reference:
http://stackoverflow.com/questions/1393486/error-java-lang-outofmemoryerror-gc-overhead-limit-exceeded

Author: @nanphonfy
Email: nanphonfy (Nfzone) gmail.com (replace (Nfzone) with @)

0 0