DL4J Data Preprocessing: Schema

Source: Internet | Published by: 遗传算法密码学 | Editor: 程序博客网 | Time: 2024/06/05 01:13


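In DL4J we sometimes need to preprocess a CSV file, and the first step of that preprocessing is to define the file's Schema. This post walks through how a Schema is defined (it uses the official example, since this part is relatively easy to define). Later posts will cover the Transform and Normalize parts of preprocessing.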

Main methods

Adding String columns
- public Builder addColumnString(String name): adds a single String column; the parameter is the column name
- public Builder addColumnsString(String... columnNames): adds several String columns; the parameters are the column names
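Adding Number columns
- public Builder addColumnInteger(String name): adds a single Integer column; the parameter is the column name
- public Builder addColumnsInteger(String... names): adds several Integer columns; the parameters are the column names
- public Builder addColumnInteger(String name, Integer minAllowedValue, Integer maxAllowedValue): adds a single Integer column and sets the minimum and maximum values it accepts
- Long, Float, and Double work like Integer and are not listed in detail here; the Double method that differs is listed below
- public Builder addColumnDouble(String name, Double minAllowedValue, Double maxAllowedValue, boolean allowNaN, boolean allowInfinite): adds a single Double column; the first three parameters work as for Integer, and the two boolean parameters state whether NaN and whether infinite values are allowed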
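The five-parameter Double overload carries the most rules, so a small sketch of what its parameters express may help. The class below is a hypothetical illustration in plain Java, not DataVec's implementation: the name `DoubleColumnRuleSketch`, the helper `isAllowed`, and the choice to let an allowed infinite value skip the bound checks are all assumptions made for this example. It treats a `null` bound as "no limit on that side".

```java
// Hypothetical illustration only: this is NOT DataVec's implementation.
// The parameters mirror minAllowedValue, maxAllowedValue, allowNaN and
// allowInfinite from the addColumnDouble overload described above.
class DoubleColumnRuleSketch {

    // A null bound means "no limit on that side" (assumption of this sketch).
    static boolean isAllowed(double value, Double min, Double max,
                             boolean allowNaN, boolean allowInfinite) {
        if (Double.isNaN(value)) {
            return allowNaN;
        }
        if (Double.isInfinite(value)) {
            // Assumption: an allowed infinite value is not checked against the bounds.
            return allowInfinite;
        }
        if (min != null && value < min) {
            return false;
        }
        if (max != null && value > max) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Lower bound 0.0, no upper bound, NaN and infinite values rejected.
        System.out.println(isAllowed(100.00, 0.0, null, false, false));      // true
        System.out.println(isAllowed(-0.90, 0.0, null, false, false));       // false
        System.out.println(isAllowed(Double.NaN, 0.0, null, false, false));  // false
        // Same bounds, but NaN and infinite values accepted.
        System.out.println(isAllowed(Double.POSITIVE_INFINITY, 0.0, null, true, true)); // true
    }
}
```

The Integer, Long, and Float overloads with a minimum and maximum can be read the same way, without the two boolean flags.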
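Adding Categorical columns
- public Builder addColumnCategorical(String name, String... stateNames): adds a single Categorical column; the first parameter is the column name, and the remaining parameters are the category values
- public Builder addColumnCategorical(String name, List<String> stateNames): same as above, except that the category values are passed as a List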
Saving a Schema
- public String toYaml(): exports the Schema as a YAML string
- public String toJson(): exports the Schema as a JSON string
Loading a Schema
- public static Schema fromYaml(String yaml): builds a Schema from a YAML string
- public static Schema fromJson(String json): builds a Schema from a JSON string

Demo

Data source

2016-01-01 17:00:00.000,830a7u3,u323fy8902,1,USA,100.00,Legit
2016-01-01 18:03:01.256,830a7u3,9732498oeu,3,FR,73.20,Legit
2016-01-03 02:53:32.231,78ueoau32,w234e989,1,USA,1621.00,Fraud
2016-01-03 09:30:16.832,t842uocd,9732498oeu,4,USA,43.19,Legit
2016-01-04 23:01:52.920,t842uocd,cza8873bm,10,MX,159.65,Legit
2016-01-05 02:28:10.648,t842uocd,fgcq9803,6,CAN,26.33,Fraud
2016-01-05 10:15:36.483,rgc707ke3,tn342v7,2,USA,-0.90,Legit

Schema

Schema inputDataSchema = new Schema.Builder()
        .addColumnString("DateTimeString")
        .addColumnString("CustomerID")
        .addColumnString("MerchantID")
        .addColumnInteger("NumItemsInTransaction")
        .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA","CAN","FR","MX"))
        .addColumnDouble("TransactionAmountUSD",0.0,null,false,false)
        .addColumnCategorical("FraudLabel", Arrays.asList("Fraud","Legit"))
        .build();
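The columns are added in the same order as the fields appear in each CSV row. The sketch below makes that correspondence explicit for the last sample row and checks two of the declared rules by hand. It is a hypothetical illustration in plain Java that does not use DataVec; the class name `ColumnOrderSketch` and the helper `field` are invented for this example, and splitting on `,` is only safe here because the sample data has no quoted or embedded commas.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; it does not use DataVec.
class ColumnOrderSketch {

    // Column names in the order they are added to the Schema.Builder.
    static final List<String> COLUMN_NAMES = Arrays.asList(
            "DateTimeString", "CustomerID", "MerchantID", "NumItemsInTransaction",
            "MerchantCountryCode", "TransactionAmountUSD", "FraudLabel");

    // Category values declared for MerchantCountryCode.
    static final List<String> COUNTRY_CODES = Arrays.asList("USA", "CAN", "FR", "MX");

    // Returns the raw text of the field that lines up with the named column.
    static String field(String csvLine, String columnName) {
        String[] fields = csvLine.split(",");
        return fields[COLUMN_NAMES.indexOf(columnName)];
    }

    public static void main(String[] args) {
        String row = "2016-01-05 10:15:36.483,rgc707ke3,tn342v7,2,USA,-0.90,Legit";

        // Print each column name next to the field at the same position.
        for (String name : COLUMN_NAMES) {
            System.out.println(name + " = " + field(row, name));
        }

        // MerchantCountryCode is one of the declared categories.
        boolean knownCountry = COUNTRY_CODES.contains(field(row, "MerchantCountryCode"));
        // TransactionAmountUSD is declared with a minimum of 0.0; this row holds -0.90.
        boolean amountInRange = Double.parseDouble(field(row, "TransactionAmountUSD")) >= 0.0;

        System.out.println("known country: " + knownCountry);     // known country: true
        System.out.println("amount in range: " + amountInRange);  // amount in range: false
    }
}
```

The last sample row is the only one whose amount falls below the declared minimum of 0.0, which makes it a convenient row for testing any later cleaning step.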

Reading the file

try (RecordReader rr = new CSVRecordReader()) {
    rr.initialize(new FileSplit(new File(csvFilePath)));
    List<List<Writable>> records = new ArrayList<List<Writable>>();
    while (rr.hasNext()) {
        List<Writable> next = rr.next();
        if (next != null)
            records.add(next);
    }
    System.out.println(records);
} catch (IOException | InterruptedException e) {
    e.printStackTrace();
}

Notes

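Attentive readers may have noticed that although a Schema is defined, the file-reading code never references it. Defining a Schema on its own therefore has no effect on how the file is read.
A Schema has to be used together with TransformProcess; the next post will explain how to use TransformProcess.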
