Computing TF-IDF with Hadoop, continued: Chinese text support and output file control

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 01:00

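In the article "Computing TF-IDF with Hadoop" (http://blog.csdn.net/jackydai987/article/details/6303459), jackydai987 showed how to compute TF-IDF on Hadoop, but left two problems open:

1. Control over the output files

2. Handling of Chinese characters

This article makes the following improvements to address them.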

For background, see "Multiple file output in Hadoop MapReduce" (http://blog.csdn.net/jackydai987/archive/2011/03/12/6244725.aspx).

That article introduces three classes:

My_LineRead.java (modified here)

MyMultipleOutputFormat.java (used unchanged)

and, embedded in the main program,

MyMultiple.java (modified here)

    public static class MyMultiple extends MyMultipleOutputFormat {

        @Override
        protected String generateFileNameForKeyValue(WritableComparable key,
                Writable value, Configuration conf) {
            // TODO Auto-generated method stub
            return "other.txt";
        }

    }

The specific changes are as follows:

1. Modify MyMultiple.java so that each output file is named after its source file

    public static class MyMultiple extends MyMultipleOutputFormat {

        @Override
        protected String generateFileNameForKeyValue(WritableComparable key,
                Writable value, Configuration conf) {
            // The value is space-separated and its first field is the source file name
            String[] c = value.toString().split(" ");
            // Use the source file name as the output file name
            return c[0] + ".txt";
        }

    }
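The naming rule can be tried outside Hadoop. The following is only a minimal sketch: the class OutputNameSketch and the method fileNameFor are hypothetical names, the Hadoop Writable is replaced by a plain String, and the sample record layout (file name, word, score) is an assumption made for illustration.

```java
// Standalone sketch: Hadoop types are replaced by String so it runs without Hadoop.
final class OutputNameSketch {

    // Assumption: the record value begins with the source file name,
    // followed by a single space, for example "doc1 hadoop 0.25".
    static String fileNameFor(String value) {
        String[] parts = value.split(" ");
        // The first field becomes the output file name.
        return parts[0] + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("doc1 hadoop 0.25"));
        System.out.println(fileNameFor("新闻 分词 0.10"));
    }
}
```

Because the split is on a single space, a source file name that itself contains a space would be cut at the first space, so this rule assumes file names without spaces.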

2. Modify My_LineRead to support reading and writing Chinese text

For the underlying reasoning, see: http://blog.csdn.net/guoery/article/details/8557094

The changes are:

//private static final String utf8 = "UTF-8";

Change this to:

private static final String utf8 = "GBK";
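To see why the charset constant matters, the sketch below encodes the same Chinese string in GBK and in UTF-8 and compares the results. It is an illustration only: CharsetSketch and its methods are hypothetical names, and it assumes the running JDK ships the GBK charset, which full JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSketch {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Encode text into bytes using the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode bytes back into text using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "中文";
        byte[] gbkBytes = encode(text, GBK);
        byte[] utf8Bytes = encode(text, StandardCharsets.UTF_8);
        System.out.println("GBK bytes: " + gbkBytes.length);
        System.out.println("UTF-8 bytes: " + utf8Bytes.length);
        // Decoding with the matching charset restores the text.
        System.out.println(decode(gbkBytes, GBK).equals(text));
        // Decoding GBK bytes as UTF-8 does not restore the text.
        System.out.println(decode(gbkBytes, StandardCharsets.UTF_8).equals(text));
    }
}
```

The same bytes read with the wrong charset come out garbled, which is why the encoding used for writing has to match whatever will later read the output files.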

 

@Override
public void write(K key, V value) throws IOException,
        InterruptedException {
    if (!(key == null || key instanceof NullWritable)) {
        // Write the key if it is not null
        if ((Object) key instanceof Text) {
            // Comment out the original branch below: Text holds UTF-8 bytes,
            // so writing them directly ignores the configured charset
            /*
            Text to = (Text) key;
            out.write(to.getBytes(), 0, to.getLength());
        }
        else
        {
            */
            out.write(key.toString().getBytes(utf8));
        }
        out.write(keyValueSeparator);
    }
    if (!(value == null || value instanceof NullWritable)) {
        // Write the value if it is not null
        if ((Object) value instanceof Text) {
            // Comment out the original branch below, for the same reason
            /*
            Text to = (Text) value;
            out.write(to.getBytes(), 0, to.getLength());
        }
        else
        {
            */
            out.write(value.toString().getBytes(utf8));
        }
        out.write(newline);
    }
}
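The effect of the modified write method can be reproduced without Hadoop. The sketch below is a simplified stand-in, not the My_LineRead class itself: LineWriterSketch, render, the tab separator, and the "\n" line ending are all assumptions chosen for illustration, and keys and values are plain objects rather than Writable instances.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class LineWriterSketch {
    private final OutputStream out;
    private final Charset charset;
    private final byte[] separator;
    private final byte[] newline;

    LineWriterSketch(OutputStream out, String charsetName) {
        this.out = out;
        this.charset = Charset.forName(charsetName);
        // Assumption: tab between key and value, "\n" at the end of a record.
        this.separator = "\t".getBytes(charset);
        this.newline = "\n".getBytes(charset);
    }

    // Always go through toString() so the bytes are produced in the chosen charset.
    void write(Object key, Object value) throws IOException {
        if (key != null) {
            out.write(key.toString().getBytes(charset));
            out.write(separator);
        }
        if (value != null) {
            out.write(value.toString().getBytes(charset));
            out.write(newline);
        }
    }

    // Helper for experiments: returns the bytes one record would produce.
    static byte[] render(String key, String value, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            new LineWriterSketch(buffer, charsetName).write(key, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] record = render("中文", "0.5", "GBK");
        System.out.println(record.length + " bytes written");
        System.out.print(new String(record, Charset.forName("GBK")));
    }
}
```

Re-encoding through toString() costs an extra conversion per record, but it is what lets the charset constant take effect for Text keys and values.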

3. Modify the main program:

// Enable multiple-file output

        job1.setOutputFormatClass(MyMultiple.class);

        job2.setOutputFormatClass(MyMultiple.class);

The output is shown in the figure below:

