Converting Text Files to Sequence Files in Hadoop
When I was working with Hadoop, many of the Mahout tools required their input to be sequence files, so I regularly had to convert text files to sequence files and sequence files back to text (I was reading Mahout's source code at the time and needed to inspect its input files, which is much easier when they are plain text). There are generally two ways to do this. The first is to read the sequence file directly and write the records out as text, following the approach in Hadoop: The Definitive Guide (a sketch follows this paragraph). The second is to write a job and simply set the desired output format, which can likewise turn a sequence file into text; this is the method I usually use. After a long break I tried this again today and, oddly, it no longer worked.
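For reference, the first approach looks roughly like the sketch below: open the file with SequenceFile.Reader and print each key/value pair as text. This is modeled on the reader example in Hadoop: The Definitive Guide; the class name SequenceFileDumper and the argument handling are placeholders of mine, not code from this post.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDumper {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]); // path to the sequence file
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // The key/value classes are recorded in the file header, so they
            // can be instantiated reflectively without compile-time knowledge.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value); // rely on the Writables' toString()
            }
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}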
The program that failed, which is meant to convert a text file into a sequence file, is the following:

package mahout.fansy.canopy.transformdata;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class Text2VectorWritable extends AbstractJob {

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        Job job = new Job(conf, "text2vectorWritable with input:" + input.getName());
        // job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(Writable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setNumReduceTasks(0); // map-only job
        job.setJarByClass(Text2VectorWritable.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Canopy Job failed processing " + input);
        }
        return 0;
    }

    public static class Text2VectorWritableMapper extends Mapper<Writable, Text, Writable, VectorWritable> {
        @Override
        public void map(Writable key, Text value, Context context) throws IOException, InterruptedException {
            // Parse one comma-separated line into a Mahout vector.
            String[] str = value.toString().split(",");
            Vector vector = new RandomAccessSparseVector(str.length);
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }
}

When I ran this, it kept reporting that my map's value type was not Text, and no matter what type I set, the error stayed the same. That made me wonder: does the map output have to be in Text format? So I added a Reducer to the program above, as follows:
package mahout.fansy.canopy.transformdata;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class Text2VectorWritableCopy extends AbstractJob {

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        Job job = new Job(conf, "text2vectorWritableCopy with input:" + input.getName());
        // job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setReducerClass(Text2VectorWritableReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        job.setJarByClass(Text2VectorWritableCopy.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Canopy Job failed processing " + input);
        }
        return 0;
    }

    public static class Text2VectorWritableMapper extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Parse one comma-separated line into a Mahout vector.
            String[] str = value.toString().split(",");
            Vector vector = new RandomAccessSparseVector(str.length);
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }

    public static class Text2VectorWritableReducer extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
        @Override
        public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context) throws IOException, InterruptedException {
            // Identity reduce: pass every vector through unchanged.
            for (VectorWritable v : values) {
                context.write(key, v);
            }
        }
    }
}

This time it ran fine.
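As a usage note, a class extending Mahout's AbstractJob implements Hadoop's Tool interface, so it would normally be launched through ToolRunner. A minimal, hypothetical driver (the class name and HDFS paths below are placeholders, not from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class Text2VectorWritableDriver {
    public static void main(String[] args) throws Exception {
        // Placeholder HDFS paths; substitute real input/output locations.
        String[] jobArgs = {"--input", "/user/fansy/input/data.txt",
                            "--output", "/user/fansy/output"};
        int exitCode = ToolRunner.run(new Configuration(),
                new Text2VectorWritableCopy(), jobArgs);
        System.exit(exitCode);
    }
}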
At the time I left open the question of whether a map's output must be Text. A likely explanation, in hindsight: in a map-only job (setNumReduceTasks(0)), the map output is written straight through the OutputFormat, and SequenceFileOutputFormat validates the records against the job's final output types set by setOutputKeyClass()/setOutputValueClass(), which default to LongWritable and Text. setMapOutputKeyClass()/setMapOutputValueClass() only describe the intermediate shuffle types, so in the first program the value class the output format checked against was still the Text default. The map output does not have to be Text; the final output value class was simply never changed.
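Under that explanation, the first, map-only version should work with a small change: keep setNumReduceTasks(0) but declare the job's final output classes. A sketch of the relevant configuration lines, untested here and based on how Hadoop resolves output types:

// Hypothetical fix for the map-only version: with no reduce phase,
// SequenceFileOutputFormat validates against the job's FINAL output types,
// so set those instead of (or in addition to) the map output types.
job.setMapperClass(Text2VectorWritableMapper.class);
job.setNumReduceTasks(0);                      // still map-only
job.setOutputKeyClass(LongWritable.class);     // key type TextInputFormat actually supplies
job.setOutputValueClass(VectorWritable.class); // previously left at the Text default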