Processing XML files with MapReduce (using the old API)

Source: Internet  Editor: 程序博客网  Time: 2024/06/08 00:42

1) Add this jar to the MapReduce project: hadoop-streaming-2.6.5.jar
2) Main code segment of the main function:

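import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.streaming.StreamInputFormat;
import org.apache.hadoop.streaming.StreamXmlRecordReader;

JobConf jobconf = new JobConf(new Configuration(), MreMroParser.class);
jobconf.setJobName("xmlParser");

// Mark the job as using streaming input
jobconf.set("stream.recordreader.class", StreamXmlRecordReader.class.getName());
// The begin marker is <bulkPmMrDataFile>
jobconf.set("stream.recordreader.begin", "<bulkPmMrDataFile>");
// The end marker is </bulkPmMrDataFile>
jobconf.set("stream.recordreader.end", "</bulkPmMrDataFile>");

// Separate the key and value of the reduce output with a comma
jobconf.set("mapred.textoutputformat.ignoreseparator", "true");
jobconf.set("mapred.textoutputformat.separator", ",");

// The mapper is MreMroParserMapper, the class shown in step 3
jobconf.setMapperClass(MreMroParserMapper.class);
jobconf.setReducerClass(xmlParserReducer.class);

// Set the inputFormat
jobconf.setInputFormat(StreamInputFormat.class);
jobconf.setOutputFormat(TextOutputFormat.class);
jobconf.setOutputKeyClass(Text.class);
jobconf.setOutputValueClass(Text.class);

// Bind the input path to its input format and mapper
MultipleInputs.addInputPath(jobconf, new Path(args[0]), StreamInputFormat.class, MreMroParserMapper.class);
FileOutputFormat.setOutputPath(jobconf, new Path(args[1]));

JobClient.runJob(jobconf);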
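To make the effect of stream.recordreader.begin and stream.recordreader.end easier to picture, here is a minimal sketch in plain Java of cutting a text into records by a begin marker and an end marker. It is not the Hadoop implementation, only an illustration of the idea: the class name MarkerSplitSketch, the method splitRecords, and the eNB elements in the sample input are hypothetical and do not come from the original job.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified picture of begin/end-marker record splitting.
class MarkerSplitSketch {

    // Returns every substring that starts at `begin` and runs through the next `end`.
    static List<String> splitRecords(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(begin, from);
            if (start < 0) {
                break; // no further begin marker
            }
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // an unterminated record is dropped in this sketch
            }
            records.add(input.substring(start, stop + end.length()));
            from = stop + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the real files are not shown in the article.
        String input = "<?xml version=\"1.0\"?>\n"
            + "<bulkPmMrDataFile><eNB id=\"1\"/></bulkPmMrDataFile>\n"
            + "<bulkPmMrDataFile><eNB id=\"2\"/></bulkPmMrDataFile>\n";
        for (String record : splitRecords(input, "<bulkPmMrDataFile>", "</bulkPmMrDataFile>")) {
            System.out.println(record);
        }
    }
}
```

Each string returned here corresponds to one record handed to a single map() call, which is why the mapper can treat its input as a complete XML fragment.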

3) Core code of the map function, MreMroParserMapper.class:

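import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public class MreMroParserMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

  /*
   * The OutputCollector instance is used to write the output
   * (this is the old API, so there is no Context here).
   * @see org.apache.hadoop.mapred.Mapper#map
   */
  @Override
  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // StreamXmlRecordReader puts the whole XML record in the key; the value is empty
    String xmlContent = key.toString();
    System.out.println("'" + xmlContent + "'");

    /* Custom XML parsing function; xmlContent is passed in */
    // ………………
    // I use dom4j. Note that parseText throws the checked DocumentException,
    // which has to be caught or wrapped.
    Document document = DocumentHelper.parseText(xmlContent);
    Element elementRoot = document.getRootElement();
    // After parsing, the multiple records are returned in the List resultDatas
    // ………………

    // Emit the multiple records:
    for (int i = 0; i < resultDatas.size(); i++) {
      String data = dataFormater.formatResultData(resultDatas.get(i));
      Text text = new Text();
      text.set(data);
      output.collect(new Text(resultDatas.get(i).getId()), text);
    }
  }
}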
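The parsing step is elided in the mapper, so the following is only a sketch of turning one XML record into several (id, data) rows. It swaps dom4j for the JDK's built-in DOM parser (javax.xml.parsers) so that it runs without extra jars. The class XmlRecordSketch, the Row type, and the object and v elements with their id attribute are all hypothetical; the real structure of bulkPmMrDataFile is not shown in the article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: JDK DOM parsing instead of dom4j, with a made-up schema.
class XmlRecordSketch {

    // Hypothetical stand-in for one element of resultDatas.
    static final class Row {
        final String id;
        final String data;

        Row(String id, String data) {
            this.id = id;
            this.data = data;
        }
    }

    // Parses one XML record and returns one Row per (assumed) <object> element.
    static List<Row> parse(String xmlContent) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xmlContent)));
        NodeList objects = doc.getElementsByTagName("object");
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < objects.getLength(); i++) {
            Element obj = (Element) objects.item(i);
            // Turn whitespace-separated values into a comma-separated string.
            String values = obj.getTextContent().trim().replaceAll("\\s+", ",");
            rows.add(new Row(obj.getAttribute("id"), values));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bulkPmMrDataFile>"
            + "<object id=\"a1\"><v>1 2 3</v></object>"
            + "<object id=\"b2\"><v>4 5 6</v></object>"
            + "</bulkPmMrDataFile>";
        for (Row row : parse(xml)) {
            // In the mapper this would be output.collect(new Text(row.id), new Text(row.data)).
            System.out.println(row.id + "," + row.data);
        }
    }
}
```

Because one input record yields several rows, the mapper calls output.collect once per row rather than once per map() invocation.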