MR Project Optimization Summary


---------------------------------MR runtime parameter tuning

MapReduce task parameter tuning

Hadoop Optimization, Part 1: HDFS/MapReduce

MapReduce-related parameters

MapReduce official documentation

The articles above can serve as references for internal tuning. In my experience, though, parameter tuning is best suited to platform-level adjustments: without a deep understanding of how MR actually works, blindly turning knobs tends to backfire.

Parameters can also be adjusted in code:

configuration.setDouble(Job.SHUFFLE_INPUT_BUFFER_PERCENT, 0.25);
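For reference, the same setting can also be made through its configuration key; a minimal sketch, assuming Hadoop 2.x key names (the 0.66 and 200 values below are purely illustrative, not recommendations from this post):

Configuration conf = job.getConfiguration();
// same key as Job.SHUFFLE_INPUT_BUFFER_PERCENT
conf.setDouble("mapreduce.reduce.shuffle.input.buffer.percent", 0.25);
// other shuffle/sort knobs can be set the same way (values here are illustrative)
conf.setDouble("mapreduce.reduce.shuffle.merge.percent", 0.66);
conf.setInt("mapreduce.task.io.sort.mb", 200);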

---------------------------------Eliminating unnecessary reduce phases

After the map phase come the copy, merge, and reduce stages; these extra steps cost time and resources. If the job only extracts or transforms data and needs no aggregation or further computation, there is no reason to run a reduce phase at all.

Reduce can be disabled as follows:

job.setNumReduceTasks(0);
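As a minimal sketch of the corresponding map-only driver (the job name, inputPath, and outputPath are placeholders, not taken from the original project):

Job job = Job.getInstance(configuration, "map-only-job");
job.setJarByClass(ClearDataFixMapper.class);
job.setMapperClass(ClearDataFixMapper.class);
// no reducers: map output skips sort/shuffle and is written directly by the output format
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);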

For details on the MapReduce execution flow, see:

Hadoop Map/Reduce Execution Flow Explained

Hadoop MapReduce Execution Process Explained


With reduce disabled, the map task can still write to multiple output paths:

public class ClearDataFixMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;
    private static final String SEP_CLMN = BaseConstant.DATA_SEPARATOR_COLUMN;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // resultVal is the processed value and uri is the base output path
        // chosen for this record (their construction is omitted in this snippet)
        mos.write(key, resultVal, uri);
    }

    @Override
    protected void cleanup(Mapper<LongWritable, Text, LongWritable, Text>.Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
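On the driver side, MultipleOutputs still goes through the job's output format; a minimal sketch of the wiring, assuming the default empty part-* files are unwanted (the original post does not show this part, and the named output "clean" is only needed if the mos.write(namedOutput, key, value) variant is used):

// write only through mos; suppress the default (possibly empty) part-* files
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
// optional named output for the mos.write(namedOutput, key, value) variant
MultipleOutputs.addNamedOutput(job, "clean", TextOutputFormat.class,
        LongWritable.class, Text.class);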

This raises a problem: because mos writes its own output in every map task, the number of small files grows along with the output volume. There are two ways to deal with this. The first is to merge the data through Hadoop's IO API before the downstream task runs, but this approach has drawbacks: it cannot conveniently merge SequenceFiles, and it issues a large number of requests to the NameNode, which may result in a RemoteException. The merge looks like this:

// compute total size / block size and merge the small files into one
long blockSize = 0;
long totalSize = 0;
FileStatus[] statuses = hadoopFs.globStatus(new Path(inputPath + "*"));
hadoopFs.createNewFile(new Path(inputPath + "infile"));
FSDataOutputStream outstream = hadoopFs.create(new Path(inputPath + "infile"), true);
FSDataInputStream inputStream = null;
for (FileStatus status : statuses) {
    totalSize += status.getLen();
    blockSize = status.getBlockSize();
    if (!status.getPath().toString().contains("infile")) {
        inputStream = hadoopFs.open(status.getPath());
        IOUtils.copyBytes(inputStream, outstream, configuration, false);
        inputStream.close();
        outstream.flush();
        // delete each source file once it has been merged
        hadoopFs.delete(status.getPath(), true);
    }
}
if (outstream != null) {
    outstream.close();
}

The second approach is to use CombineSequenceFileInputFormat:

// note: if the input path is a directory, do not append a trailing /
MultipleInputs.addInputPath(job, new Path(inputPath), CombineSequenceFileInputFormat.class);
// the map split size must be set, otherwise everything ends up in a single map task
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
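The maximum split size is the knob that matters here: CombineSequenceFileInputFormat keeps packing small files into one split until it reaches that limit (64 MB above), so it effectively determines how many map tasks the job gets instead of one map task per small file.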

---------------------------------Improving the data-sending part: when sending data in bulk, make sure resources are closed. Here data is sent with HttpClient; a demo follows:

public class KVSender2 {

    private static Log log = LogFactory.getLog(KVSender2.class);

    public static void KVSender(String url, Map<String, String> params) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpGet request = new HttpGet(url);
        request.setConfig(RequestConfig.custom()
                .setSocketTimeout(2000)
                .setConnectTimeout(2000)
                .build());
        // create the client
        httpClient = HttpClients.createDefault();
        try {
            // pass the parameters as request headers
            for (Map.Entry<String, String> entry : params.entrySet()) {
                request.setHeader(entry.getKey(), entry.getValue());
            }
            // execute the request and get the response
            response = httpClient.execute(request);
            // if the status code is greater than 200, resend once and log the code
            if (response.getStatusLine().getStatusCode() > HttpStatus.SC_OK) {
                httpClient.execute(request);
                log.info("status code is " + response.getStatusLine().getStatusCode());
            }
            // make sure the response entity is fully consumed
            EntityUtils.consume(response.getEntity());
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // release resources
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpClient != null) {
                try {
                    httpClient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            request.releaseConnection();
        }
    }
}
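When a single map task sends many requests this way, building and closing a fresh CloseableHttpClient per call is itself expensive. A minimal sketch of reusing one pooled client across the task (the variable names and pool sizes below are illustrative, not from the original code):

// build one shared, pooled client (e.g. in the mapper's setup())
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(50);            // illustrative pool size
cm.setDefaultMaxPerRoute(20);  // illustrative per-route limit
CloseableHttpClient sharedClient = HttpClients.custom().setConnectionManager(cm).build();

// per request: close only the response and consume its entity,
// so the connection goes back to the pool
HttpGet req = new HttpGet(url);
try (CloseableHttpResponse resp = sharedClient.execute(req)) {
    EntityUtils.consume(resp.getEntity());
}

// close the client once, when the task is done (e.g. in cleanup())
sharedClient.close();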

