MR Project Optimization Summary


---------------------------------MR runtime parameter tuning

MapReduce task parameter tuning

Hadoop Optimization, Part 1: HDFS/MapReduce

MapReduce-related parameters

MapReduce official documentation

The articles above can serve as references for internal tuning. In my experience, though, parameter tuning is best suited to platform-level adjustments: without a deep understanding of how MR actually works, blindly turning knobs tends to backfire.

Parameters can also be adjusted in code:

configuration.setDouble(Job.SHUFFLE_INPUT_BUFFER_PERCENT, 0.25);
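For reference, the same setting can also be made through its configuration key; a minimal sketch, assuming Hadoop 2.x key names (the 0.66 and 200 values below are purely illustrative, not recommendations from this post):

Configuration conf = job.getConfiguration();
// same key as Job.SHUFFLE_INPUT_BUFFER_PERCENT
conf.setDouble("mapreduce.reduce.shuffle.input.buffer.percent", 0.25);
// other shuffle/sort knobs can be set the same way (values here are illustrative)
conf.setDouble("mapreduce.reduce.shuffle.merge.percent", 0.66);
conf.setInt("mapreduce.task.io.sort.mb", 200);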

---------------------------------Eliminating unnecessary reduce phases

After the map phase come the copy, merge, and reduce stages; these extra steps cost time and resources. If the job only extracts or transforms data and needs no aggregation or further computation, there is no reason to run a reduce phase at all.

Reduce can be disabled as follows:

job.setNumReduceTasks(0);
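As a minimal sketch of the corresponding map-only driver (the job name, inputPath, and outputPath are placeholders, not taken from the original project):

Job job = Job.getInstance(configuration, "map-only-job");
job.setJarByClass(ClearDataFixMapper.class);
job.setMapperClass(ClearDataFixMapper.class);
// no reducers: map output skips sort/shuffle and is written directly by the output format
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);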

For details on the MapReduce execution flow, see:

Hadoop Map/Reduce Execution Flow Explained

Hadoop MapReduce Execution Process Explained


With reduce disabled, the map task can still write to multiple output paths:

public class ClearDataFixMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;
    private static final String SEP_CLMN = BaseConstant.DATA_SEPARATOR_COLUMN;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // resultVal is the processed value and uri is the base output path
        // chosen for this record (their construction is omitted in this snippet)
        mos.write(key, resultVal, uri);
    }

    @Override
    protected void cleanup(Mapper<LongWritable, Text, LongWritable, Text>.Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
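On the driver side, MultipleOutputs still goes through the job's output format; a minimal sketch of the wiring, assuming the default empty part-* files are unwanted (the original post does not show this part, and the named output "clean" is only needed if the mos.write(namedOutput, key, value) variant is used):

// write only through mos; suppress the default (possibly empty) part-* files
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
// optional named output for the mos.write(namedOutput, key, value) variant
MultipleOutputs.addNamedOutput(job, "clean", TextOutputFormat.class,
        LongWritable.class, Text.class);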

This raises a problem: because mos writes its own output in every map task, the number of small files grows along with the output volume. There are two ways to deal with this. The first is to merge the data through Hadoop's IO API before the downstream task runs, but this approach has drawbacks: it cannot conveniently merge SequenceFiles, and it issues a large number of requests to the NameNode, which may result in a RemoteException. The merge looks like this:

// compute total size / block size and merge the small files into one
long blockSize = 0;
long totalSize = 0;
FileStatus[] statuses = hadoopFs.globStatus(new Path(inputPath + "*"));
hadoopFs.createNewFile(new Path(inputPath + "infile"));
FSDataOutputStream outstream = hadoopFs.create(new Path(inputPath + "infile"), true);
FSDataInputStream inputStream = null;
for (FileStatus status : statuses) {
    totalSize += status.getLen();
    blockSize = status.getBlockSize();
    if (!status.getPath().toString().contains("infile")) {
        inputStream = hadoopFs.open(status.getPath());
        IOUtils.copyBytes(inputStream, outstream, configuration, false);
        inputStream.close();
        outstream.flush();
        // delete each source file once it has been merged
        hadoopFs.delete(status.getPath(), true);
    }
}
if (outstream != null) {
    outstream.close();
}

The second approach is to use CombineSequenceFileInputFormat:

// note: if the input path is a directory, do not append a trailing /
MultipleInputs.addInputPath(job, new Path(inputPath), CombineSequenceFileInputFormat.class);
// the map split size must be set, otherwise everything ends up in a single map task
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
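The maximum split size is the knob that matters here: CombineSequenceFileInputFormat keeps packing small files into one split until it reaches that limit (64 MB above), so it effectively determines how many map tasks the job gets instead of one map task per small file.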

---------------------------------Improving the data-sending part: when sending data in bulk, make sure resources are closed. Here data is sent with HttpClient; a demo follows:

public class KVSender2 {

    private static Log log = LogFactory.getLog(KVSender2.class);

    public static void KVSender(String url, Map<String, String> params) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpGet request = new HttpGet(url);
        request.setConfig(RequestConfig.custom()
                .setSocketTimeout(2000)
                .setConnectTimeout(2000)
                .build());
        // create the client
        httpClient = HttpClients.createDefault();
        try {
            // pass the parameters as request headers
            for (Map.Entry<String, String> entry : params.entrySet()) {
                request.setHeader(entry.getKey(), entry.getValue());
            }
            // execute the request and get the response
            response = httpClient.execute(request);
            // if the status code is greater than 200, resend once and log the code
            if (response.getStatusLine().getStatusCode() > HttpStatus.SC_OK) {
                httpClient.execute(request);
                log.info("status code is " + response.getStatusLine().getStatusCode());
            }
            // make sure the response entity is fully consumed
            EntityUtils.consume(response.getEntity());
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // release resources
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpClient != null) {
                try {
                    httpClient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            request.releaseConnection();
        }
    }
}
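When a single map task sends many requests this way, building and closing a fresh CloseableHttpClient per call is itself expensive. A minimal sketch of reusing one pooled client across the task (the variable names and pool sizes below are illustrative, not from the original code):

// build one shared, pooled client (e.g. in the mapper's setup())
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(50);            // illustrative pool size
cm.setDefaultMaxPerRoute(20);  // illustrative per-route limit
CloseableHttpClient sharedClient = HttpClients.custom().setConnectionManager(cm).build();

// per request: close only the response and consume its entity,
// so the connection goes back to the pool
HttpGet req = new HttpGet(url);
try (CloseableHttpResponse resp = sharedClient.execute(req)) {
    EntityUtils.consume(resp.getEntity());
}

// close the client once, when the task is done (e.g. in cleanup())
sharedClient.close();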

