Badly written regular expressions in Hive make jobs run painfully slowly

Source: Internet | Published by: 不想上班知乎 | Editor: 程序博客网 | Time: 2024/05/29 21:28

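The business assurance department needed to use Hive to compute the previous hour's data in near real time. For example, if it is now 12:00, the job processes the 11:00 data, and the results must be ready within one hour. When they implemented it in Hive, a single map task alone ran for more than an hour, so the requirement could not be met. They then phoned me and asked for help with optimization. This is how it went:

1. The HQL statement: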

    CREATE TABLE weibo_mobile_nginx AS
    SELECT
      split(split(log, '`')[0], '\\|')[0] HOST,
      split(split(log, '`')[0], '\\|')[1] time,
      substr(split(split(split(log, '`')[2], '\\?')[0], ' ')[0], 2) request_type,
      split(split(split(log, '`')[2], '\\?')[0], ' ')[1] interface,
      regexp_extract(log, '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*', 3) version,
      regexp_extract(log, '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__.*', 1) systerm,
      regexp_extract(log, '.*&networktype=([^&%]*).*', 1) net_type,
      split(log, '`')[4] STATUS,
      split(log, '`')[5] client_ip,
      split(log, '`')[6] uid,
      split(log, '`')[8] request_time,
      split(log, '`')[12] request_uid,
      split(log, '`')[13] http_host,
      split(log, '`')[15] upstream_response_time,
      split(log, '`')[16] idc
    FROM ods_wls_wap_base_orig
    WHERE dt = '20150311'
      AND HOUR = '08'
      AND (split(log, '`')[13] = 'api.weibo.cn'
           OR split(log, '`')[13] = 'mapi.weibo.cn');

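The HQL itself is simple. It reads from ods_wls_wap_base_orig, a table with a single column, extracts the required fields from each row using split or regular expression matching, and creates the table weibo_mobile_nginx from the output.

A row of ods_wls_wap_base_orig looks like this: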

web043.mweibo.yhg.sinanode.com|[11/Mar/2015:00:00:01 +0800]`-`"GET /2/remind/unread_count?v_f=2&c=android&wm=9847_0002&remind_version=0&with_settings=1&unread_message=1&from=1051195010&lang=zh_CN&skin=default&with_page_group=1&i=4acbdd0&s=6b2cd11c&gsid=4uQ15a2b3&ext_all=0&idc=&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028 HTTP/1.1"`"R8007_4.3_weibo_5.1.1_android"`200`[121.60.78.23]`3226234350`"-"`0.063`351`-`121.60.78.23`1002792675011956002`api.weibo.cn`-`0.063`yhg20150311 00

The table has only one column, named log.
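To make the field indices used in the query easier to follow, here is a minimal Java sketch that splits a row on the backtick the same way the HQL does. The class and method names (`LogFieldDemo`, `fields`, `host`, `requestType`, `interfaceUrl`) are made up for illustration, and the query string in the sample is shortened from the row shown above.

```java
class LogFieldDemo {
    // Abbreviated copy of the sample row: the query string is shortened,
    // everything else is taken from the row shown in the article.
    static final String SAMPLE =
        "web043.mweibo.yhg.sinanode.com|[11/Mar/2015:00:00:01 +0800]`-`"
        + "\"GET /2/remind/unread_count?v_f=2&c=android"
        + "&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028 HTTP/1.1\"`"
        + "\"R8007_4.3_weibo_5.1.1_android\"`200`[121.60.78.23]`3226234350`\"-\"`0.063`351`-`"
        + "121.60.78.23`1002792675011956002`api.weibo.cn`-`0.063`yhg20150311 00";

    // Same as split(log, '`') in the HQL.
    static String[] fields(String log) {
        return log.split("`");
    }

    // Field 0 holds "host|[time]"; the part before '|' is the host.
    static String host(String log) {
        return fields(log)[0].split("\\|")[0];
    }

    // Field 2 holds the quoted request line; drop the leading quote to get the method.
    static String requestType(String log) {
        return fields(log)[2].split("\\?")[0].split(" ")[0].substring(1);
    }

    // The path before '?' is what the query calls "interface".
    static String interfaceUrl(String log) {
        return fields(log)[2].split("\\?")[0].split(" ")[1];
    }

    public static void main(String[] args) {
        String[] f = fields(SAMPLE);
        System.out.println("field count  : " + f.length);
        System.out.println("host         : " + host(SAMPLE));
        System.out.println("request_type : " + requestType(SAMPLE));
        System.out.println("interface    : " + interfaceUrl(SAMPLE));
        System.out.println("status       : " + f[4]);
        System.out.println("http_host    : " + f[13]);
    }
}
```

Only the version, system and network type fields need a regular expression. Everything else comes from plain positional splitting, which is cheap.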

2. Since the HQL implementation was slow, my first optimization attempt was to write MapReduce by hand

The map code is as follows:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, Text> {
      private Text outputKey = new Text();
      private Text outputValue = new Text();

      // Patterns are compiled once per mapper instance, not once per record.
      Pattern p_per_client = Pattern
          .compile(".*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*");
      Pattern net_type_parent = Pattern.compile(".*&networktype=([^&%]*).*");

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] arr = value.toString().split("`");
        // Keep only requests addressed to api.weibo.cn or mapi.weibo.cn.
        if (arr[13].equals("api.weibo.cn") || arr[13].equals("mapi.weibo.cn")) {
          Matcher matcher = p_per_client.matcher(value.toString());
          String host = "";
          String time = "";
          String request_type = "";
          String interface_url = "";
          String version = "";
          String systerm = "";
          String net_type = "";
          String status = "";
          String client_ip = "";
          String uid = "";
          String request_time = "0";
          String request_uid = "";
          String http_host = "";
          String upstream_response_time = "0";
          String idc = "";

          // Positional fields taken directly from the backtick-separated row.
          host = arr[0].split("\\|")[0];
          time = arr[0].split("\\|")[1];
          request_type = arr[2].split("\\?")[0].split(" ")[0].substring(1);
          interface_url = arr[2].split("\\?")[0].split(" ")[1];

          // Fields that need a regular expression.
          if (matcher.find()) {
            version = matcher.group(1);
            systerm = matcher.group(2);
          }
          Matcher matcher_net = net_type_parent.matcher(value.toString());
          if (matcher_net.find()) {
            net_type = matcher_net.group(1);
          }

          status = arr[4];
          client_ip = arr[5];
          uid = arr[6];
          // "-" means the value is missing; keep the default "0" in that case.
          if (!arr[8].equals("-")) {
            request_time = arr[8];
          }
          request_uid = arr[12];
          http_host = arr[13];
          if (!arr[15].equals("-")) {
            upstream_response_time = arr[15];
          }
          idc = arr[16];

          outputKey.set(host + "\t" + time + "\t" + request_type + "\t"
              + interface_url + "\t" + version + "\t" + systerm + "\t" + net_type
              + "\t" + status + "\t" + client_ip + "\t" + uid + "\t" + request_uid
              + "\t" + http_host + "\t" + idc);
          outputValue.set(request_time + "\t" + upstream_response_time);
          context.write(outputKey, outputValue);
        }
      }
    }

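The Java code is simple too, so I won't go through it. I packaged it and submitted the job. The slowest map task took 40 minutes and the average map time was 30 minutes. The whole job did finish within an hour, but that is still slow, so rewriting it in Java was clearly not the fix.

3. Finally, checking the regular expressions

The MapReduce job written in Java was slow as well, so the cause had to be something else. I looked at the regular expressions in the HQL and changed a few things.

Original: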

    regexp_extract(log, '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*', 3) version,
    regexp_extract(log, '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__.*', 1) systerm,
    regexp_extract(log, '.*&networktype=([^&%]*).*', 1) net_type,

Modified:

    regexp_extract(log, '&ua=[^_]*__[^_]*__([^_]*)__[^_]*__', 1) version,
    regexp_extract(log, '&ua=[^_]*__[^_]*__[^_]*__([^_]*)__', 1) systerm,
    regexp_extract(log, '&networktype=([^&%]*)', 1) net_type,

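The match target is well defined, so I removed the ".*" at the start and end of each regular expression. I also removed the capture groups that were not needed, so every group index became 1.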
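As a rough check that the trimmed expressions still pick out the intended values, the sketch below runs them against the ua parameter from the sample row. It assumes that Hive's regexp_extract returns the requested group of the first match found anywhere in the string and an empty string when nothing matches. `regexpExtract` is a hypothetical stand-in written for this example, not Hive's own code, and the `networktype=wifi` input is invented because the sample row has no such parameter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexpExtractDemo {
    // Hypothetical stand-in for regexp_extract(subject, pattern, index):
    // returns the given group of the first find() hit, or "" if there is none.
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : "";
    }

    // Tail of the query string from the sample row.
    static final String QUERY =
        "ext_all=0&idc=&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028 HTTP/1.1";

    static final String VERSION_RE = "&ua=[^_]*__[^_]*__([^_]*)__[^_]*__";
    static final String SYSTEM_RE = "&ua=[^_]*__[^_]*__[^_]*__([^_]*)__";
    static final String NET_TYPE_RE = "&networktype=([^&%]*)";

    public static void main(String[] args) {
        // No leading ".*" is needed: find() already searches the whole string.
        System.out.println("version  : " + regexpExtract(QUERY, VERSION_RE, 1));
        System.out.println("systerm  : " + regexpExtract(QUERY, SYSTEM_RE, 1));
        // Invented input, used only to exercise the networktype pattern.
        System.out.println("net_type : " + regexpExtract("c=android&networktype=wifi&lang=zh_CN", NET_TYPE_RE, 1));
        // The sample row has no networktype parameter, so this prints an empty value.
        System.out.println("missing  : [" + regexpExtract(QUERY, NET_TYPE_RE, 1) + "]");
    }
}
```

Because the search is unanchored, the literal prefix "&ua=" or "&networktype=" is enough to locate the value. A wrapping ".*" adds work without changing what is captured.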

I changed the regular expressions in the Java code in the same way:

    Pattern p_per_client = Pattern
        .compile("&ua=[^_]*__[^_]*__([^_]*)__([^_]*)__");
    Pattern net_type_parent = Pattern.compile("&networktype=([^&%]*)");

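I submitted both versions for testing and they were dramatically faster. The modified HQL and MapReduce jobs each finished in 6 minutes, with an average map time of 2 minutes. That is a large speedup and it met their requirement.

Summary: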

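1. When a regular expression starts with ".*", the engine first lets ".*" swallow the rest of the line and then backtracks one character at a time looking for the literal that follows. On a line that does not match, this is repeated from every start position, so the cost grows roughly with the square of the line length, which is very slow on long log lines. When the match target is well defined, as with "&ua=" here, remove the ".*".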
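The effect can be seen with a small timing sketch. This is an illustration under assumptions, not a rigorous benchmark: the synthetic line, its length, the repetition count and the names `LeadingDotStarTiming`, `extract`, `lineWithoutNetworkType` and `timeMillis` are all invented for the example, and the numbers printed will vary by JVM and machine. The line deliberately has no networktype parameter, since non-matching lines are where the leading ".*" costs the most.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingDotStarTiming {
    static final Pattern WITH_DOT_STAR = Pattern.compile(".*&networktype=([^&%]*).*");
    static final Pattern WITHOUT_DOT_STAR = Pattern.compile("&networktype=([^&%]*)");

    // Returns group 1 of the first match, or "" when the pattern is not found.
    static String extract(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : "";
    }

    // Builds a synthetic request line of about `length` characters
    // that contains many '&' but no networktype parameter.
    static String lineWithoutNetworkType(int length) {
        StringBuilder sb = new StringBuilder("GET /2/remind/unread_count?");
        while (sb.length() < length) {
            sb.append("k=v&");
        }
        return sb.toString();
    }

    // Runs the extraction `reps` times and returns the elapsed wall-clock milliseconds.
    static long timeMillis(Pattern p, String line, int reps) {
        int sink = 0; // consume the results so the loop is not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += extract(p, line).length();
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (sink < 0) {
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String line = lineWithoutNetworkType(1000);
        int reps = 200;
        System.out.println("with leading .*    : " + timeMillis(WITH_DOT_STAR, line, reps) + " ms");
        System.out.println("without leading .* : " + timeMillis(WITHOUT_DOT_STAR, line, reps) + " ms");
    }
}
```

Both patterns return the same captured value on any given line. Only the amount of backtracking differs.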

2. Whenever you run into this kind of slowness again, always check whether the regular expressions are written badly. Remember this.

0 0