custom map/reduce scripts in hive

First, I have to say that after using Hive for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back. If you are writing straight hadoop jobs for any kind of report, please give hive a shot. You'll thank me.

Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with custom UDFs). Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what hive calls a transform. This is basically a custom streaming map task.

The basics


1. You are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by \t) and writes rows to stdout (again, columns separated by \t). It's worth repeating: you shouldn't be thinking in keys and values, you need to think in columns (see the sketch after this list).

2. You can write your script in any language you want, but it needs to be available on all machines in the cluster. An easy way to do this is to take advantage of hadoop's distributed cache support, and just use add file /path/to/script within hive. The script will then be distributed and can be run as just ./script (assuming it is executable), or 'perl script.pl' if it's perl, etc.
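To make that contract concrete, here is a minimal sketch of a transform script. It's hypothetical (the name uppercase.pl and the two-column layout are my assumptions, not from the original post), but it shows the whole interface: tab-separated columns in, tab-separated columns out.

#!/usr/bin/perl
# uppercase.pl -- hypothetical example: upper-case the first column,
# pass the second column through unchanged. Tab-separated in and out.
while (<STDIN>) {
    chomp;
    my ($first, $second) = split(/\t/);
    print uc($first) . "\t" . $second . "\n";
}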

An example


This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:
k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...

I wanted to transform these records into a 2 column table of k/v:
k1	v1
k2	v2
k3	v3
k1	v1
k2	v2
...

I wrote a simple perl script to handle the map, created the 2 column output table, then ran the following:
-- add script to distributed cache
add file /tmp/split_kv.pl;
-- run transform
insert overwrite table test_kv_split
select
  transform (d.kvs)
    using './split_kv.pl'
    as (k, v)
from
  (select all_kvs as kvs from kv_input) d;

As you can see, you can specify both the input and output columns as part of your transform statement.

split_kv.pl

#!/usr/bin/perl

while (<STDIN>) {
    my $line = $_;
    chomp($line);
    my @kvs = split(/,/, $line);
    foreach my $p (@kvs) {
        my @kv = split(/=/, $p);
        print $kv[0] . "\t" . $kv[1] . "\n";
    }
}
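Because the script just reads stdin and writes stdout, you can sanity-check it locally before involving Hive at all. For example (with made-up sample input):

$ echo 'k1=v1,k2=v2,k3=v3' | ./split_kv.pl
k1	v1
k2	v2
k3	v3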


And... that's all there is to it. Next time... a reducer?

===============================

In a previous post, I discussed writing custom map scripts in hive. Now, let's talk about reduce tasks.

The basics

As before, you are not writing an org.apache.hadoop.mapred.Reducer class. Your reducer is just a simple script that reads from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).

Another thing to mention is that you can't run a reduce without first doing a map.

The rows to your reduce script will be sorted by key (you specify which column this is), so that all rows with the same key will be consecutive. One thing that's kind of a pain with hive reducers is that you need to keep track of when keys change yourself. Unlike a hadoop reducer, where you get a (K key, Iterator<V> values), here you just get row after row of columns.

An example

We'll use a similar example to the map script.

We will attempt to condense a table (kv_input) that looks like:
k1	v1
k2	v1
k4	v1
k2	v3
k3	v1
k1	v2
k4	v2
k2	v2
...

into one (kv_condensed) that looks like:

k1	v1,v2
k2	v1,v2,v3
...

The reduce script

#!/usr/bin/perl

undef $currentKey;
@vals = ();

while (<STDIN>) {
  chomp();
  processRow(split(/\t/));
}
output();

sub output() {
  print $currentKey . "\t" . join(",", sort @vals) . "\n";
}

sub processRow() {
  my ($k, $v) = @_;
  if (! defined($currentKey)) {
    $currentKey = $k;
    push(@vals, $v);
    return;
  }
  if ($currentKey ne $k) {
    output();
    $currentKey = $k;
    @vals = ($v);
    return;
  }
  push(@vals, $v);
}
Please forgive my perl. It's been a long time (I usually write these in java, but thought perl would make for an easier blog example).

As you can see, a lot of the work goes in to just keeping track of when the keys change.

The nice thing about these simple reduce scripts is that it's very easy to test locally, without going through hadoop and hive. Just call your script and pass in some example text separated by tabs. If you do this, you need to remember to sort the input by key before passing into your script (this is usually done by hadoop/hive).
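For example, with a hypothetical tab-separated input file kv.txt, a local run might look like:

$ sort kv.txt | ./condense.pl
k1	v1,v2
k2	v1,v2,v3

Here, sort orders the lines by key, standing in for the shuffle that hadoop/hive would normally do for you.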

Reducing from Hive

Okay, now that we have our reduce script working, let's run it from Hive.

First, we need to add our map and reduce scripts:

add file identity.pl;
add file condense.pl;
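The post doesn't show identity.pl, but since its only job is to echo rows through untouched, a minimal sketch (my assumption, not the author's original) could be:

#!/usr/bin/perl
# identity map: pass every input row through unchanged
while (<STDIN>) {
    print;
}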

Now for the real work:

from (
  from kv_input
  MAP k, v
  USING './identity.pl'
  as k, v
cluster by k) map_output
insert overwrite table kv_condensed
reduce k, v
  using './condense.pl'
  as k, v;

This is fairly dense, so I will attempt to give a line by line breakdown:

On line 3 we are specifying the columns from the input table (named on line 2) to pass to our map script.

As I mentioned, you must specify a map script in order to reduce. For this example, we're just using a simple identity perl script. On line 5 we name the columns the map script will output.

Line 6 (cluster by k) specifies which column is the key; this is how the rows will be sorted when they are passed to your reduce script.

Line 8 specifies the columns to pass into our reducer (from the map output columns on line 5).

Finally, line 10 names the output columns from our reducer.

(Here's my full hive session for this example, and an example input file).

I hope this was helpful. Next time, I'll talk about some java code I put together to simplify the process of writing reduce scripts.

Ref: http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html
     http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html

