custom map/reduce scripts in hive
First, I have to say that after using Hive for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back. If you are writing straight hadoop jobs for any kind of report, please give hive a shot. You'll thank me.
Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with custom UDFs). Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what hive calls a transform. This is basically a custom streaming map task.
The basics
1. You are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by \t) and writes rows to stdout (again, columns separated by \t). It's worth stressing: you shouldn't be thinking in terms of keys and values here; you need to think in terms of columns.
2. You can write your script in any language you want, but it needs to be available on all machines in the cluster. An easy way to do this is to take advantage of the hadoop distributed cache support, and just use add file /path/to/script within hive. The script will then be distributed and can be run as just ./script (assuming it is executable), or 'perl script.pl' if it's perl, etc.
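The contract above is simple enough to sketch. Here's a hypothetical transform script in Python (the post uses perl, but any language works); the upper-casing of the first column is just illustrative per-row logic, not anything from the post:

```python
#!/usr/bin/env python
# A minimal Hive transform script: rows arrive on stdin, one per line,
# columns separated by tabs; rows go out on stdout the same way.
import sys

def transform_line(line):
    cols = line.rstrip("\n").split("\t")
    cols[0] = cols[0].upper()  # any per-row logic goes here
    return "\t".join(cols)

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform_line(line))
```

Because it's just stdin to stdout, you can pipe a few tab-separated lines through it on your own machine before going anywhere near the cluster.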
An example
This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:
k1=v1,k2=v2,k3=v3,...
I wanted to transform these records into a 2 column table of k/v:
k1 v1
k2 v2
k3 v3
...
I wrote a simple perl script to handle the map, created the 2 column output table, then ran the following:
-- add script to distributed cache
add file /tmp/split_kv.pl;
-- run transform
insert overwrite table test_kv_split
select transform (d.kvs) using './split_kv.pl' as (k, v)
from (select all_kvs as kvs from kv_input) d;
As you can see, you can specify both the input and output columns as part of your transform statement.
split_kv.pl
#!/usr/bin/perl

while (<STDIN>) {
  my $line = $_;
  chomp($line);
  my @kvs = split(/,/, $line);
  foreach my $p (@kvs) {
    my @kv = split(/=/, $p);
    print $kv[0] . "\t" . $kv[1] . "\n";
  }
}
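For readers who don't speak perl, the same splitting logic can be sketched in Python (this assumes well-formed k=v pairs, as in the post's example):

```python
#!/usr/bin/env python
# Sketch of split_kv.pl's logic: each input row holds "k1=v1,k2=v2,...";
# emit one two-column (key, value) row per pair.
import sys

def split_kv(line):
    rows = []
    for pair in line.rstrip("\n").split(","):
        k, _, v = pair.partition("=")
        rows.append(k + "\t" + v)
    return rows

if __name__ == "__main__":
    for line in sys.stdin:
        for row in split_kv(line):
            print(row)
```

Note that one input row becomes many output rows, which is exactly the kind of reshaping a plain select can't do.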
And... that's all there is to it. Next time... a reducer?
===============
In a previous post, I discussed writing custom map scripts in hive. Now, let's talk about reduce tasks.
The basics
As before, you are not writing an org.apache.hadoop.mapred.Reducer class. Your reducer is just a simple script that reads from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).

Another thing to mention is that you can't run a reduce without first doing a map.
The rows to your reduce script will be sorted by key (you specify which column this is), so that all rows with the same key will be consecutive. One thing that's kind of a pain with hive reducers, is that you need to keep track of when keys change yourself. Unlike a hadoop reducer where you get a (K key, Iterator<V> values), here you just get row after row of columns.
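In Python, itertools.groupby can rebuild that hadoop-style (key, values) view from the sorted rows. This is just a sketch of the idea for local experimentation, not anything hive provides:

```python
import itertools

def condense(rows):
    # rows: iterable of (key, value) tuples, already sorted by key,
    # exactly as hive hands them to a reduce script.
    out = []
    for key, group in itertools.groupby(rows, key=lambda r: r[0]):
        # group is an iterator over all consecutive rows sharing this key
        values = sorted(v for _, v in group)
        out.append((key, ",".join(values)))
    return out
```

groupby only merges consecutive equal keys, which is why the sort guarantee matters: without it you'd get the same key emitted more than once.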
An example
We'll use a similar example to the map script. We will attempt to condense a table (kv_input) that looks like:
k1 v1
k2 v1
k4 v1
k2 v3
k3 v1
k1 v2
k4 v2
k2 v2
...
into one (kv_condensed) that looks like:
k1 v1,v2
k2 v1,v2,v3
...
The reduce script
Please forgive my perl. It's been a long time (I usually write these in java, but thought perl would make for an easier blog example).

#!/usr/bin/perl

undef $currentKey;
@vals = ();

while (<STDIN>) {
  chomp();
  processRow(split(/\t/));
}
output();

sub output() {
  print $currentKey . "\t" . join(",", sort @vals) . "\n";
}

sub processRow() {
  my ($k, $v) = @_;
  if (!defined($currentKey)) {
    $currentKey = $k;
    push(@vals, $v);
    return;
  }
  if ($currentKey ne $k) {
    output();
    $currentKey = $k;
    @vals = ($v);
    return;
  }
  push(@vals, $v);
}
As you can see, a lot of the work goes in to just keeping track of when the keys change.
The nice thing about these simple reduce scripts is that it's very easy to test locally, without going through hadoop and hive. Just call your script and pass in some example text separated by tabs. If you do this, you need to remember to sort the input by key before passing into your script (this is usually done by hadoop/hive).
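That whole local test can itself be sketched in Python: sort sample rows by key (the step hadoop/hive normally does for you), then track key changes by hand just as the reduce script does:

```python
def local_test(raw_rows):
    # raw_rows: unsorted "key\tvalue" strings, like a local sample file.
    # hive sorts by key before the reduce, so do the same here first.
    rows = sorted(raw_rows, key=lambda r: r.split("\t")[0])
    out, cur_key, vals = [], None, []
    for row in rows:
        k, v = row.split("\t")
        if cur_key is None:          # very first row
            cur_key, vals = k, [v]
        elif k != cur_key:           # key changed: flush the previous group
            out.append(cur_key + "\t" + ",".join(sorted(vals)))
            cur_key, vals = k, [v]
        else:                        # same key: keep accumulating
            vals.append(v)
    if cur_key is not None:          # flush the final group
        out.append(cur_key + "\t" + ",".join(sorted(vals)))
    return out
```

Forgetting that final flush (the last group never sees a key change) is the classic bug in hand-rolled reducers like this.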
Reducing from Hive
Okay, now that we have our reduce script working, let's run it from Hive.

First, we need to add our map and reduce scripts:
add file identity.pl;
add file condense.pl;
Now for the real work:
from (
  from kv_input
  MAP k, v
  USING './identity.pl'
  as k, v
  cluster by k) map_output
insert overwrite table kv_condensed
reduce k, v
  using './condense.pl'
  as k, v;
This is fairly dense, so I will attempt to give a line by line breakdown:
On line 3 we are specifying the columns to pass to our map script from the input table (specified on line 2).
As I mentioned, you must specify a map script in order to reduce. For this example, we're just using a simple identity perl script. On line 5 we name the columns the map script will output.
Line 6 specifies the column which is the key. This is how the rows will be sorted when passed to your reduce script.
Line 8 specifies the columns to pass into our reducer (from the map output columns on line 5).
Finally, line 10 names the output columns from our reducer.
(Here's my full hive session for this example, and an example input file).
I hope this was helpful. Next time, I'll talk about some java code I put together to simplify the process of writing reduce scripts.
Ref: http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html
http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html