Hadoop Streaming Made Simple using Joins and Keys with Python
There are a lot of different ways to write MapReduce jobs!!!
Sample code for this post: https://github.com/joestein/amaunet
I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or am creating new ones), and I enjoy the lifecycle in which the initial exploration of the data sets leads to the finalized scripts for an entire job (or series of jobs, as is often the case).
When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! Python programmers can use dumbo or the more recently released mrjob.
I like working under the hood myself and getting down and dirty with the data and here is how you can too.
Let's start by defining two simple sample data sets.
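Data set 1: countries.dat (the first line below names the columns; the sample output further down contains no header rows, so the file itself holds only the data lines)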
name|key
United States|US
Canada|CA
United Kingdom|UK
Italy|IT
Data set 2: customers.dat (again, the first line below only names the columns)
name|type|country
Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA
The requirements: for each country, count how many customers of each type there are, and show the country name listed in countries.dat in the final result (not the two-letter country code).
To do this you need to:
1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results
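Before moving to streaming, it can help to see the four steps as a plain in-memory program. The following is only a sketch for checking expectations on the sample data, not the streaming job itself: the function name count_types_by_country and the "Unknown Country" fallback label are assumptions made for this illustration.

```python
from collections import Counter

# Sample rows copied from countries.dat and customers.dat above.
COUNTRIES = """United States|US
Canada|CA
United Kingdom|UK
Italy|IT"""

CUSTOMERS = """Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA"""


def count_types_by_country(countries_text, customers_text):
    """Join customers to countries in memory and count customer types."""
    # 1) Build the lookup side of the join: two-letter code -> country name.
    names = {}
    for row in countries_text.splitlines():
        name, code = row.split("|")
        names[code] = name

    # 2) and 3) Key each customer on (country name, type) and count.
    counts = Counter()
    for row in customers_text.splitlines():
        _person, ctype, code = row.split("|")
        # Hypothetical fallback label for codes missing from countries.dat.
        country = names.get(code, "%s - Unknown Country" % code)
        counts[(country, ctype)] += 1
    return counts


if __name__ == "__main__":
    # 4) Output the results, one tab-separated line per (country, type).
    totals = count_types_by_country(COUNTRIES, CUSTOMERS)
    for (country, ctype), n in sorted(totals.items()):
        print("%s\t%s\t%s" % (country, ctype, n))
```

This works only because both data sets fit in memory; the streaming version has to get the same result without the lookup dictionary, which is where the sort order and keys come in.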
So first let's code up a quick mapper called smplMapper.py (you can decide if smpl is short for simple or sample).
Now in coding the mapper and reducer in Python the basics are explained nicely here: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ but I am going to dive a bit deeper to tackle our example with some more tactics.
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try:  # sometimes bad data can cause errors, use this how you like to deal with lint and bad data
        personName = "-1"  # default sorted as first
        personType = "-1"  # default sorted as first
        countryName = "-1"  # default sorted as first
        country2digit = "-1"  # default sorted as first

        # remove leading and trailing whitespace
        line = line.strip()

        splits = line.split("|")

        if len(splits) == 2:  # country data
            countryName = splits[0]
            country2digit = splits[1]
        else:  # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        print('%s^%s^%s^%s' % (country2digit, personType, personName, countryName))
    except:  # errors are going to make your job fail which you may or may not want
        pass
Don’t forget:
chmod a+x smplMapper.py
Great! We just took care of #1, but now it is time to test and see what is going to the reducer.
From the command line run:
cat customers.dat countries.dat|./smplMapper.py|sort
Which will result in:
CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1
Notice how this is sorted so that each country line comes first, followed by the people in that country (so we can grab the correct country name as we loop), with the type of customer also sorted within each country so we can properly count the types. =8^)
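To make the role of the sort order concrete, here is a small sketch that walks the sorted lines with itertools.groupby. It is an illustration only, not the reducer used in this post; the function name join_groups and the "Unknown Country" label are assumptions. It relies on the country line sorting ahead of the people lines, which holds in plain byte order because "-" precedes the letters.

```python
from itertools import groupby

# The sorted mapper output shown above (country2digit^type^name^countryName).
SORTED_LINES = """CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1""".splitlines()


def join_groups(sorted_lines):
    """Yield (country_name, [(type, person), ...]) for each country code."""
    rows = (line.split("^") for line in sorted_lines)
    # groupby only merges adjacent rows, which is why the sort matters.
    for code, group in groupby(rows, key=lambda r: r[0]):
        # Hypothetical label used when no country line exists for this code.
        country_name = "%s - Unknown Country" % code
        people = []
        for _code, ctype, person, name in group:
            if person == "-1":
                # Country line: carries the name for the rows that follow.
                country_name = name
            else:
                people.append((ctype, person))
        yield country_name, people


if __name__ == "__main__":
    for country, people in join_groups(SORTED_LINES):
        print("%s: %d customer(s)" % (country, len(people)))
```

Italy comes out with an empty list (a country with no customers) and JA gets the fallback label (customers with no country line), which are the two edge cases a join like this has to handle.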
Let us hold off on #2 for a moment (just hang in there it will all come together soon I promise) and get smplReducer.py working first.
#!/usr/bin/env python

import sys

# state carried from one input line to the next
foundKey = ""
foundValue = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from smplMapper.py
        country2digit, personType, personName, countryName = line.split('^')

        # the first line should be a mapping line, otherwise we need to set the currentCountryName to not known
        if personName == "-1":  # this is a new country which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False  # this is a person we want to count

        if not isCountryMappingLine:  # we only want to count people but use the country line to get the right name
            # first check to see if the 2digit country info matches up, might be unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unkown Country' % currentCountry2digit

            currentKey = '%s\t%s' % (currentCountryName, personType)

            if foundKey != currentKey:  # new combo of keys to count
                if isFirst == 0:
                    print('%s\t%s' % (foundKey, currentCount))
                    currentCount = 0  # reset the count
                else:
                    isFirst = 0

                foundKey = currentKey  # make the found key what we see so when we loop again we can see if we increment or print out

            currentCount += 1  # we increment anything not in the map list
    except:
        pass

# print the count for the last key
try:
    print('%s\t%s' % (foundKey, currentCount))
except:
    pass
Don’t forget:
chmod a+x smplReducer.py
And then run:
cat customers.dat countries.dat|./smplMapper.py|sort|./smplReducer.py
And voila!
Canada not so good 1
Canada valued 3
JA - Unkown Country not so bad 1
United Kingdom not so good 1
United Kingdom valued 1
United States not bad 2
So now #3 and #4 are done but what about #2?
First put the files into Hadoop:
hadoop fs -put ~/mayo/customers.dat .
hadoop fs -put ~/mayo/countries.dat .
And now run it like this (assuming you are running as hadoop in the bin directory):
hadoop jar ../contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
  -D mapred.reduce.tasks=4 \
  -file ~/mayo/smplMapper.py -mapper smplMapper.py \
  -file ~/mayo/smplReducer.py -reducer smplReducer.py \
  -input customers.dat -input countries.dat \
  -output mayo \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf stream.map.output.field.separator=^ \
  -jobconf stream.num.map.output.key.fields=4 \
  -jobconf map.output.key.field.separator=^ \
  -jobconf num.key.fields.for.partition=1
Let us look at what we did:
hadoop fs -cat mayo/part*
Which results in:
Canada not so good 1
Canada valued 3
United Kingdom not so good 1
United Kingdom valued 1
United States not bad 2
JA - Unkown Country not so bad 1
So #2 is the partitioner, KeyFieldBasedPartitioner (explained further on the Hadoop Wiki On Streaming). It lets you partition on a prefix of the key fields you output, configurable through the command line options. Here all four ^-separated fields make up the sort key, but only the first one (the country) decides the partition, so every line for a country is sent to the same reducer, sorted by the remaining fields.
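The effect of partitioning on the first key field while sorting on all four can be imitated locally. The sketch below is a rough simulation and makes no claim about Hadoop internals: partition_for and shuffle are hypothetical helper names, and zlib.crc32 is only a stand-in for whatever hash the real partitioner applies.

```python
import zlib


def partition_for(key, num_reducers, separator="^", num_partition_fields=1):
    """Pick a reducer from the first num_partition_fields fields of the key."""
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    # crc32 is a stand-in hash for illustration; Hadoop uses its own.
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


def shuffle(mapper_lines, num_reducers=4):
    """Group mapper lines by partition, then sort each partition by full key."""
    partitions = [[] for _ in range(num_reducers)]
    for line in mapper_lines:
        partitions[partition_for(line, num_reducers)].append(line)
    return [sorted(p) for p in partitions]


if __name__ == "__main__":
    lines = [
        "CA^valued^Jon York^-1",
        "US^not bad^Alice Bob^-1",
        "CA^-1^-1^Canada",
        "US^-1^-1^United States",
        "CA^not so good^Yo Yo Ma^-1",
    ]
    for i, part in enumerate(shuffle(lines)):
        print("reducer %d: %s" % (i, part))
```

Whatever the hash does, all CA lines land in one partition with the country line first, which is what smplReducer.py needs. It also explains why the Hadoop output above is ordered differently from the local run: with mapred.reduce.tasks=4 each reducer writes its own part file, and hadoop fs -cat mayo/part* simply concatenates them.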
And there you go … Simple Python Scripting Implementing Streaming in Hadoop.
Grab the tar here and give it a spin.
/*
Joe Stein
Twitter: @allthingshadoop
Connect: On Linked In
*/