Broken pipe error when running Hadoop Streaming jobs written in Python

Source: Internet  Editor: 程序博客网  Date: 2024/06/05 09:11

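I have recently been learning Hadoop Streaming. I am used to writing the mapper and reducer in Python, and I am starting to appreciate how convenient that is.

During runs, however, the mapper often gets partway through its input, the progress suddenly jumps to 100%, and the task is reported as failed.

The detailed logs show an error similar to the following: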

java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
    at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

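This had me stuck for a while, so I went to Stack Overflow for answers. The explanations there vary widely and point to all sorts of causes. After some analysis, the one that seemed most plausible for my case is that the mapper does not handle dirty data when reading its input: the script hits an exception and exits, so Hadoop's next write to the script's stdin pipe fails with an IO error, Broken pipe.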
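
The mechanism described above can be reproduced outside Hadoop. The following is a minimal sketch, assuming Python 3 on a POSIX system; the child process is only a stand-in for a mapper script that crashed on a bad record, and the function name `write_to_exited_child` is made up for this illustration.

```python
import subprocess
import sys


def write_to_exited_child(payload=b"some\tinput\tline\n"):
    """Return the name of the error raised when writing to a child that has exited."""
    # The child plays the role of a mapper script that died on dirty input.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(1)"],
        stdin=subprocess.PIPE,
    )
    child.wait()  # after this, the reading end of the pipe is closed
    try:
        child.stdin.write(payload)
        child.stdin.flush()  # the data is actually pushed into the pipe here
    except BrokenPipeError as err:
        return type(err).__name__
    finally:
        try:
            child.stdin.close()
        except BrokenPipeError:
            pass  # close() flushes again and can hit the same error
    return None
```

On the Hadoop side the writer is the Java streaming task rather than Python, which is why the same condition shows up there as java.io.IOException: Broken pipe.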

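So I added exception handling to my Python code and, sure enough, the job then ran normally. The position of the try statement also matters: in the code below it sits inside the for loop and wraps the processing of each individual line.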

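import sys

def main(argv):
    for line in sys.stdin:
        # Keep the try inside the loop so one bad line does not end the whole mapper.
        try:
            new_line = line.strip()
            if len(new_line) != 0:
                cols = new_line.split('\t')
                if len(cols) == 7:
                    # getHashCode32 is a helper defined elsewhere in the script (not shown here).
                    hash_code = getHashCode32(cols[0])
                    if hash_code == cols[6]:
                        print(new_line)
        except Exception as ex:
            # Skip dirty records instead of letting the script crash.
            pass

if __name__ == "__main__":
    main(sys.argv)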
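
Swallowing every exception with pass keeps the job alive but hides how many records were dropped. As a variation on the code above, here is a hedged sketch (Python 3) that reports each skipped line on stderr. It relies on Hadoop Streaming's convention that a stderr line of the form reporter:counter:<group>,<counter>,<amount> increments a job counter. The function name `run_mapper`, the counter names `Mapper` and `SkippedLines`, and the `hash_fn` parameter (standing in for the author's getHashCode32, which is not shown in the post) are assumptions made for this illustration.

```python
def run_mapper(lines, hash_fn, out, err):
    """Echo well-formed records to `out`; report failures on `err`.

    Returns the number of lines skipped because processing raised an exception.
    """
    skipped = 0
    for line in lines:
        # The try sits inside the loop: a failure costs one line, not the task.
        try:
            record = line.strip()
            if not record:
                continue
            cols = record.split("\t")
            if len(cols) == 7 and hash_fn(cols[0]) == cols[6]:
                out.write(record + "\n")
        except Exception as ex:
            skipped += 1
            # Hadoop Streaming treats this stderr line as a counter update.
            err.write("reporter:counter:Mapper,SkippedLines,1\n")
            err.write("skipped line: %r\n" % (ex,))
    return skipped
```

In an actual streaming script this would be called as run_mapper(sys.stdin, getHashCode32, sys.stdout, sys.stderr) after importing sys. The skipped count then appears among the job counters, which makes it easier to tell a few dirty records from a systematic input problem.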

