HDFS Installation and MapReduce (Python)

Source: Internet · Editor: 程序博客网 · Time: 2024/05/18 20:33

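HDFS installation

Install a virtual machine

http://www.powerxing.com/install-hadoop/
Create a hadoop group and a hadoop user and grant the user root (sudo) privileges (I skipped this step and used root directly)
sudo apt-get update
sudo apt-get install openssh-server openssh-client
Set up passwordless SSH trust (for the user you run Hadoop as)
Install the JDK: sudo apt-get install openjdk-8-jdk, then set JAVA_HOME
Install Hadoop: unpack the archive
Configure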

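The sbin directory under the installation directory holds the scripts for starting and stopping the daemons; bin holds the commands themselves, including hdfs dfs and its subcommands (mkdir and so on).
The hdfs commands are essentially a set of wrappers that behave like their shell counterparts. In my setup, even a directory created with a plain shell mkdir can be the target of hdfs dfs -put, so a directory does not have to be created through hdfs before you can operate on it. This holds when the default filesystem is the local one (as in standalone mode); on a real HDFS cluster the HDFS namespace is separate from the local filesystem.

The HDFS NameNode stores the file metadata, while the DataNodes store the file data itself.
To launch a streaming job, use the hadoop command rather than hdfs:
hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -mapper /hdfs_map.py -input /hadoop -output /hadoop_out
If no reducer is specified, the map output is written to the output directory. If the input is a zip file, a job run this way treats the whole zip as its input; the archive is not unpacked first.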
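Since the job does not unpack a zip archive, one option is to extract its members locally and upload the plain files instead. The following is only a minimal sketch using the standard zipfile module; `extract_members` and the file names are hypothetical, and the upload itself is left to hdfs dfs -put.

```python
import os
import tempfile
import zipfile


def extract_members(zip_path, out_dir):
    """Unpack every member of zip_path into out_dir and return the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        # Directory entries end with '/'; keep only real files.
        return sorted(n for n in archive.namelist() if not n.endswith('/'))


if __name__ == '__main__':
    # Build a throwaway archive so the sketch can run on its own.
    work = tempfile.mkdtemp()
    zip_path = os.path.join(work, 'input.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('file1', 'hello\n')
    print(extract_members(zip_path, os.path.join(work, 'unpacked')))
```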

Usage

**The Python script must be executable (chmod +x) and start with a #!/usr/bin/env python shebang line**

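#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # The parenthesized form works under both Python 2 and Python 3.
        print('%s\t%s' % (word, 1))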

I created three files with the following contents:
file1 hello
file2 hello
file3 hello
The job produced a single output file, part-00000.
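To check a mapper before submitting the job, the map and sort steps can be imitated locally. This is a minimal sketch and not what Hadoop actually runs: `run_mapper` and `simulate_streaming` are hypothetical helper names, and the plain sort only approximates the shuffle between map and reduce.

```python
def run_mapper(lines):
    """Apply the word-count mapper logic to an iterable of input lines."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


def simulate_streaming(files):
    """Imitate a streaming job without a custom reducer: map each line, then sort.

    `files` maps a file name to its text content (a hypothetical stand-in
    for the HDFS input directory).
    """
    mapped = []
    for name in sorted(files):
        mapped.extend(run_mapper(files[name].splitlines()))
    return sorted(mapped)


if __name__ == '__main__':
    inputs = {'file1': 'hello', 'file2': 'hello', 'file3': 'hello'}
    for record in simulate_streaming(inputs):
        print(record)
```

For the three files above, the sketch yields three identical hello/1 records, one per input file.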

When the code is changed to

#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('hello world')

Output

After adding a reducer

reduce.py
print('reduce')

With this, part-00000 contains only a single line, reduce.
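The reducer above ignores its input. For an actual word count, a reducer would read the sorted word/count lines and add up the counts per word. The following is a minimal sketch under that assumption; `reduce_counts` is a hypothetical name and this is not the author's script.

```python
from itertools import groupby


def reduce_counts(lines):
    """Sum the counts of consecutive lines that share the same key.

    Assumes every line looks like "word<TAB>count" and that the input is
    sorted by key, which is how streaming delivers map output to a reducer.
    """
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield '%s\t%d' % (word, total)


if __name__ == '__main__':
    # In a real reducer script the argument would be sys.stdin.
    sample = ['hello\t1\n', 'hello\t1\n', 'hello\t1\n']
    for record in reduce_counts(sample):
        print(record)
```

Such a script would be passed to the job with the -reducer option, alongside -mapper.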

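Notes

MapReduce processes the input line by line, spreads the lines across tasks in no guaranteed order, and sorts the map output. It is therefore a poor fit when the analysis has to look at one file as a whole.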
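When per-file context matters, one possible workaround is to have the mapper tag each record with the file it came from. The sketch below assumes that streaming exposes the current input path in the `mapreduce_map_input_file` environment variable (older releases used `map_input_file`); this should be checked against the Hadoop version in use. `tag_lines` and the example path are hypothetical.

```python
import os


def tag_lines(lines, environ=os.environ):
    """Prefix each input line with the name of the file it was read from."""
    # Assumption: streaming exports the input path under one of these names.
    path = environ.get('mapreduce_map_input_file',
                       environ.get('map_input_file', 'unknown'))
    name = os.path.basename(path)
    for line in lines:
        yield '%s\t%s' % (name, line.strip())


if __name__ == '__main__':
    # A made-up environment stands in for the one a map task would see.
    fake_env = {'mapreduce_map_input_file': 'hdfs://localhost:9000/hadoop/file1'}
    for record in tag_lines(['hello\n'], fake_env):
        print(record)
```

With the file name as the key, all lines of one file reach the same reducer grouped together, although their original order within the file is still not guaranteed.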

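Passing parameters to the script in streaming

http://www.cnblogs.com/zhengrunjian/p/4536572.html
Use the -cmdenv option: -cmdenv NAME=VALUE passes an environment variable to the mapper and reducer commands.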
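On the script side, a value passed with -cmdenv shows up as an ordinary environment variable. A minimal sketch, assuming the job was started with something like -cmdenv MIN_LEN=3 (`MIN_LEN` and `filter_words` are hypothetical names chosen for illustration):

```python
import os


def filter_words(lines, environ=os.environ):
    """Emit "word<TAB>1" only for words at least MIN_LEN characters long."""
    # MIN_LEN is a made-up parameter name; default to 1 when it is not set.
    min_len = int(environ.get('MIN_LEN', '1'))
    for line in lines:
        for word in line.strip().split():
            if len(word) >= min_len:
                yield '%s\t%s' % (word, 1)


if __name__ == '__main__':
    # A plain dict stands in for the environment the streaming task would see.
    for record in filter_words(['hi hello\n'], {'MIN_LEN': '3'}):
        print(record)
```

Taking the environment as an argument keeps the function easy to test locally; inside a real mapper it would simply be called with sys.stdin and the default environment.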
