Two ways of fetching data from the distributed platform Hadoop

Source: Internet. Published by: 道琼斯指数行情软件. Editor: 程序博客网. Time: 2024/05/02 00:24

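There are two ways of fetching data on a big data platform:

1. Directly through Hive

So far I have used three ways of running Hive:

1. Enter the hive> prompt and type statements there. This is the normal interactive shell, and you read and write data much as you would in the mysql client. PS: every statement must end with a semicolon (;).

eg:

hive> exit     nothing happens at all

hive> exit;    exits cleanly

2. Run hive -e "select * from table " > file, which is very easy to pick up.

3. Run hive -f file, where file contains the statements to execute.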
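The second and third ways are easy to drive from a script. Below is a minimal sketch, assuming the hive command is on the PATH; the function names build_hive_command and run_hive are hypothetical and exist only for this illustration.

```python
import subprocess


def build_hive_command(query=None, script_path=None):
    """Return the argument list for `hive -e "<query>"` or `hive -f <file>`."""
    if (query is None) == (script_path is None):
        raise ValueError("pass exactly one of query or script_path")
    if query is not None:
        return ["hive", "-e", query]
    return ["hive", "-f", script_path]


def run_hive(output_path, query=None, script_path=None):
    """Run hive and write its stdout to output_path, like `hive -e "..." > file`."""
    command = build_hive_command(query=query, script_path=script_path)
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if hive exits with a non-zero status.
        subprocess.run(command, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the query itself contains quotes.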

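Given how much data there is, be sure to find out how the data is stored before querying. At xx company the data is stored by day.

So a restriction has to be added:

select * from table where concat(year,month,day)=20160315; Note the concat function here: it joins the year, month and day columns into a single value to compare against.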
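When the same query has to run for many days, the date restriction can be generated. This is a minimal sketch, assuming year, month and day are zero-padded string columns (for example '2016', '03', '15'), which the original does not state; the helper names day_filter and day_query and the table name in the test are hypothetical. The value is quoted here because concat returns a string.

```python
from datetime import date


def day_filter(day):
    """Build the WHERE condition for one day of data.

    Assumes year, month and day are zero-padded string columns;
    adjust the format if the table stores them differently.
    """
    return "concat(year,month,day)='%s'" % day.strftime("%Y%m%d")


def day_query(table_name, day):
    """Return a select statement restricted to a single day."""
    return "select * from %s where %s;" % (table_name, day_filter(day))
```

For example, day_query("some_table", date(2016, 3, 15)) yields the statement above with the value quoted.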

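This is a good place to practise writing all kinds of scripts.

2. Through MapReduce

Running jobs on the real platform feels completely different from running them on the pseudo-distributed setup I had built myself.

MR programs can be written in Python or in Java.

The language does not really matter; what matters is understanding how MapReduce is used and how it works inside.

1. The Python version:

Mapper.py (does the actual processing)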

#!/usr/bin/env python
import sys

# Emit "<word>\t1" for every whitespace-separated word read from stdin.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))

Reducer.py (does the counting)

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so equal words are on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:  # ignore the line if count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word:  # do not forget to output the last word
    print("%s\t%s" % (current_word, current_count))

http://blog.matthewrathbone.com/2013/11/17/python-map-reduce-on-hadoop-a-beginners-tutorial.html
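Before submitting anything to the cluster, the logic of the two scripts can be checked in plain Python. The sketch below is a rough stand-in rather than the scripts themselves: it reproduces the map step, uses sorted() in place of Hadoop's shuffle and sort, and then reduces. The function names are hypothetical.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts of consecutive pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort (standing in for the shuffle), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))
```

The real scripts can be exercised the same way from a shell by piping a small sample file through Mapper.py, then sort, then Reducer.py.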

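PS:

Because the data on the cluster is large, always make sure the job runs correctly on a local sample before submitting it.

I have only just started using Python with streaming, and I have not found a good solution for multiple inputs with a single output.

Note: I still need to study Hadoop streaming in more depth. It does not have to be Python; shell scripts work as well.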
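For the multiple inputs, single output case, one possible approach (not something the original tried) is to have the mapper tag each record with the file it came from, so that a single reducer can tell the sources apart. The sketch assumes that Hadoop streaming exports the job setting mapreduce.map.input.file to the mapper as the environment variable mapreduce_map_input_file (map_input_file on older releases); verify this on your own cluster. The function names are hypothetical.

```python
import os
import sys


def source_tag(environ):
    """Return the path of the current input file, or 'unknown'.

    Assumption: streaming exposes mapreduce.map.input.file with the dots
    replaced by underscores (map_input_file on older Hadoop releases).
    """
    return (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")


def tag_lines(lines, tag):
    """Prefix every non-empty line with its source tag, separated by a tab."""
    for line in lines:
        line = line.strip()
        if line:
            yield "%s\t%s" % (tag, line)


def main():
    """Mapper entry point: tag stdin records with their input file."""
    for record in tag_lines(sys.stdin, source_tag(os.environ)):
        print(record)
```

In an actual mapper script, main() would be called at the bottom of the file; the reducer then branches on the first field.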

2. The Java version: I will go into it in detail next time.

0 0