Technical articles on big data

The Spark distributed computing platform

From: Big Data Technology | Author: hzguoding | 2014-08-13 14:24

Introduction to Spark

UC Berkeley AMPLab (2011)

Current version: 0.8.1; now an Apache Incubator project

Lightning-Fast Cluster Computing

http://spark.incubator.apache.org/

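Deploying Spark

Cluster Mode Overview

SparkContext is the core handle through which a user program submits and controls its jobs.

The Cluster Manager is the component that controls the cluster.

Three cluster managers are currently supported: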

1 Standalone

2 Apache Mesos

3 Hadoop YARN

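Standalone deployment (Master-Slave)

1 Download and build

2 Edit the configuration files

3 Run the startup scripts

Programming with Spark

Spark itself is written in Scala

Programming interfaces are available for Scala, Java, and Python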

Fast Programming

RDD(Resilient Distributed Datasets)

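An RDD is a globally abstracted handle to distributed storage.

In a MapReduce job, by comparison, the input and output must be given as explicit HDFS paths, and those paths have to be kept track of.

An RDD can be created in three ways:

1 From an in-memory collection

2 From a text file

3 From a Hadoop input format

RDD(Resilient Distributed Datasets)

From an in-memory collection:

List<Double> list = new ArrayList<Double>();

JavaRDD<Double> rdds = sc.parallelize(list);

From a text file:

JavaRDD<String> rdds = sc.textFile("hdfs://xxx/user/files/wordcount.txt");

From a Hadoop input file:

JavaPairRDD<LongWritable, Text> rdds = sc.hadoopFile("hdfs://xxx/user/files/wordcount.txt", TextInputFormat.class, LongWritable.class, Text.class);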

RDD Operation

RDDs support operations including map, reduce, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, join, cogroup, cartesian, count, foreach, saveAsTextFile, saveAsSequenceFile, and more.
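
Most of the operations listed above are ordinary functional transformations. As a rough local analogy only, the sketch below chains map, filter, and reduce with java.util.stream on a single machine. It does not use Spark's API, and the class name RddOpsSketch and the sample numbers are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class RddOpsSketch {
    // Local stand-in for a chain such as map -> filter -> reduce on an RDD.
    static double sumOfLargeSquares(List<Double> values) {
        return values.stream()
                .map(x -> x * x)            // map: transform each element
                .filter(x -> x > 4.0)       // filter: keep elements matching a predicate
                .reduce(0.0, Double::sum);  // reduce: fold what is left into one value
    }

    public static void main(String[] args) {
        List<Double> list = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        // The squares above 4 are 9 and 16.
        System.out.println(sumOfLargeSquares(list));
    }
}
```

On a cluster the same chain would run partition by partition across workers; that distribution is what the stream version leaves out.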

RDD Persistence

Several persistence levels are available: MEMORY_ONLY (the level used by cache()), MEMORY_AND_DISK, and DISK_ONLY.

rdds.cache(), rdds.persist(storage_level)

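Advantages of in-memory computing

Coding is simple, and the operations applied to each data handle are plain to see.

System stability is far behind Hadoop's.

For iterative computations such as regression, it pays off when there is enough memory.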

================================================================================

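Storm basics

From: Big Data Technology | Author: 李刚锐 | 2014-08-13 11:06

This article introduces the basics of Storm and Storm Trident and is meant to help beginners get up to speed quickly. Basic concepts are only touched on briefly; the focus is on the more important material in between.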

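I. Storm

A Storm job is called a Topology, analogous to a Job in MapReduce.

A Storm cluster has two kinds of nodes: the master node and worker nodes. Their roles are as follows:

The master node runs a daemon called Nimbus, which distributes code across the Storm cluster, assigns tasks to worker machines, and monitors the state of the cluster. Nimbus plays a role similar to the JobTracker in Hadoop.

Each worker node runs a daemon called Supervisor. The Supervisor listens for work that Nimbus assigns to its machine and starts or stops worker processes accordingly. Each worker process executes a subset of a Topology; a running Topology consists of many worker processes spread across different worker nodes.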

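II. Topology

A Topology is made up of many functional nodes that together form a directed graph; data can be passed between any two connected nodes.

There are two kinds of node: Spout and Bolt. A Spout is a data source, i.e. the starting point of the whole Topology; Bolts are the computation nodes in between.

Data travels between nodes as tuples; the tuple is the smallest unit of transfer.

In the topology graph there is a single connection between two nodes, but at runtime there are actually many concurrent ones.

Understanding the parallelism:

A topology runs in several worker processes;

each worker process contains multiple threads, and each thread is an executor that belongs to one Bolt or Spout;

each executor runs one or more tasks (one by default);

each task performs the actual data processing.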

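In code, parallelism is configured with the following:

ParallelismHint: the initial number of executors (threads) for a bolt;

setNumTasks (called on the declarer returned by setBolt or setSpout): the number of tasks;

Config.setNumWorkers: the number of worker processes.

Because of this parallelism, a node that has finished processing a tuple actually has many downstream instances to choose from. Which one should receive the tuple?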

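Storm calls this choice Stream Grouping. The available groupings are:
1. Shuffle Grouping: tuples are distributed randomly across the tasks of the downstream Bolt.
2. Fields Grouping: tuples with the same values in the grouping Fields always go to the same task of the Bolt.
3. All Grouping: broadcast; every task of the Bolt receives every tuple.
4. Global Grouping: the entire stream goes to the task with the lowest task id.
5. None Grouping: currently has the same effect as Shuffle. The intended difference is that Storm may run the Bolt in the same thread as the Bolt or Spout it subscribes to.
6. Direct Grouping: a more involved grouping in which the producer of a tuple decides which task of the consumer receives it.
7. Local or shuffle Grouping: a random grouping optimized for efficiency. If the target Bolt has one or more tasks in the same worker process as the sender, tuples are shuffled among just those in-process tasks; otherwise it behaves like an ordinary Shuffle Grouping.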
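
The difference between the first few groupings comes down to how a target task is picked for each tuple. The sketch below is a simplified model of that choice, not Storm's implementation: the class GroupingSketch, its method names, and the hash-modulo rule for fields grouping are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class GroupingSketch {
    // Shuffle grouping: any of the numTasks tasks may receive the tuple.
    static int shuffleTarget(int numTasks, Random random) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: the target depends only on the grouping field values,
    // so tuples with equal values always land on the same task index.
    static int fieldsTarget(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Global grouping: always the lowest task id.
    static int globalTarget(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    public static void main(String[] args) {
        List<Object> key = Arrays.<Object>asList("cat");
        System.out.println("fields  -> task index " + fieldsTarget(key, 4));
        System.out.println("shuffle -> task index " + shuffleTarget(4, new Random()));
        System.out.println("global  -> task id " + globalTarget(new int[] {7, 3, 5}));
    }
}
```

Fields grouping is the one to reach for when a downstream Bolt keeps per-key state, such as a count per word, because every tuple for a given key then reaches the same task.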
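In code, data is passed between nodes with collector.emit(new Values(tuple)). Each tuple is a Values object, i.e. a list of Objects, and can carry many items, for example new Values(123, "String", new Date(), 123L, 12.3F, null).
On the receiving node the value can be cast: whatever tuple.getValueByField(_sourceName) returns can be cast directly to the type of the object the upstream node emitted
(TODO: only tested with basic types, including List; not sure whether arbitrary classes can be cast directly).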
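Sending tuples one at a time is sometimes inefficient, so Storm later added batch transfer, in which several tuples travel together in one batch.
Even that is not always efficient, which led to Batch Transaction: the tuples in a batch are first combined by computation, so less data has to be transferred. The process has two phases:
1. Processing Phase: the data in a batch is computed. Batches can be processed in parallel, which improves throughput.
2. Commit Phase: the batch results are committed in strict order, which is what guarantees the transaction.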
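
The two phases can be modelled in a few lines of plain Java. This is only a sketch of the idea under stated assumptions, not Storm code: the class BatchCommitSketch, the thread pool size, and the use of a sum as the per-batch computation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchCommitSketch {
    // Processes all batches in parallel, then commits the partial results in batch order.
    // Returns the running total recorded after each commit.
    static List<Long> run(List<List<Integer>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Processing phase: batches are independent, so they may run concurrently.
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> batch : batches) {
                Callable<Long> work = () -> {
                    long sum = 0;
                    for (int value : batch) {
                        sum += value;
                    }
                    return sum;
                };
                partials.add(pool.submit(work));
            }
            // Commit phase: apply the results one at a time, strictly in batch order.
            List<Long> totals = new ArrayList<>();
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
                totals.add(total);
            }
            return totals;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5))));
    }
}
```

Whichever batch finishes processing first, the commit loop still waits for batch 1 before batch 2, so the committed totals are the same on every run.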
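Also, in Storm both Spouts and Bolts are serializable (they implement Serializable).
On serialization:

The member variables of a spout or bolt must be serializable because the component instance is created on the client when the topology is built, then serialized and shipped to the cluster with the topology. Every worker, including a replacement worker that the supervisor starts after a worker dies, deserializes that instance instead of calling the constructor again, and then calls open (prepare for a bolt).
Therefore every member variable that is set in the constructor of a Spout or Bolt must be Serializable. Other member variables need not be, as long as the constructor does not set them: if they are initialized in open, for example, they are rebuilt whenever a worker starts, because open is called again.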
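
The rule about constructors and open can be seen with plain Java serialization. The sketch below uses a hypothetical Component class in place of a real Spout or Bolt; the field names and the roundTrip helper are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {
    // Hypothetical component standing in for a Spout or Bolt.
    static class Component implements Serializable {
        private static final long serialVersionUID = 1L;
        static int constructorCalls = 0;

        private final String sourceName;      // set in the constructor, so it must be serializable
        private transient Object connection;  // stands in for a non-serializable resource

        Component(String sourceName) {
            constructorCalls++;
            this.sourceName = sourceName;
        }

        // Called after deserialization to rebuild whatever was not shipped.
        void open() {
            connection = new Object();
        }

        String sourceName() {
            return sourceName;
        }

        boolean isOpen() {
            return connection != null;
        }
    }

    // Serializes the component to bytes and reads it back, as shipping it to a worker would.
    static Component roundTrip(Component component) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(component);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Component) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Component copy = roundTrip(new Component("spout1"));
        System.out.println(copy.sourceName() + ", open before open(): " + copy.isOpen());
    }
}
```

The copy keeps sourceName, the constructor is not run a second time, and the transient field stays null until open is called on the copy.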

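III. Storm Trident

Storm Trident is a layer on top of Storm, and the code it wraps is efficient. It lets us develop more quickly.

Trident packages functionality into primitives: joins, aggregations, grouping, user-defined functions, and filters. The simplest example, a word count, illustrates them: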

TridentTopology topology = new TridentTopology();

TridentState wordCounts =

topology.newStream("spout1", spout)

.each(new Fields("sentence"), new Split(), new Fields("word"))

.groupBy(new Fields("word"))

.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))

.parallelismHint(6);

topology.newDRPCStream("words")

.stateQuery(wordCounts, new Fields("args"), new MapGet(), new Fields("count"));

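First, a FixedBatchSpout is created as the source. It keeps emitting data by calling emitBatch; what it emits are individual sentences.

Next, a TridentTopology is created, and a TridentState is built on the stream coming from the Spout: each, groupBy, and the other operators process the sentences and compute the counts. The results are stored in a state object of type TridentState, held in the variable wordCounts in the code above.
Then a DRPC stream is created so that external callers can query that TridentState.
An external caller runs
new DRPCClient("server", port).execute("words", "cat the dog jumped"), which is a remote call that looks up, among all the counts accumulated so far, the counts for the words in "cat the dog jumped".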
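
To make the data flow concrete, here is a plain-Java model of what the word-count pipeline computes. It is not Trident code: the class WordCountSketch, the HashMap standing in for the TridentState, and the two sample sentences are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WordCountSketch {
    // Stands in for the TridentState named wordCounts.
    private final Map<String, Long> wordCounts = new HashMap<>();

    // Split each sentence into words, group by word, and keep a running count.
    void processBatch(List<String> sentences) {
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                wordCounts.merge(word, 1L, Long::sum);
            }
        }
    }

    // Look up the stored count of every word in the query string (0 if never seen).
    Map<String, Long> query(String args) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String word : args.split(" ")) {
            result.put(word, wordCounts.getOrDefault(word, 0L));
        }
        return result;
    }

    public static void main(String[] args) {
        WordCountSketch sketch = new WordCountSketch();
        sketch.processBatch(Arrays.asList(
                "the cow jumped over the moon",
                "the man went to the store"));
        System.out.println(sketch.query("cat the dog jumped"));
    }
}
```

In the real topology the counting is spread over parallel tasks and the state can live in an external store; the model only shows the split, group, count, and query steps.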

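IV. Aggregation

Aggregation is similar to select count(*), sum(count) and the like in SQL. In strict standard SQL, when a query aggregates, every non-aggregated column must appear in the group by.
In Trident this is usually done with aggregate, partitionAggregate, or persistentAggregate, combined with groupBy.

To apply several aggregations to the same batch of data:
Use chainedAgg together with chainEnd to run several aggregations over a group at once, as follows: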
.chainedAgg()
.partitionAggregate(new Fields("url"), new Count(), new Fields("url_cnt"))
.partitionAggregate(new Fields("byte"), new Sum(), new Fields("bytes_sum"))
.chainEnd()
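Note:
chainEnd filters the Fields: the input Fields are no longer kept, whereas partitionAggregate on its own does not filter Fields.
In this example the output Fields contain only url_cnt and bytes_sum, not url and byte. Other columns (those not processed by partitionAggregate) are unaffected.
Usually partitionAggregate is used together with groupBy, and after filtering only the groupBy columns and the columns produced by partitionAggregate remain.

(End of the aggregation section.)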

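As mentioned earlier, intermediate results of a computation can be kept in state.
There are three kinds of state: non-transactional, repeat-transactional (called simply transactional in the Trident documentation), and opaque-transactional.

State supports two kinds of operation:
QueryFunction: query
StateUpdater: update
QueryFunction
How a QueryFunction runs: the inputs are passed to batchRetrieve, which does the lookup and returns a List.
For example: stateQuery(locations, new Fields("userid"), new QueryLocation(), new Fields("location"))
This looks up location by user id: the input is a list of user ids (new Fields("userid")) and the output is a list of user locations (new Fields("location")).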

public class QueryLocation extends BaseQueryFunction<LocationDB, String> {

    // Look up the location for every userid in the batch.
    public List<String> batchRetrieve(LocationDB state, List<TridentTuple> inputs) {
        List<String> ret = new ArrayList<String>();
        for (TridentTuple input : inputs) {
            ret.add(state.getLocation(input.getLong(0)));
        }
        return ret;
    }

    // Emit the location that was retrieved for this tuple.
    public void execute(TridentTuple tuple, String location, TridentCollector collector) {
        collector.emit(new Values(location));
    }
}

A QueryFunction has two methods:
List<T> batchRetrieve(S state, List<TridentTuple> inputs): queries the State (or does other work) for the list of inputs and returns a List
void execute: emits the result
StateUpdater
The update is performed in the updateState method.
For example: .partitionPersist(new LocationDBFactory(), new Fields("userid", "location"), new LocationUpdater())

public class LocationUpdater extends BaseStateUpdater<LocationDB> {

    // Collect the ids and locations of the whole batch, then write them in one bulk call.
    public void updateState(LocationDB state, List<TridentTuple> tuples, TridentCollector collector) {
        List<Long> ids = new ArrayList<Long>();
        List<String> locations = new ArrayList<String>();
        for (TridentTuple t : tuples) {
            ids.add(t.getLong(0));
            locations.add(t.getString(1));
        }
        state.setLocationsBulk(ids, locations);
    }
}

The partitionPersist call above is what performs the update.

V. Other notes

partitionBy must be used before partitionPersist.

TridentUtils.fieldsUnion computes the union of several Fields. (Unlike fieldsConcat, fieldsUnion removes duplicate fields.)
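
The difference between a union and a concatenation of field lists is easy to show with plain lists of field names. This sketch does not call TridentUtils: the class FieldsSketch and its two methods are hypothetical, and keeping first-seen order in the union is a choice of the sketch rather than a claim about Trident.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FieldsSketch {
    // Concatenation keeps every field name, duplicates included.
    static List<String> concat(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Union keeps each field name once (here in first-seen order).
    static List<String> union(List<String> a, List<String> b) {
        Set<String> seen = new LinkedHashSet<>(a);
        seen.addAll(b);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("userid", "location");
        List<String> right = Arrays.asList("location", "ts");
        System.out.println("concat: " + concat(left, right));
        System.out.println("union:  " + union(left, right));
    }
}
```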

================================================================================

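Accessing HDFS from Storm

From: Big Data Technology | Author: 李刚锐 | 2014-08-13 11:32

I. Hadoop client configuration

Bundle the Hadoop jars into the Storm package, or add them to Storm's lib directory.

Make core-site.xml, mapred-site.xml, and hdfs-site.xml available as well, so that the Hadoop Configuration can be initialized inside Storm.

II. Security authentication

Read in the keytab file and convert it to a serializable byte array, so that it can be passed between spouts and bolts.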

BufferedInputStream in = new BufferedInputStream(new FileInputStream(keytabFile));

ByteArrayOutputStream out = new ByteArrayOutputStream(1024);

byte[] temp = new byte[1024];

int size = 0;

while ((size = in.read(temp)) != -1) {

out.write(temp, 0, size);

}

in.close();

this.priniciple = principle;

this.keytabContent = out.toByteArray();

At authentication time, write the byte array to a temporary file and use that file for the Kerberos login.

Configuration hadoopConf = new Configuration();

//hadoopConf.set(FS_DEFAULT_NAME_KEY, this.fsName);

hadoopConf.set("hadoop.security.authentication", "kerberos");

UserGroupInformation.setConfiguration(hadoopConf);

//UserGroupInformation.loginUserFromKeytab(principle, keytab);

InputStream keytabFile = new ByteArrayInputStream(this.keytabContent);

File temp = File.createTempFile("stream_sql", "keytab");

temp.deleteOnExit();

IOUtils.copyBytes(keytabFile, new FileOutputStream(temp), 1024, true);

UserGroupInformation.loginUserFromKeytab(this.priniciple, temp.getAbsolutePath());

//remove the temp file

temp.delete();
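
The byte-array round trip itself needs nothing from Hadoop and can be tried in isolation. The sketch below covers only that part (file to byte[] on the client, byte[] back to a temporary file on the worker) using java.nio.file; the class KeytabBytesSketch and its method names are hypothetical, and no Kerberos login is performed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class KeytabBytesSketch {
    // Client side: load the keytab into a byte[] that can live in a serializable field.
    static byte[] load(Path keytabFile) {
        try {
            return Files.readAllBytes(keytabFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Worker side: write the bytes to a temp file so that a path-based API can read them.
    static Path materialize(byte[] keytabContent) {
        try {
            Path temp = Files.createTempFile("stream_sql", "keytab");
            Files.write(temp, keytabContent);
            return temp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] placeholder = {1, 2, 3};  // placeholder bytes, not a real keytab
        Path temp = materialize(placeholder);
        try {
            System.out.println(load(temp).length + " bytes restored at " + temp);
        } finally {
            Files.deleteIfExists(temp);  // remove the temp file once it has been used
        }
    }
}
```

Deleting the temporary file in a finally block means the keytab copy is removed even when the login step throws.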

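III. LZO codec issues

Bundle hadoop-lzo into the Storm package, or add hadoop-lzo to Storm's lib directory.

Set LD_LIBRARY_PATH (adding HADOOP_HOME/lib/native/Linux-amd64-64) so that the native GPL library can be loaded.

In the Storm configuration, set java.library.path and add the LZO path to it.

IV. Multiple nodes reading different blocks of one file at the same time

Use the same approach as map/reduce (InputSplit).

Make InputSplit serializable by putting it in a wrapper class that implements readObject and writeObject.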

private void writeObject(ObjectOutputStream s) throws IOException {

s.defaultWriteObject();

new ObjectWritable(this.writable).write(s);

}

private void readObject(ObjectInputStream ois) throws Exception {

ois.defaultReadObject();

ObjectWritable obj = new ObjectWritable();

obj.setConf(new JobConf());

obj.readFields(ois);

this.writable = (T) obj.get();

}

public T get() {

return this.writable;

}
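
The same wrapper technique can be tried without Hadoop on the classpath. In the sketch below a hypothetical FileRange class stands in for InputSplit, and the wrapper writes its fields by hand instead of going through ObjectWritable; all class and field names are assumptions made for illustration. The wrapped field is marked transient so that defaultWriteObject skips it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SplitWrapperSketch {
    // Hypothetical stand-in for an InputSplit: not Serializable on its own.
    static class FileRange {
        final String path;
        final long start;
        final long length;

        FileRange(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Serializable wrapper that writes and reads the wrapped object's fields itself.
    static class SerializableRange implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FileRange range;

        SerializableRange(FileRange range) {
            this.range = range;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeUTF(range.path);
            out.writeLong(range.start);
            out.writeLong(range.length);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // Fields are read back in the order they were written.
            range = new FileRange(in.readUTF(), in.readLong(), in.readLong());
        }

        FileRange get() {
            return range;
        }
    }

    // Serializes the wrapper to bytes and back, as emitting it in a tuple would.
    static SerializableRange roundTrip(SerializableRange wrapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableRange) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FileRange copy = roundTrip(new SerializableRange(new FileRange("/data/part-0", 0L, 64L))).get();
        System.out.println(copy.path + " [" + copy.start + ", +" + copy.length + ")");
    }
}
```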

Creating the InputSplit array

String path = tuple.getString(0);

Configuration hConf = new Configuration();

JobConf jobConf = new JobConf(hConf);

//read the file path

FileInputFormat.addInputPath(jobConf, new Path(path));

jobConf.setInputFormat(TextInputFormat.class);

TextInputFormat input = new TextInputFormat();

input.configure(jobConf);

InputSplit[] splits = input.getSplits(jobConf, 2);

if (splits != null) {

for (InputSplit split: splits) {

collector.emit(new Values(new SerializeWritable<InputSplit>(split)));

}

}

Processing the Split messages in parallel

SerializeWritable<InputSplit> split = (SerializeWritable<InputSplit>)tuple.get(0);

if (split == null) {

return;

}

TextInputFormat input = new TextInputFormat();

JobConf jobConf = new JobConf();

input.configure(jobConf);

try {

RecordReader<LongWritable, Text> r = input.getRecordReader(split.get(), jobConf, Reporter.NULL);

LongWritable key = new LongWritable();

Text val = new Text();

while(r.next(key, val)) {

collector.emit(new Values(val.toString()));

}

r.close();

} catch (IOException e) {

e.printStackTrace();

}

================================================================================

0 0

big data相关的技术文章

Spark简介

Spark部署

Spark编码

内存计算的优势

一、 Storm

二、 Topology

三、 Storm Trident

四、 关于聚合操作：

五、 其他注意事项：

一、 Hadoop客户端配置

二、 Security验证

三、 LZO编码问题

四、 多节点同时读取一个文件多个block

四、关于聚合操作：

五、其他注意事项：

四、多节点同时读取一个文件多个block