Data Loading

Data sources: direct imports (ERP/CRM) and continuously generated new data (web click logs, mobile devices, sensors)
ETL: extract --> transform --> load
1. Structured data loading – Apache Sqoop
Core techniques of Sqoop (SQL to Hadoop): indexing, compression
Distinguishing features of Sqoop: parallelism, extensibility
Sqoop usage:
import
Incremental Import
Importing Only New Data
Incrementally Importing Mutable Data
Free-Form Query Import
export
Updating an Existing Data Set
Exporting into a Subset of Columns
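
The usage patterns listed above map onto `sqoop import` and `sqoop export` command lines. The following is a minimal sketch that only builds the argument lists, so it runs without a Sqoop installation. The helper function names, the JDBC URL, the table and column names, and the HDFS paths are all hypothetical; the flags are the ones Sqoop 1.x documents for these cases.

```python
from typing import List, Optional, Sequence


def sqoop_import_args(connect: str, table: str, target_dir: str,
                      incremental: Optional[str] = None,
                      check_column: Optional[str] = None,
                      last_value: Optional[str] = None) -> List[str]:
    """Build the argument list for a table import, optionally incremental."""
    args = ["sqoop", "import", "--connect", connect,
            "--table", table, "--target-dir", target_dir]
    if incremental is not None:
        # "append" picks up only new rows; "lastmodified" also picks up
        # rows that changed since the last run (mutable data).
        if incremental not in ("append", "lastmodified"):
            raise ValueError("incremental must be 'append' or 'lastmodified'")
        if check_column is None or last_value is None:
            raise ValueError("incremental imports need check_column and last_value")
        args += ["--incremental", incremental,
                 "--check-column", check_column,
                 "--last-value", last_value]
    return args


def sqoop_query_import_args(connect: str, query: str, split_by: str,
                            target_dir: str) -> List[str]:
    """Build the argument list for a free-form query import."""
    # Sqoop substitutes $CONDITIONS with a per-mapper range predicate.
    if "$CONDITIONS" not in query:
        raise ValueError("a free-form query must contain $CONDITIONS")
    return ["sqoop", "import", "--connect", connect, "--query", query,
            "--split-by", split_by, "--target-dir", target_dir]


def sqoop_export_args(connect: str, table: str, export_dir: str,
                      update_key: Optional[str] = None,
                      columns: Optional[Sequence[str]] = None) -> List[str]:
    """Build the argument list for an export from HDFS into a table."""
    args = ["sqoop", "export", "--connect", connect,
            "--table", table, "--export-dir", export_dir]
    if update_key is not None:
        # Update existing rows matched on this column instead of inserting.
        args += ["--update-key", update_key]
    if columns:
        # Export into a subset of the table's columns.
        args += ["--columns", ",".join(columns)]
    return args


DB = "jdbc:mysql://db.example.com/shop"  # hypothetical database

new_rows = sqoop_import_args(DB, "orders", "/data/orders",
                             incremental="append",
                             check_column="id", last_value="1000")
changed_rows = sqoop_import_args(DB, "customers", "/data/customers",
                                 incremental="lastmodified",
                                 check_column="updated_at",
                                 last_value="2024-01-01 00:00:00")
joined = sqoop_query_import_args(
    DB, "SELECT o.id, c.name FROM orders o JOIN customers c "
        "ON o.customer_id = c.id WHERE $CONDITIONS",
    split_by="o.id", target_dir="/data/order_names")
update = sqoop_export_args(DB, "orders", "/data/orders_out",
                           update_key="id", columns=["id", "status"])

print(" ".join(new_rows))
```

On a machine where Sqoop is installed, each list could be handed to `subprocess.run`; credentials options are left out of the sketch.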

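2. Log data loading – Apache Flume
What it is:
A distributed system for transporting and aggregating event or log data;
Designed mainly to feed data continuously into Hadoop;
Can move large volumes of streaming event data from one location to another, or from web servers to a Hadoop cluster.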
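Overview:
Flume agents (installed on every node; collect events);
Interceptors (filter out unwanted events; decorate events with metadata such as a timestamp, host name, or static tag);
Encryption (events stored in files on disk are encrypted);
Events are sent on to the next-tier agent;
Compression is supported;
Events can be multiplexed to multiple agents (to spread the load);
Events can be replicated for redundancy (to survive the permanent loss of an agent, disk, or node);
Events can be delivered in failover mode (to guard against a failed next-tier agent; the next agent in a priority list is tried);
Events can be load balanced (round robin, random, or custom; the load is spread across different next-tier agents);
Events can be buffered (in memory for performance; on disk for durability);
Events are finally delivered to HDFS (multiple file formats; compression supported);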
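
The failover and round-robin delivery modes in the overview can be illustrated with a small sketch. This is a generic model of the two selection policies, not Flume's implementation; the agent names and the `is_alive` health check are hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterator, List, Optional


def pick_failover(agents_by_priority: List[str],
                  is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first live agent in priority order, or None if all are down."""
    for agent in agents_by_priority:
        if is_alive(agent):
            return agent
    return None


def round_robin(agents: List[str]) -> Iterator[str]:
    """Yield agents in a repeating cycle so successive batches are spread evenly."""
    return cycle(agents)


# Hypothetical next-tier agents, listed from highest to lowest priority.
next_hops = ["collector-a", "collector-b", "collector-c"]
down = {"collector-a"}  # pretend the preferred agent has failed

# Failover: the highest-priority agent is down, so the next one is chosen.
chosen = pick_failover(next_hops, lambda agent: agent not in down)

# Load balancing: successive batches go to successive agents.
rr = round_robin(next_hops)
first_four = [next(rr) for _ in range(4)]

print(chosen)
print(first_four)
```

Failover keeps all traffic on one preferred agent until it fails, while round robin uses every agent all the time; which one fits depends on whether the goal is redundancy or spreading load.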
Flume features: distributed, scalable, reliable, manageable
Flume architecture:

• An input is called a source and an output is called a sink.
• A channel provides the glue between a source and a sink.
• All of these run inside a daemon called an agent.

• A source writes events to one or more channels (one source can feed several channels).
• A channel is the holding area as events are passed from a source to a sink.
• A sink receives events from one channel only.
• An agent can have many sources, channels, and sinks.
Event (an event can carry multiple headers)
Source (example: Spooling Directory Source)
Channel(for buffering incoming events)
memory channel
file channel ( Events stored in file on disk )
Sink (removes events from a channel and forwards them to their next destination)
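
The relationships between event, source, channel, and sink can be made concrete with a toy model. This is a minimal in-process sketch, not Flume code: the class names mirror the concepts above, while the capacity value, the header contents, and the log line are made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class Event:
    """A payload plus any number of string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._queue: Deque[Event] = deque()

    def put(self, event: Event) -> None:
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self) -> Optional[Event]:
        """Remove and return the oldest event, or None if the channel is empty."""
        return self._queue.popleft() if self._queue else None


class Source:
    """Writes every event to one or more channels."""

    def __init__(self, channels: List[MemoryChannel]) -> None:
        self.channels = channels

    def emit(self, event: Event) -> None:
        # The sketch places the same object in each channel.
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Reads from exactly one channel and forwards events onward."""

    def __init__(self, channel: MemoryChannel) -> None:
        self.channel = channel
        self.delivered: List[Event] = []  # stands in for the next destination

    def drain(self) -> int:
        """Move all buffered events to the destination; return how many moved."""
        count = 0
        while True:
            event = self.channel.take()
            if event is None:
                return count
            self.delivered.append(event)
            count += 1


# One source fanning out to two channels, each drained by its own sink.
ch_a, ch_b = MemoryChannel(capacity=100), MemoryChannel(capacity=100)
source = Source([ch_a, ch_b])
sink_a, sink_b = Sink(ch_a), Sink(ch_b)

source.emit(Event(b"GET /index.html", {"host": "web01"}))
moved_a, moved_b = sink_a.drain(), sink_b.drain()
print(moved_a, moved_b)
```

A file channel would differ from this memory channel only in where the queue lives: events would be written to disk so that they survive a process restart, at the cost of throughput.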
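You should be able to work out the number of tiers from the number of agents per tier (4-16 agents), and to size the channel capacity so that it meets the requirements.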
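
A rough sizing calculation might look like the sketch below. It rests on assumptions the notes do not state: the 4-16 figure is read as the number of upstream agents feeding one downstream agent, tiers are added until a single agent remains, and a channel must hold everything that arrives while its downstream hop is unavailable. The function names and the example numbers are hypothetical.

```python
from typing import List


def tier_sizes(first_tier_agents: int, fan_in: int) -> List[int]:
    """Agents per tier, from the edge tier down to a single final agent.

    fan_in is the assumed number of upstream agents per downstream agent.
    """
    if first_tier_agents < 1 or fan_in < 2:
        raise ValueError("need at least one agent and a fan-in of 2 or more")
    sizes = [first_tier_agents]
    while sizes[-1] > 1:
        # Integer ceiling division: enough agents to cover the tier above.
        sizes.append((sizes[-1] + fan_in - 1) // fan_in)
    return sizes


def channel_capacity(events_per_second: int, outage_seconds: int) -> int:
    """Events a channel must buffer to ride out a downstream outage."""
    return events_per_second * outage_seconds


wide = tier_sizes(1000, fan_in=16)   # aggressive consolidation
narrow = tier_sizes(1000, fan_in=4)  # conservative consolidation
capacity = channel_capacity(events_per_second=500, outage_seconds=600)

print(wide, len(wide))      # [1000, 63, 4, 1] 4
print(narrow, len(narrow))  # [1000, 250, 63, 16, 4, 1] 6
print(capacity)             # 300000
```

A larger fan-in means fewer tiers and fewer hops, but each downstream agent then has to absorb more traffic, which in turn raises the channel capacity it needs.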

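3. Distributed publish-subscribe messaging system – Apache Kafka
• A producer publishes messages to a topic
• A consumer subscribes to the messages of a topic
• Whenever a new message arrives for that topic
• the broker delivers it to every consumer subscribed to it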
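
The publish-subscribe flow above can be sketched with a toy in-memory broker. This is not the Kafka client API: `ToyBroker` and its methods are hypothetical, and the model keeps one append-only log per topic plus a read position per subscriber, so that every subscriber sees every message published after it subscribed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyBroker:
    """Keeps one append-only message log per topic and one offset per subscriber."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[str]] = defaultdict(list)
        self._offsets: Dict[Tuple[str, str], int] = {}

    def publish(self, topic: str, message: str) -> None:
        """Producer side: append a message to the topic's log."""
        self._logs[topic].append(message)

    def subscribe(self, consumer: str, topic: str) -> None:
        """Start the consumer at the end of the log, so it sees only new messages."""
        self._offsets[(consumer, topic)] = len(self._logs[topic])

    def poll(self, consumer: str, topic: str) -> List[str]:
        """Return the messages this consumer has not seen yet and advance its offset."""
        start = self._offsets[(consumer, topic)]
        messages = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = start + len(messages)
        return messages


broker = ToyBroker()
broker.subscribe("billing", "orders")  # hypothetical consumer and topic names
broker.subscribe("audit", "orders")

broker.publish("orders", "order-1")
broker.publish("orders", "order-2")

billing_batch = broker.poll("billing", "orders")
audit_batch = broker.poll("audit", "orders")
billing_again = broker.poll("billing", "orders")

print(billing_batch, audit_batch, billing_again)
```

Because messages stay in the log after being read, a slow subscriber does not hold back a fast one; each simply advances its own offset.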
Features: strong durability, scalability, and fault tolerance

Kafka compared with Flume
Compared to Flume, Kafka wins on its superb scalability and message durability.
Kafka can handle more than 100k events per second coming from producers.
Kafka also supports different consumption models. By contrast, a Flume sink supports a push model.