Distributed Data Collection: Flume Principles and Applications
Part 1: Background
Common open-source data collection systems:
- Unstructured log (data) collection
  - Flume
- Structured log (data) collection
  - Sqoop: full imports
  - Canal (Alibaba): incremental imports
  - Databus (LinkedIn): incremental imports
Part 2: Flume (NG) Overview
- Event
  - Flume transfers data in units called events
  - An event consists of a header plus a byte array carrying the payload
  - The header is a map of key/value pairs that can be extended for contextual routing
- Client
  - A client is an entity that wraps raw log records into events and sends them to one or more agents
  - A client is optional, not required
- Agent
  - Source
    - Receives or generates events and delivers them in batches to one or more channels
    - Source types:
      - Sources integrating with existing systems: Syslog, NetCat
      - Sources generating events from command output: Exec
      - Sources watching a directory for file changes: Spooling Directory Source, Taildir Source
      - IPC sources for agent-to-agent communication: Avro, Thrift
  - Channel
    - Sits between the source and the sink, buffering events
    - Supports transactions
    - Channel types:
      - Memory Channel: fast but volatile
      - File Channel: durable, backed by local disk
      - JDBC Channel: durable, backed by an embedded database
  - Sink
    - Delivers events to the next hop or the final destination, removing them from the channel on success
    - Sink types:
      - Sinks storing events in a terminal system: HDFS, HBase
      - Sinks that simply consume and discard events: Null Sink
      - IPC sinks for agent-to-agent communication: Avro
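The three components above are wired together in a properties file. As a sketch, a minimal hypothetical agent (named a1 here, with an illustrative port — both made up for this example) that echoes netcat input to its own log could be declared like this:

```properties
# Hypothetical minimal agent: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer (volatile; events are lost on restart)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events to Flume's own log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Note the wiring convention: the source lists its channels, and the sink names the single channel it drains.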
Part 3: Sqoop Overview
- A bridge between traditional relational databases and Hadoop
  - Imports data from a relational database into the Hadoop ecosystem
  - Exports data from the Hadoop ecosystem back into a relational database
- Uses MapReduce to parallelize and speed up data transfer
- Transfers data in batch mode
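The two directions map onto two Sqoop subcommands. A hedged sketch — the database host, credentials, and table names below are made up for illustration, and the commands need a live database and Hadoop cluster to actually run:

```shell
# Import: copy table "orders" from MySQL into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username hadoop -P \
  --table orders \
  --target-dir /sqoop/orders \
  --num-mappers 4

# Export: push HDFS files back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username hadoop -P \
  --table orders_summary \
  --export-dir /sqoop/orders_summary
```

Each map task transfers one slice of the table, which is how the MapReduce layer speeds up the batch transfer.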
Part 4: CDC (Change Data Capture) Overview
- Canal: Alibaba's tool for incremental subscription and consumption of MySQL binlogs
- Databus: LinkedIn's low-latency change data capture pipeline
Part 5: Hands-on Project
1. Project description
Collect the content continuously appended to a file named record.list into the Hadoop cluster. The source is an exec source, the channel is a file channel, and the sink is an HDFS sink.
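For local experimentation, the continuously-growing record.list can be simulated with a tiny generator loop. The /tmp path and record format here are made up for illustration; the real project tails /home/hadoop/hadooptraining/datasource/record.list:

```shell
# Hypothetical generator: append a few records to a list file,
# mimicking the continuously-growing file tailed by the exec source
LIST=/tmp/record.list
: > "$LIST"                         # start from an empty file
for i in 1 2 3; do
  echo "record-$i $(date +%s)" >> "$LIST"
done
wc -l < "$LIST"                     # number of records appended
```

Running the loop again while the agent is up is a quick way to watch events flow through to HDFS.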
2. Install Flume
[hadoop@hadoopa ~]$ tar -zxvf apache-flume-1.7.0-bin.tar.gz
三,配置flume
- 配置 flume-conf-logAnalysis.properties
[hadoop@hadoopa conf]$ pwd
/home/hadoop/apache-flume-1.7.0-bin/conf
[hadoop@hadoopa conf]$ vi flume-conf-logAnalysis.properties
[hadoop@hadoopa conf]$ cat flume-conf-logAnalysis.properties
# (Apache license header omitted)

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'logAgent'
logAgent.sources = logSource
logAgent.channels = fileChannel
logAgent.sinks = hdfsSink

# For each one of the sources, the type is defined
logAgent.sources.logSource.type = exec
logAgent.sources.logSource.command = tail -F /home/hadoop/hadooptraining/datasource/record.list
# The channel can be defined as follows.
logAgent.sources.logSource.channels = fileChannel

# Each sink's type must be defined
logAgent.sinks.hdfsSink.type = hdfs
logAgent.sinks.hdfsSink.hdfs.path = hdfs://hadoopA:8020/flume/record/%Y-%m-%d/%H%M
logAgent.sinks.hdfsSink.hdfs.filePrefix = transaction_log
logAgent.sinks.hdfsSink.hdfs.rollInterval = 600
logAgent.sinks.hdfsSink.hdfs.rollCount = 10000
logAgent.sinks.hdfsSink.hdfs.rollSize = 0
logAgent.sinks.hdfsSink.hdfs.round = true
logAgent.sinks.hdfsSink.hdfs.roundValue = 10
logAgent.sinks.hdfsSink.hdfs.roundUnit = minute
logAgent.sinks.hdfsSink.hdfs.fileType = DataStream
logAgent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# Specify the channel the sink should use
logAgent.sinks.hdfsSink.channel = fileChannel

# Each channel's type is defined.
logAgent.channels.fileChannel.type = file
logAgent.channels.fileChannel.checkpointDir = /home/hadoop/apache-flume-1.7.0-bin/dataCheckpointDir
logAgent.channels.fileChannel.dataDirs = /home/hadoop/apache-flume-1.7.0-bin/dataDir

# Other config values specific to each type of channel (sink or source)
# can be defined as well
- Edit flume-env.sh
[hadoop@hadoopa conf]$ pwd
/home/hadoop/apache-flume-1.7.0-bin/conf
[hadoop@hadoopa conf]$ vi flume-env.sh
[hadoop@hadoopa conf]$ cat flume-env.sh
# (Apache license header omitted)

# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
export JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

# Let Flume write raw event data and configuration information to its log files for debugging
# purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# $JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "

# Foll. classpath will be included in Flume's classpath.
# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.3.jar"  # Example: "path1;path2;path3"
4. Run Flume
[hadoop@hadoopa conf]$ flume-ng agent \
  --conf /home/hadoop/apache-flume-1.7.0-bin/conf \
  --conf-file /home/hadoop/apache-flume-1.7.0-bin/conf/flume-conf-logAnalysis.properties \
  --name logAgent \
  -Dflume.root.logger=DEBUG,console \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=34545
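Because the launch command enables HTTP monitoring on port 34545, a running agent reports per-component counters (events received, channel fill, events drained) as JSON while it runs. A sketch, assuming the agent is up on the local host:

```shell
# Query the agent's HTTP monitoring endpoint (agent must be running)
curl http://localhost:34545/metrics
```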
5. Verify the results
Browse the collected files in HDFS through the NameNode web UI:
http://192.168.1.201:50070/explorer.html#/flume/record/
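The same check can be done from the shell, assuming the HDFS client is on the PATH of a node in the cluster:

```shell
# Recursively list the files the HDFS sink has rolled out,
# organized into the %Y-%m-%d/%H%M directories from the sink config
hdfs dfs -ls -R /flume/record/
```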