Installing and Using Sqoop

Source: Internet  Editor: 程序博客网  Time: 2024/05/14 22:28

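SQOOP --- the data mover
Sqoop can move external data into an HDFS directory, a Hive table, or an HBase table.
Convention: the installation directory is /opt/
Download: https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.6/
After downloading, extract the archive: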
    tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/
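Rename:
    mv /opt/sqoop-1.4.6.bin__hadoop-2.0.4-alpha /opt/sqoop
Add SQOOP_HOME to the environment variables, and put its bin directory on PATH so that the sqoop command can be found:
   export SQOOP_HOME=/opt/sqoop
   export PATH=$SQOOP_HOME/bin:$PATH
Configure $SQOOP_HOME/conf/sqoop-env.sh: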
    export HADOOP_COMMON_HOME=/opt/hadoop
    export HBASE_HOME=/opt/hbase
    export HIVE_HOME=/opt/hive
    export ZOOCFGDIR=/opt/zookeeper/conf
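Before running Sqoop it can help to confirm that the directories named in sqoop-env.sh really exist. The following is only a minimal sketch: the helper name missing_dirs is made up for illustration, and the paths are the ones assumed in this guide.

```shell
# Hypothetical helper: print every argument that is not an existing directory.
missing_dirs() {
  for d in "$@"; do
    if [ ! -d "$d" ]; then
      echo "$d"
    fi
  done
}

# Paths taken from sqoop-env.sh above; adjust them to the local layout.
missing_dirs /opt/hadoop /opt/hbase /opt/hive /opt/zookeeper/conf
```

If nothing is printed, all four directories are present.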
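Notes:
1. Database driver:
   Before running sqoop commands, copy the JDBC driver jar for your database into $SQOOP_HOME/lib. For MySQL, for example, use mysql-connector-java-5.1.32-bin.jar or later.
2. JDK version
   JDK 1.7 or later is recommended.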
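A quick way to confirm the driver is in place is to look for a matching jar under $SQOOP_HOME/lib. This is a minimal sketch: the function name has_mysql_driver is hypothetical, and the pattern mysql-connector-java-*.jar is an assumption based on the file name mentioned above.

```shell
# Hypothetical helper: succeed if the given lib directory holds a MySQL connector jar.
has_mysql_driver() {
  for jar in "$1"/mysql-connector-java-*.jar; do
    # If nothing matches, the glob stays literal and the -e test fails.
    if [ -e "$jar" ]; then
      return 0
    fi
  done
  return 1
}

# Falls back to /opt/sqoop, the install location assumed in this guide.
lib_dir="${SQOOP_HOME:-/opt/sqoop}/lib"
if has_mysql_driver "$lib_dir"; then
  echo "MySQL driver found in $lib_dir"
else
  echo "MySQL driver missing; copy the connector jar into $lib_dir"
fi
```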
import
  from an external source into HDFS
export
  from HDFS to an external target
Importing data from MySQL into HDFS
Data in the person table
  1,zhaoyuan,28,male
  2,xutiannan,20,male
  3,xiaomei,18,female
  4,xiaodingding,17,male
sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person
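  This imports the person table of the MySQL database bigdata_db into HDFS under /user/<username>/person, where person is the name of the imported table.
  That is Sqoop's default location; to import into a specific directory, add the --target-dir option: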
sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person --target-dir /output/sqoop/person

  By default Sqoop runs the import with 4 map tasks. To change that, append the -m option to the command:
sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person --target-dir /output/sqoop/person -m 2
  If the output directory already exists, the job fails. To write to that directory anyway, add --delete-target-dir, which removes it first:
sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person --target-dir /output/sqoop/person -m 2 --delete-target-dir
  To append new data to what is already there, add --append instead. Note that --append and --delete-target-dir cannot be used together:
sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person --target-dir /output/sqoop/person -m 2 --append
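Because --append and --delete-target-dir are mutually exclusive, a small wrapper can make the choice explicit. The sketch below only prints the command (a dry run) rather than executing it; the function name sqoop_import_cmd and its argument order are hypothetical, while the connection details are the ones used in the commands above.

```shell
# Hypothetical dry-run helper: print the sqoop import command instead of running it.
# Usage: sqoop_import_cmd <table> <target-dir> <mappers> <mode>
#   mode is "overwrite" (adds --delete-target-dir) or "append" (adds --append).
sqoop_import_cmd() {
  table=$1; target=$2; mappers=$3; mode=$4
  case "$mode" in
    overwrite) extra="--delete-target-dir" ;;
    append)    extra="--append" ;;
    *) echo "mode must be overwrite or append" >&2; return 1 ;;
  esac
  echo "sqoop import --connect jdbc:mysql://master:3306/bigdata_db" \
       "--username root --password root --table $table" \
       "--target-dir $target -m $mappers $extra"
}

sqoop_import_cmd person /output/sqoop/person 2 append
```

Any mode other than overwrite or append is rejected, so the two options can never end up in the same command.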
Conditional import:
  Import only the rows that satisfy a condition:
   Import the rows of person whose pname starts with xiao
   sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username root --password root --table person --target-dir /output/sqoop/person -m 1 --append --where "pname like 'xiao%'"
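In SQL, the pattern xiao% matches values that begin with xiao, whereas %xiao matches values that end with it. The following sketch emulates the prefix match on the sample rows listed earlier, using awk instead of a database, to show which rows such an import would select.

```shell
# Sample rows of the person table: pid,pname,page,pgender
rows='1,zhaoyuan,28,male
2,xutiannan,20,male
3,xiaomei,18,female
4,xiaodingding,17,male'

# Emulate "pname like 'xiao%'": keep rows whose second field starts with "xiao".
matched=$(printf '%s\n' "$rows" | awk -F, '$2 ~ /^xiao/')
printf '%s\n' "$matched"
```

Only rows 3 and 4 are printed.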
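  Import via an SQL query (--query requires --target-dir; --split-by takes a column name and is only needed with more than one map task):
   sqoop import --connect jdbc:mysql://master:3306/bigdata_db --username 'root' --password 'root' --query "select pid, pname, page, pgender from person where page=18 and \$CONDITIONS" --target-dir /output/sqoop/person -m 1 --append --fields-terminated-by ","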
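The backslash in \$CONDITIONS matters: Sqoop needs to receive the literal token $CONDITIONS, which it replaces with its own condition for each map task, but inside double quotes the shell would try to expand it as a variable first. A small sketch of the two safe ways to quote the query (the variable names q_double and q_single are just for illustration):

```shell
# Inside double quotes the shell would expand $CONDITIONS, so it is escaped.
q_double="select pid, pname from person where page=18 and \$CONDITIONS"

# Inside single quotes nothing is expanded, so no backslash is needed.
q_single='select pid, pname from person where page=18 and $CONDITIONS'

# Both strings reach Sqoop with the literal token intact.
printf '%s\n' "$q_double" "$q_single"
```

With double quotes and no backslash, an unset CONDITIONS variable would expand to nothing and the token would be lost before Sqoop sees the query.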
MySQL ----> Hive
F. Hive import
  sqoop import --connect jdbc:mysql://master:3306/test --username 'root' --password 'root' --table htest --hive-import -m 1
  If the table does not exist in Hive, it is created.
Overwrite the data (only the data, not the table structure)
  sqoop import --connect jdbc:mysql://master:3306/test --username 'root' --password 'root' --table htest --hive-import -m 1 --hive-overwrite
Specify the Hive table name
  sqoop import --connect jdbc:mysql://master:3306/test --username 'root' --password 'root' --table htest --hive-import -m 1 --hive-table "htest_import" --hive-overwrite
Import all tables into Hive
  sqoop import-all-tables --connect jdbc:mysql://master:3306/test --username root --password root --hive-import --fields-terminated-by "\001" --lines-terminated-by "\n"
H: Import data into HBase
  sqoop import --connect jdbc:mysql://master:3306/test --username 'root' --password 'root' --table person --hbase-create-table --hbase-row-key id --hbase-table htest --column-family cf
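  When importing into HBase, the primary key is used as the row key by default; if the table has no primary key, --split-by is used. Composite primary keys cannot be handled for now. It is best to create the table structure in HBase beforehand.
  Because the HBase 1.1.5 used here is not compatible with Sqoop 1.4.6, this HBase import fails.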
-----------------------------------------
export
Export to the MySQL table test
sqoop export --connect jdbc:mysql://master:3306/test --username root --password root --table test --export-dir /output/sqoop/person
The export fails with an exception --> MySQLSyntaxErrorException: Table 'bigdata_db.test' doesn't exist
so the table test has to be created first:
CREATE TABLE `test` (
   `pid` int(11) NOT NULL AUTO_INCREMENT,
   `pname` varchar(20) COLLATE utf8_bin NOT NULL,
   `page` int(11) NOT NULL,
   `pgender` varchar(10) COLLATE utf8_bin NOT NULL,
   PRIMARY KEY (`pid`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

A. Export to MySQL (comma is the default field separator)
   The fields must match the table columns one to one when exporting
  sqoop export --connect jdbc:mysql://master:3306/test --username root --password root --table test --export-dir /export
   If Chinese characters come out garbled, set the encoding in the JDBC URL:
  sqoop export --connect "jdbc:mysql://master:3306/test?useUnicode=true&characterEncoding=utf-8" --username root --password root --table test --export-dir /export
B. Insert or update
Update the row if it exists, insert it otherwise
  sqoop export --connect "jdbc:mysql://master:3306/test?useUnicode=true&characterEncoding=utf-8" --username root --password root --table test --export-dir /export -m 1 --update-key pid --update-mode allowinsert
C. Specify the separator
  As with import: --input-fields-terminated-by tells Sqoop how to parse the HDFS data being loaded into the database
D. From Hive to MySQL
  sqoop export --connect jdbc:mysql://master:3306/test --username root --password root --table htest --export-dir /user/hive/warehouse/htest --input-fields-terminated-by '\001'
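The '\001' above is the Ctrl-A character, Hive's default field separator, which is invisible when a warehouse file is printed to a terminal. As a small illustration (the sample row is taken from the person data earlier, not from the htest table), the separator can be made visible by translating it to a comma:

```shell
# One row as Hive would store it by default: fields separated by \001 (Ctrl-A).
row=$(printf '3\001xiaomei\00118\001female')

# Replace the invisible separator with commas to make the fields visible.
visible=$(printf '%s' "$row" | tr '\001' ',')
echo "$visible"
```

The same tr filter can be applied to the output of a command that prints a warehouse file when checking what an export is about to read.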
=================================================================================
FLUME
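A distributed, highly reliable tool for collecting massive volumes of log data.
It has three core components:
Source  ----->defines where the data comes from
Channel  ----->buffers the data
Sink  ----->where the data ends up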

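System requirements:
1. JRE: JDK 1.6+ (1.7 recommended)
2. Memory: no fixed upper or lower limit; enough to hold the configured sources, channels, and sinks
3. Disk space: same as 2
4. Directory permissions: the agent needs read and write permission on the directories it uses
The Flume version used here is 1.6.0, the latest at the time of writing. Download: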
http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
Installation steps:
Extract: opt]# tar -zxvf soft/apache-flume-1.6.0-bin.tar.gz
Rename: opt]# mv apache-flume-1.6.0-bin flume
Add it to the environment variables
  vim /etc/profile.d/hadoop-eco.sh
  export FLUME_HOME=/opt/flume
  export PATH=$FLUME_HOME/bin:$PATH
Edit the configuration file
  conf]# cp flume-env.sh.template flume-env.sh
Add JAVA_HOME
  export JAVA_HOME=/opt/jdk
------------------------------------------------------------------------------
First Flume agent example
Define the Flume agent configuration file

$FLUME_HOME/conf/example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent.
# The agent is named a1; its source is r1, its sink is k1, and its channel is c1.
# Comments are kept on their own lines because a properties file treats text
# after a value as part of that value.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source.
# netcat source: reads lines of text from a TCP socket,
# listening on hostname master, port 44444.
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 44444

# Describe the sink.
# logger sink: writes events to the log via log4j; the level and destination
# come from flume.root.logger (for example INFO,console).
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory.
# The memory channel is the most common type. capacity is the maximum number of
# events held in the channel; transactionCapacity is the maximum per transaction.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel, so that source, channel, and sink
# form a pipeline.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
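One thing to watch in files like this: Flume loads the configuration as a Java properties file, where # starts a comment only at the beginning of a line, so any text that follows a value on the same line becomes part of the value. Below is a minimal sketch for spotting such lines; the function name find_inline_comments is hypothetical, and the check is a plain pattern match, so it would also flag a value that legitimately contains a # character.

```shell
# Hypothetical check: print "key = value # text" lines (with line numbers),
# which a properties parser would read as a value containing "# text".
find_inline_comments() {
  grep -nE '^[[:space:]]*[^#[:space:]][^#]*=[^#]*#' "$1"
}

# Falls back to /opt/flume, the install location assumed in this guide.
conf="${FLUME_HOME:-/opt/flume}/conf/example.conf"
if [ -f "$conf" ] && find_inline_comments "$conf"; then
  echo "Move the comments listed above onto their own lines." >&2
fi
```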

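Start the Flume agent

Run the script:
bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console
Short form:
bin/flume-ng agent -c conf -f conf/example.conf --name a1 -Dflume.root.logger=INFO,console
To test it, install the nc-xxxx.rpm package that was handed out,
then run nc master 44444 and type some lines; they should show up as events on the agent's console.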
----------------------------------------------------------------
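The difference between tail -f and tail -F
http-xxx.log
http-xxx.log.2017-03-15
http-xxx.log.2017-03-16
When the log is rotated into dated files like these, -f keeps following the old file and misses the newly created http-xxx.log, while -F follows the file name and keeps reading the new file.
Example 2:
Use Flume to watch a file for changes; whenever new content is appended, collect it and print it to the console.
$FLUME_HOME/conf/exec-file.conf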

# Name the components on this agent
# The agent defined here is named a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/logs/http-flume.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Example 3:
Use Flume to watch a directory; whenever a new file appears, print its contents to the console.
$FLUME_HOME/conf/exec-dir.conf
# Name the components on this agent
# The agent defined here is named a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/data/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
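According to the Flume user guide, the spooling directory source expects a file to be complete and unchanged once it appears in the spool directory, and it renames each file with a .COMPLETED suffix after reading it. A common way to respect that is to write the file somewhere else and then move it in. The sketch below assumes the staging and spool directories are on the same filesystem, so that mv is a rename; the function name drop_into_spool is hypothetical, and the demo uses temporary directories in place of /opt/data/logs.

```shell
# Hypothetical helper: stage a file outside the spool directory, then move it in,
# so the spooling source never sees a half-written file.
# Usage: drop_into_spool <spool-dir> <staging-dir> <file-name> <content>
drop_into_spool() {
  spool=$1; staging=$2; name=$3; content=$4
  printf '%s\n' "$content" > "$staging/$name" &&
    mv "$staging/$name" "$spool/$name"
}

# Demo with temporary directories; with the agent above, the spool
# directory would be /opt/data/logs.
demo_spool=$(mktemp -d)
demo_staging=$(mktemp -d)
drop_into_spool "$demo_spool" "$demo_staging" http-demo.log "GET /index.html 200"
ls "$demo_spool"
```

File names should also be unique over time, since a name that was already processed is not expected to show up again.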
Example 4:
Modify example 3 so that the files picked up are uploaded to the HDFS directory /input/flume/

$FLUME_HOME/conf/exec-hdfs.conf
# Name the components on this agent
# The agent defined here is named a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/data/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/input/flume/%y/%m
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 40
a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
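The %y and %m escapes in hdfs.path stand for the two-digit year and the two-digit month, the same meaning they have for the date command. As a rough preview of the directory the sink would write to right now (assuming the agent host's clock, which is what useLocalTimeStamp = true makes the sink use), the path can be expanded locally; the variable name hdfs_dir is just for illustration.

```shell
# Expand the same escapes with date(1): %y = two-digit year, %m = two-digit month.
hdfs_dir=$(date +"/input/flume/%y/%m")
echo "events would land under: hdfs://master:9000$hdfs_dir"
```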

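A few problems encountered along the way:
Expected timestamp in the Flume event headers, but it was null
  This is caused by the %y/%m/ escapes in /input/flume/%y/%m/.
  The events carry no timestamp header, so the sink has nothing to substitute into the path. Setting
  hdfs.useLocalTimeStamp = true (the default is false) makes the sink use the local time instead.
Next: local host is: "master/192.168.43.100"; destination host is: "slave01":9000
  The sink resolved the wrong HDFS host. Writing a1.sinks.k1.hdfs.path as a full URI fixes it:
  hdfs://master:9000/input/flume/%y/%m
One last problem:
  There is a timeout related to roundUnit and roundValue: operations appear to be expected to finish within the roundValue interval, and an error is reported when they take longer,
  so it is worth testing to find a suitable value and time unit.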

Flume and Kafka

0 0