Sqoop usage notes~~~

Source: Internet | Editor: 程序博客网 | Date: 2024/05/13 02:39



So what is Sqoop, anyway? Sqoop is simply a tool that wraps copying data back and forth between a relational database (RDB) and HDFS. Without it we could of course write a pile of scripts and run them by hand, but Sqoop is far more convenient, so...


Installation steps:

1. Download the distribution from http://archive.cloudera.com/cdh/3/ ; cdh3u3 is the usual choice these days.

2. Download the JDBC driver jar for your RDB; I tested with the MySQL driver.

3. After unpacking, put the MySQL driver jar into the lib directory of the Sqoop installation.

4. Configure environment variables. This assumes hadoop, hive, hbase, and zk are already installed and configured correctly; then only SQOOP_HOME needs to be set, because I found that bin/configure-sqoop already sets up

export SQOOP_CLASSPATH
export SQOOP_CONF_DIR
export SQOOP_JAR_DIR
export HADOOP_CLASSPATH
export HADOOP_HOME
export HBASE_HOME

SQOOP_HOME itself is referenced directly, so it must be set; otherwise the jars will not be found.

If hbase or zk is not installed, comment out the corresponding check statements in configure-sqoop.
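The environment setup described above can be sketched like this (installation paths are examples taken from or modeled on this walkthrough, not canonical locations; adjust them to your own install):

```shell
# Example environment setup, e.g. in ~/.bash_profile.
# /usr/local/hadoop matches the HADOOP_HOME seen in the logs below;
# the other paths are illustrative assumptions.
export HADOOP_HOME=/usr/local/hadoop
export HBASE_HOME=/usr/local/hbase
export SQOOP_HOME=/usr/local/sqoop       # the one variable Sqoop itself needs set
export PATH=$SQOOP_HOME/bin:$PATH        # so the `sqoop` command is on the PATH
```

With SQOOP_HOME set, configure-sqoop can derive the remaining variables on its own.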


Testing:

RDB to HDFS:

[hadoop@master cuirong]$ sqoop import --connect jdbc:mysql://master:3306/sqoop --username sqoop --password sqoop123 --table t_etl_cmd_relation -m 1
12/03/01 12:12:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/03/01 12:12:56 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/03/01 12:12:56 INFO tool.CodeGenTool: Beginning code generation
12/03/01 12:12:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_etl_cmd_relation` AS t LIMIT 1
12/03/01 12:12:56 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop
12/03/01 12:12:58 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/9ced3ca3be0a259acc993a93328c40c8/t_etl_cmd_relation.jar
12/03/01 12:12:58 WARN manager.MySQLManager: It looks like you are importing from mysql.
12/03/01 12:12:58 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
12/03/01 12:12:58 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
12/03/01 12:12:58 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
12/03/01 12:12:58 INFO mapreduce.ImportJobBase: Beginning import of t_etl_cmd_relation
12/03/01 12:13:00 INFO mapred.JobClient: Running job: job_201202212041_15467
12/03/01 12:13:01 INFO mapred.JobClient:  map 0% reduce 0%
12/03/01 12:13:06 INFO mapred.JobClient:  map 100% reduce 0%
12/03/01 12:13:06 INFO mapred.JobClient: Job complete: job_201202212041_15467
12/03/01 12:13:06 INFO mapred.JobClient: Counters: 16
12/03/01 12:13:06 INFO mapred.JobClient:   Job Counters
12/03/01 12:13:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4925
12/03/01 12:13:06 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/01 12:13:06 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/01 12:13:06 INFO mapred.JobClient:     Launched map tasks=1
12/03/01 12:13:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/01 12:13:06 INFO mapred.JobClient:   FileSystemCounters
12/03/01 12:13:06 INFO mapred.JobClient:     HDFS_BYTES_READ=87
12/03/01 12:13:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=56161
12/03/01 12:13:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1131
12/03/01 12:13:06 INFO mapred.JobClient:   Map-Reduce Framework
12/03/01 12:13:06 INFO mapred.JobClient:     Map input records=16
12/03/01 12:13:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=130781184
12/03/01 12:13:06 INFO mapred.JobClient:     Spilled Records=0
12/03/01 12:13:06 INFO mapred.JobClient:     CPU time spent (ms)=640
12/03/01 12:13:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=505348096
12/03/01 12:13:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1533403136
12/03/01 12:13:06 INFO mapred.JobClient:     Map output records=16
12/03/01 12:13:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
12/03/01 12:13:06 INFO mapreduce.ImportJobBase: Transferred 1.1045 KB in 8.0517 seconds (140.4665 bytes/sec)
12/03/01 12:13:06 INFO mapreduce.ImportJobBase: Retrieved 16 records.

By default the imported data lands under the user's home directory on HDFS; another directory can be specified with the --target-dir option.
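A minimal sketch of an import into an explicit HDFS directory, reusing the connection values from the transcript above (/tmp/cui1 is an arbitrary example path, and -P is used instead of --password, as the log's own warning suggests):

```shell
# Build the import command as a string so the pieces are easy to inspect.
# Connection values come from the walkthrough; --target-dir overrides the
# default destination under the HDFS home directory.
IMPORT_CMD="sqoop import \
  --connect jdbc:mysql://master:3306/sqoop \
  --username sqoop -P \
  --table t_etl_cmd_relation \
  --target-dir /tmp/cui1 \
  -m 1"
printf '%s\n' "$IMPORT_CMD"
# To run it for real on a configured cluster:
# eval "$IMPORT_CMD"
```

Note that Sqoop refuses to import into a --target-dir that already exists, so pick a fresh path per run.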

You can also import straight into Hive; see the official documentation for details.
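As a hedged sketch of that Hive path (same example connection and table as above; consult the official docs for the full set of Hive options), --hive-import tells Sqoop to load the data into a Hive table after the HDFS copy:

```shell
# Import directly into Hive instead of stopping at HDFS files.
# --hive-import creates the Hive table (if needed) and loads the rows.
HIVE_CMD="sqoop import \
  --connect jdbc:mysql://master:3306/sqoop \
  --username sqoop -P \
  --table t_etl_cmd_relation \
  --hive-import \
  -m 1"
printf '%s\n' "$HIVE_CMD"
# eval "$HIVE_CMD"   # run on a cluster with Hive configured
```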

HDFS to RDB:


[hadoop@master cuirong]$ sqoop export --connect jdbc:mysql://master:3306/sqoop --username sqoop --password sqoop123 --table t_etl_cmd_relation --export-dir /tmp/cui1/part-m-00000
12/03/01 12:52:10 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/03/01 12:52:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/03/01 12:52:10 INFO tool.CodeGenTool: Beginning code generation
12/03/01 12:52:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_etl_cmd_relation` AS t LIMIT 1
12/03/01 12:52:10 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop
12/03/01 12:52:11 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/44fdcaa5fbd99233f89008548b08b370/t_etl_cmd_relation.jar
12/03/01 12:52:11 INFO mapreduce.ExportJobBase: Beginning export of t_etl_cmd_relation
12/03/01 12:52:13 INFO input.FileInputFormat: Total input paths to process : 1
12/03/01 12:52:13 INFO input.FileInputFormat: Total input paths to process : 1
12/03/01 12:52:13 INFO mapred.JobClient: Running job: job_201202212041_15480
12/03/01 12:52:14 INFO mapred.JobClient:  map 0% reduce 0%
12/03/01 12:52:19 INFO mapred.JobClient:  map 100% reduce 0%
12/03/01 12:52:19 INFO mapred.JobClient: Job complete: job_201202212041_15480
12/03/01 12:52:19 INFO mapred.JobClient: Counters: 16
12/03/01 12:52:19 INFO mapred.JobClient:   Job Counters
12/03/01 12:52:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4093
12/03/01 12:52:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/01 12:52:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/01 12:52:19 INFO mapred.JobClient:     Rack-local map tasks=1
12/03/01 12:52:19 INFO mapred.JobClient:     Launched map tasks=1
12/03/01 12:52:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/01 12:52:19 INFO mapred.JobClient:   FileSystemCounters
12/03/01 12:52:19 INFO mapred.JobClient:     HDFS_BYTES_READ=1239
12/03/01 12:52:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=55964
12/03/01 12:52:19 INFO mapred.JobClient:   Map-Reduce Framework
12/03/01 12:52:19 INFO mapred.JobClient:     Map input records=16
12/03/01 12:52:19 INFO mapred.JobClient:     Physical memory (bytes) snapshot=123195392
12/03/01 12:52:19 INFO mapred.JobClient:     Spilled Records=0
12/03/01 12:52:19 INFO mapred.JobClient:     CPU time spent (ms)=680
12/03/01 12:52:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=505348096
12/03/01 12:52:19 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1457479680
12/03/01 12:52:19 INFO mapred.JobClient:     Map output records=16
12/03/01 12:52:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=114
12/03/01 12:52:19 INFO mapreduce.ExportJobBase: Transferred 1.21 KB in 7.2897 seconds (169.9657 bytes/sec)
12/03/01 12:52:19 INFO mapreduce.ExportJobBase: Exported 16 records.

Watch out for field delimiters in the HDFS files: Sqoop assumes comma-separated fields by default, so if the HDFS files use a different delimiter you must pass it explicitly, e.g. --input-fields-terminated-by '\001'.
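Putting that together as a hedged sketch, here is the export from the transcript above extended with an explicit delimiter, for files written with Hive's default '\001' separator (connection values and paths reuse the example; adjust to your data):

```shell
# Export '\001'-delimited HDFS files back into MySQL.
# --input-fields-terminated-by tells Sqoop how the input fields are split;
# without it, Sqoop would expect commas and misparse the rows.
EXPORT_CMD="sqoop export \
  --connect jdbc:mysql://master:3306/sqoop \
  --username sqoop -P \
  --table t_etl_cmd_relation \
  --export-dir /tmp/cui1/part-m-00000 \
  --input-fields-terminated-by '\001'"
printf '%s\n' "$EXPORT_CMD"
# eval "$EXPORT_CMD"   # run on a configured cluster
```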
