A First Try of DataX, Alibaba's Offline Data Synchronization Tool

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 10:31

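DataX: a framework for offline synchronization between heterogeneous data sources, built around a plugin system. Reader plugins read data in, writer plugins write data out, and the framework in between can host transform plugins that perform whatever data conversion is needed.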
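The reader / transform / writer split can be illustrated with a small conceptual sketch. This is not DataX's plugin API (DataX plugins are written in Java); every name below (list_reader, cast_pv_to_int, list_writer, run_job) is hypothetical and only shows the shape of the data flow.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def list_reader(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical reader plugin: yields records from an in-memory source."""
    for row in rows:
        yield dict(row)


def cast_pv_to_int(record: Record) -> Record:
    """Hypothetical transform plugin: converts the 'pv' field to an int."""
    out = dict(record)
    out["pv"] = int(out["pv"])
    return out


def list_writer(records: Iterable[Record], sink: List[Record]) -> int:
    """Hypothetical writer plugin: appends records to a list, returns the count."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_job(
    reader: Iterable[Record],
    transforms: List[Callable[[Record], Record]],
    sink: List[Record],
) -> int:
    """Framework role: chain reader -> transforms -> writer."""
    records: Iterable[Record] = reader
    for transform in transforms:
        records = map(transform, records)
    return list_writer(records, sink)


if __name__ == "__main__":
    target: List[Record] = []
    source = [{"cityid": "1", "searchstr": "hotel", "pv": "3"}]
    written = run_job(list_reader(source), [cast_pv_to_int], target)
    print(written, target)
```

The point of the design is that the reader and the writer never know about each other, so any reader can in principle be paired with any writer.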

Sqoop only supports data synchronization between relational databases and HDFS/Hive, whereas DataX covers a wider range of data sources.

Currently supported data sources: https://github.com/alibaba/DataX/wiki/DataX-all-data-channels

Usage:

$ tar zxvf datax.tar.gz
$ sudo chmod -R 755 {YOUR_DATAX_HOME}
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py ../job/job.json
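When the job is launched from a scheduler or another script, the last command can be wrapped in a few lines of Python. This is a minimal sketch: build_datax_command and run_datax are hypothetical helper names, and the interpreter name "python" is simply taken from the command above.

```python
import os
import subprocess
from typing import List


def build_datax_command(
    datax_home: str, job_file: str, python_bin: str = "python"
) -> List[str]:
    """Build the argument list for: python {DATAX_HOME}/bin/datax.py <job file>."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_bin, launcher, job_file]


def run_datax(datax_home: str, job_file: str) -> int:
    """Run the job and return the exit code of the datax.py process."""
    result = subprocess.run(build_datax_command(datax_home, job_file))
    return result.returncode
```

Passing an absolute path for the job file avoids depending on the current working directory, which the relative path ../job/job.json does.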

Example JSON configuration (Mongo > HDFS/Hive):

mongotest.json

{
    "job": {
        "setting": {
            "speed": {
                "channel": "2"
            }
        },
        "content": [{
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": [""],
                        "userName": "",
                        "userPassword": "",
                        "dbName": "",
                        "collectionName": "",
                        "column": [{
                                "name": "cityid",
                                "type": "string"
                            }, {
                                "name": "searchstr",
                                "type": "string"
                            }, {
                                "name": "pv",
                                "type": "string"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [{
                                "name": "cityid",
                                "type": "string"
                            }, {
                                "name": "searchstr",
                                "type": "int"
                            }, {
                                "name": "pv",
                                "type": "int"
                            }
                        ],
                        "defaultFS": "hdfs://*",
                        "fieldDelimiter": "\t",
                        "fileName": "mongotest",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/temp.db/mongotest",
                        "writeMode": "append"
                    }
                }
            }
        ]
    }
}
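Before running a job it can help to compare the reader and writer column lists. The sketch below assumes that columns are matched by position and that each column entry has "name" and "type" keys, as in the configuration above. The helper names load_job and column_type_differences are hypothetical and not part of DataX.

```python
import json
from typing import List, Tuple


def load_job(path: str) -> dict:
    """Load a DataX job file such as mongotest.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def column_type_differences(job: dict) -> List[Tuple[str, str, str]]:
    """Return (writer column name, reader type, writer type) where types differ.

    Raises ValueError when the two column lists have different lengths.
    """
    content = job["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    if len(reader_cols) != len(writer_cols):
        raise ValueError(
            "reader has %d columns, writer has %d"
            % (len(reader_cols), len(writer_cols))
        )
    differences = []
    for reader_col, writer_col in zip(reader_cols, writer_cols):
        if reader_col["type"] != writer_col["type"]:
            differences.append(
                (writer_col["name"], reader_col["type"], writer_col["type"])
            )
    return differences
```

Applied to the mongotest.json configuration above, this reports searchstr and pv, which are read as string and written as int. A difference is not necessarily an error, but it is worth checking against the column types of the target Hive table.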

Synchronization steps:

  1. Create the Hive table temp.mongotest.
  2. Run python {DATAX_HOME}/bin/datax.py ../mongotest.json
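For step 1, the table definition has to agree with the writer settings: the same columns, the same field delimiter, and a text file format. The original post does not show its DDL, so the sketch below derives one from the writer's column list under several assumptions: the writer types are used directly as Hive types, the table is a plain text-format table, and hive_ddl is a hypothetical helper name.

```python
from typing import Dict, List


def hive_ddl(table: str, columns: List[Dict[str, str]], field_delimiter: str) -> str:
    """Build a CREATE TABLE statement matching a text-format hdfswriter config."""
    column_lines = ",\n".join(
        "  {} {}".format(col["name"], col["type"].upper()) for col in columns
    )
    # Render control characters such as a tab as an escape sequence (\t).
    delimiter = field_delimiter.encode("unicode_escape").decode("ascii")
    return (
        "CREATE TABLE IF NOT EXISTS {} (\n{}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '{}'\n"
        "STORED AS TEXTFILE"
    ).format(table, column_lines, delimiter)


if __name__ == "__main__":
    writer_columns = [
        {"name": "cityid", "type": "string"},
        {"name": "searchstr", "type": "int"},
        {"name": "pv", "type": "int"},
    ]
    print(hive_ddl("temp.mongotest", writer_columns, "\t"))
```

The writer path /user/hive/warehouse/temp.db/mongotest corresponds to the default warehouse location of a managed table named temp.mongotest, which is presumably why the files become visible to Hive without an extra load step.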