Building and Using Huawei's Open-Source Storage Framework Apache CarbonData on CentOS 7.2

1 Introduction
Apache CarbonData is an indexed columnar data format for big data platforms, contributed to the Apache community by Huawei's big data team; the latest release is currently 1.0.0 (official site: http://carbondata.apache.org). Given that mainstream big data components each target a narrow set of workloads, CarbonData was created to serve different access patterns from a single copy of the data, such as:
  • OLAP
  • Sequential access
  • Random access
2 Environment
Apache CarbonData must be used together with Spark, but Spark itself is not required at build time. The build environment consists of the following:
1. JDK (1.8 is used here)
2. Maven (install with yum on CentOS, or with apt-get on Ubuntu)
3. The source package apache-carbondata-1.0.0-incubating-source-release.zip downloaded from the official site
4. Thrift
Installation steps for Thrift on CentOS 7:
1. Update the System
sudo yum -y update
2. Install the Platform Development Tools
sudo yum -y groupinstall "Development Tools"
3. Upgrade autoconf/automake/bison
sudo yum install -y wget
3.1 Upgrade autoconf
wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
tar xvf autoconf-2.69.tar.gz
cd autoconf-2.69
./configure --prefix=/usr
make
sudo make install
cd ..
3.2 Upgrade automake
wget http://ftp.gnu.org/gnu/automake/automake-1.14.tar.gz
tar xvf automake-1.14.tar.gz
cd automake-1.14
./configure --prefix=/usr
make
sudo make install
cd ..
3.3 Upgrade bison
wget http://ftp.gnu.org/gnu/bison/bison-2.5.1.tar.gz
tar xvf bison-2.5.1.tar.gz
cd bison-2.5.1
./configure --prefix=/usr
make
sudo make install
cd ..
4. Add Optional C++ Language Library Dependencies
4.1 Install C++ Lib Dependencies
sudo yum -y install libevent-devel zlib-devel openssl-devel
4.2 Upgrade Boost >= 1.53
wget http://sourceforge.net/projects/boost/files/boost/1.53.0/boost_1_53_0.tar.gz
tar xvf boost_1_53_0.tar.gz
cd boost_1_53_0
./bootstrap.sh
sudo ./b2 install
5. Build and Install the Apache Thrift IDL Compiler
git clone https://git-wip-us.apache.org/repos/asf/thrift.git
cd thrift
./bootstrap.sh
./configure --with-lua=no
make
At the make step you may hit this error: g++: error: /usr/lib64/libboost_unit_test_framework.a: No such file or directory
The cause is that the Boost libraries built above were installed under /usr/local/lib (the default prefix), so nothing was placed under /usr/lib64/ where the Thrift build looks for them.
Fix: first locate the file with
find / -name libboost_unit_test_framework.a
If it turns up at /usr/local/lib/libboost_unit_test_framework.a, create a symlink and rerun make:
sudo ln -s /usr/local/lib/libboost_unit_test_framework.a /usr/lib64/libboost_unit_test_framework.a
make
sudo make install

On Ubuntu, Thrift can simply be installed with apt-get.
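A minimal sketch for Ubuntu, assuming the distribution ships a prebuilt thrift-compiler package (the package name may vary by release):
sudo apt-get update
sudo apt-get install -y thrift-compiler
thrift --version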
3 Build and Deployment
# unzip apache-carbondata-1.0.0-incubating-source-release.zip
# cd apache-carbondata-1.0.0-incubating-source-release
# mvn clean package -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 -Phadoop-2.6.0
Because of network conditions, downloading jars from Maven Central can be very slow. You can point Maven at an internal corporate repository or an external mirror such as Aliyun by editing /usr/share/maven/conf/settings.xml and adding a mirror inside the <mirrors> section:
<mirror>
  <id>nexus</id>
  <name>nexus</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <mirrorOf>*</mirrorOf>
</mirror>

The pom.xml in apache-carbondata-1.0.0-incubating-source-release can be edited, but not arbitrarily. At the moment CarbonData only supports Spark 1.6 and 2.1: on the 1.6 line the version must be 1.6.2 or later, and the 2.1 line currently goes up to 2.1.0; the supported Hadoop range is wider. I once tried setting the Spark version to 1.6.0 and the build kept failing; in the end I built against Spark 1.6.2 and Hadoop 2.6.0, and the resulting jar still works on a Spark 1.6.0 cluster.
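For reference, a Spark 2.1 build would presumably follow the same pattern with the 2.1 profile; this is a sketch, and the profile name spark-2.1 plus the Hadoop profile are assumptions that should be checked against the profiles actually defined in pom.xml:
# mvn clean package -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 -Phadoop-2.6.0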

After the build finishes, a jar is generated under apache-carbondata-1.0.0-incubating-source-release/assembly/target/scala-2.10, and the build is complete.
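Before deploying, it does not hurt to confirm the shaded jar is really there (the exact file name depends on the version and Hadoop profile used):
# ls assembly/target/scala-2.10/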



4 Testing
At this point CarbonData has been built. Next, copy the built jar into ${SPARK_HOME}/lib on every node and start a spark-shell with it:
# cp carbondata_2.10-1.2.0-SNAPSHOT-shade-hadoop2.6.0.jar /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/lib/
# spark-shell --jars /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/lib/carbondata_2.10-1.2.0-SNAPSHOT-shade-hadoop2.6.0.jar
import org.apache.spark.sql.CarbonContext
import java.io.File
import org.apache.hadoop.hive.conf.HiveConf
val storePath = "hdfs://192.169.77.224:8020/tmp/iteblog3/store"
val cc = new CarbonContext(sc, storePath)
cc.setConf("carbon.kettle.home", "hdfs://192.169.77.224:8020/tmp/iteblog3/store/carbondata/carbonplugins")
cc.setConf("hive.metastore.warehouse.dir", "hdfs://192.169.77.224:8020/tmp/iteblog3/metadata/")
cc.setConf(HiveConf.ConfVars.HIVECHECKFILEFORMAT.varname, "false")
cc.sql("create table if not exists iteblog (id string, hash string) STORED BY 'org.apache.carbondata.format'")
The table is created; now let's load some data into it. I have prepared data similar to the following (file name iteblog3.csv):
id      hash
1802202095,-9223347229018688133
1805433788,-9223224306642795473
1807808238,-9223191974382569971
1803505412,-9222950928798855459
1803603535,-9222783416682807621
1808506900,-9222758602401798041
1805531330,-9222636742915245241
1807853373,-9222324670859328253
# hadoop fs -put iteblog3.csv /tmp/
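Optionally, confirm the file landed on HDFS before loading it:
# hadoop fs -ls /tmp/iteblog3.csv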
cc.sql(s""load data local inpath 'hdfs://192.169.77.224:8020/tmp/iteblog.csv' into table iteblog 
options('DELIMITER'=',', 'FILEHEADER'='id,hash')")
scala> cc.sql("select * from iteblog").show
+----------+--------------------+
| id| hash|
+----------+--------------------+
|1761060630| 1507780651275746626|
|1777010203|-6420079594732250962|
|1777080884|-3720484624594970761|
|1777080885| 6953598621328551083|
|1794379845| 4443483533807209950|
|1794419628|-3898139641996026768|
|1794522657| 5721419051907524948|
|1796358316|-3848539843796297096|
|1796361951| 2673643446784761880|
|1796363022| 7081835511530066760|
|1797689090| 7687516489507000693|
|1798032763| 8729543868018607114|
|1798032765|-2073004072970288002|
|1798933651| 4359618602910211713|
|1799173523| 3862252443910602052|
|1799555536|-2751011281327328990|
|1799569121| 1024477364392595538|
|1799608637| 4403346642796222437|
|1799745227|-2719506638749624471|
|1799859723| 5552725300397463239|
+----------+--------------------+
only showing top 20 rows
scala> cc.sql("select count(*) from iteblog").show
+-------+
| _c0|
+-------+
|7230338|
+-------+
scala> cc.sql("select count(distinct id) from iteblog").show
+-------+
| _c0|
+-------+
|6031231|
+-------+