A First Look at Spark SQL on Hive
A while ago the Shark project stopped development, and SQL-on-Spark split into two directions: Spark SQL on Hive and Hive on Spark. Hive on Spark will likely take a long time to reach a usable state, so I decided to try out Spark SQL on Hive as a gradual replacement for our current MapReduce-on-Hive workloads.
The version I tested is Spark 1.0.0. To get Hive support, Spark must be recompiled from source, and the build command has changed slightly:
```shell
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phive -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package
```
I wrote a fairly simple test program (imports added here for completeness):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SqlOnHive")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext._
hql("FROM tmp.test SELECT id limit 1").foreach(println)
```
After compiling, I exported the code as a jar. Since the cluster runs in standalone mode, I first submitted the job with `java -cp`; before submitting, hive-site.xml has to be copied into $SPARK_HOME/conf:

```shell
java -XX:PermSize=256M -cp /home/hadoop/hql.jar com.yintai.spark.sql.SqlOnHive spark://h031:7077
```

After submitting, the job throws an exception:
```
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
    at org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:187)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 27 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
    ... 34 more
```
The fix is to set the relevant environment variables in spark-env.sh:
```shell
SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/path/to/your/hadoop-lzo/libs/native
SPARK_CLASSPATH=$SPARK_CLASSPATH:/path/to/your/hadoop-lzo/java/libs
```
After updating the environment variables and resubmitting, the job fails again with a different error:
```
14/07/23 10:25:19 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named tmp)
    at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:431)
    at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:441)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
    at com.sun.proxy.$Proxy9.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:628)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)
    at com.sun.proxy.$Proxy10.get_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:810)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy11.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1139)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1128)
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3479)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
14/07/23 10:25:19 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: tmp
    at org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
    at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:249)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:246)
    at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
    at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
    at com.yintai.spark.sql.SqlOnHive$.main(SqlOnHive.scala:20)
    at com.yintai.spark.sql.SqlOnHive.main(SqlOnHive.scala)
```
The cause of this error is that the Spark program failed to load hive-site.xml, so it never learned the address of the remote metastore service and fell back to looking in the local Derby database, where the table metadata naturally does not exist. Spark SQL loads hive-site.xml by instantiating the HiveConf class, the same way the Hive CLI does:

```java
ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
if (classLoader == null) {
  classLoader = HiveConf.class.getClassLoader();
}
hiveDefaultURL = classLoader.getResource("hive-default.xml");
// Look for hive-site.xml on the CLASSPATH and log its location if found.
hiveSiteURL = classLoader.getResource("hive-site.xml");
```
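The lookup above can be reproduced in isolation. This is a minimal sketch (the class name `ResourceLookup` is mine, not from HiveConf): when hive-site.xml is not on the classpath, `getResource` simply returns null, which is why Hive silently falls back to its defaults, i.e. a local Derby metastore.

```java
// Sketch of HiveConf's classpath lookup: getResource returns null when
// the resource is absent, with no exception thrown.
public class ResourceLookup {
    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ResourceLookup.class.getClassLoader();
        }
        // If hive-site.xml is not on the classpath, url is null, so Hive
        // never sees the remote metastore address configured in that file.
        java.net.URL url = cl.getResource("hive-site.xml");
        System.out.println(url == null
            ? "hive-site.xml not found on classpath"
            : url.toString());
    }
}
```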
Submitting with `java -cp` cannot set up the required classpath correctly. Spark 1.0.0 added the spark-submit script for submitting applications, so I switched to it:

```shell
/usr/lib/spark/bin/spark-submit --class com.yintai.spark.sql.SqlOnHive \
  --master spark://h031:7077 \
  --executor-memory 1g \
  --total-executor-cores 1 \
  /home/hadoop/hql.jar
```
During submission this script sets the spark.executor.extraClassPath and spark.driver.extraClassPath properties in SparkConf, which ensures the required configuration files are loaded correctly. With that, the test succeeded.
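As a sketch of what spark-submit arranges, the same two properties can also be set directly in conf/spark-defaults.conf. The path below is an assumption based on the $SPARK_HOME used in the commands above; substitute whatever directory actually holds hive-site.xml:

```properties
# Sketch (assumed path): expose the conf directory containing hive-site.xml
# to both the driver and the executors.
spark.driver.extraClassPath    /usr/lib/spark/conf
spark.executor.extraClassPath  /usr/lib/spark/conf
```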
Spark SQL on Hive is currently compatible with most Hive syntax and UDFs, and uses the Catalyst framework for SQL parsing; jobs run considerably faster than under Hive. The current version still has some bugs and stability issues, though, so further testing will have to wait for the next stable release.
References
http://spark.apache.org/docs/1.0.0/sql-programming-guide.html
http://hsiamin.com/posts/2014/05/03/enable-lzo-compression-on-hadoop-pig-and-spark/
This article originally appeared on the "17的博客" blog; please keep this attribution: http://xiaowuliao.blog.51cto.com/3681673/1441737