Spark Streaming + Kafka + Hive + JSON Real-Time Incremental Computation Example

Business architecture:
JavaScript -> Netty -> Kafka -> Spark Streaming + Hive -> Redis -> PHP
1. JavaScript runs as the tracking script and sends data to the backend server.
2. Netty receives the requests, generates a user identifier, filters the data, serializes the raw data as JSON and writes it to Kafka (a minimal producer sketch follows this list).
3. Spark Streaming pulls data from Kafka using the Direct Approach (No Receivers), processes each batch in foreachRDD through Hive, and writes the final results to Redis.
4. PHP reads the aggregated results from Redis and runs the recommendation logic required by the product.
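
The Netty layer's job toward Kafka is just to serialize each tracked event to JSON and publish it. Below is a minimal sketch of that producer step, assuming a topic named "events.json", the three brokers listed under "Node layout", and a hypothetical event payload; the real Netty handler and field names are not shown in this post.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonEventProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker list: the three nodes of the cluster described below.
        props.put("bootstrap.servers", "192.168.163.141:9092,192.168.163.136:9092,192.168.163.137:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

        // Hypothetical event fields; the real payload is defined by the JavaScript tracking script.
        Map<String, Object> event = new HashMap<String, Object>();
        event.put("uid", "u-0001");
        event.put("page", "/item/42");
        event.put("ts", System.currentTimeMillis());

        // Serialize to JSON and publish to the topic that Spark Streaming consumes.
        ObjectMapper mapper = new ObjectMapper();
        producer.send(new ProducerRecord<String, String>("events.json", mapper.writeValueAsString(event)));
        producer.close();
    }
}

The Spark Streaming job in the sample code section then consumes exactly these JSON strings through the direct stream.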

Deployment environment:
CentOS-6.7-x86_64 2.6.32-573.22.1.el6.x86_64
jdk1.8.0_77
spark-1.6.1
scala-2.10.6 (the version required by spark-1.6.1)
hadoop-2.7.2 (at the time of testing, prebuilt spark-1.6.1 targets hadoop-2.6.x; for newer Hadoop versions you have to build Spark yourself)
kafka_2.10-0.9.0.1 (must match the Scala version)

Development environment:
Windows 10 Professional
myeclipse-2015-stable-1.0
jdk1.7.0_80

Node layout:
192.168.163.141 CoS6-Node1
192.168.163.136 CoS6-Node2
192.168.163.137 CoS6-Node3

1. Start ZooKeeper: /opt/zookeeper-3.4.8/bin/zkServer.sh start (on 3 nodes)
2. Start Yarn: /opt/hadoop-2.7.2/sbin/start-all.sh (on 1 node)
3. Start ZKFC: /opt/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc (on 2 nodes)
4. Start Spark: /opt/spark-1.6.1/sbin/start-all.sh (on 1 node)
5. Start Kafka: /opt/kafka_2.10-0.9.0.1/bin/kafka-server-start.sh /opt/kafka_2.10-0.9.0.1/config/server.properties > /data/kafka/run.out & (on 3 nodes)


Dependencies (pom.xml):
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.8.2.1</version>
        </dependency>
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.36.Final</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.8.1</version>
        </dependency>
    </dependencies>

JVM arguments:
-Dhadoop.home.dir="D:\\Workspaces\\MyEclipse Professional 2014\\hadoop-common-2.2.0-bin"
-Dhive.exec.scratchdir="C:\\Users\\Ouyang\\AppData\\Local\\Temp\\hive"
-XX:PermSize=128M   
-XX:MaxPermSize=4096M

Sample code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SparkJson {

    public static void main(String[] args) {
        Configuration config = Configuration.getInstance();
        SparkConf conf = new SparkConf()
                .setMaster(config.getProperty("master"))
                .setAppName(config.getProperty("app.name"));
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(500));
        final HiveContext sqlContext = new HiveContext(jssc.sparkContext());

        Set<String> topicsSet = new HashSet<String>(Arrays.asList(config.getProperty("topic.json")));
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", config.getProperty("metadata.broker.list"));

        // Create direct kafka stream with brokers and topics
        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet
        );

        // Get the lines, load to sqlContext
        messages.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
            private static final long serialVersionUID = 1L;

            public void call(JavaPairRDD<String, String> t) throws Exception {
                if (t.count() < 1) return;
                DataFrame df = sqlContext.read().json(t.values());
                df.show();
            }
        });

        // Start the computation
        jssc.start();
        jssc.awaitTermination();
    }
}
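
Configuration here is a project-local properties helper that the post does not show. The sample also stops at df.show(); in the real pipeline the per-batch results go to Redis for PHP to read. A hedged sketch of that last step with the Jedis client from the dependency list (the temp table name, the HiveQL, the Redis key layout and the "page" field are assumptions, not taken from the original code):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

import redis.clients.jedis.Jedis;

public class RedisSink {

    /** Aggregates one micro-batch with HiveQL and writes the counts to Redis. */
    public static void writeCounts(HiveContext sqlContext, DataFrame df, String redisHost, int redisPort) {
        // Register the batch as a temporary table so HiveQL can run over it (assumed field "page").
        df.registerTempTable("events");
        DataFrame counts = sqlContext.sql("SELECT page, COUNT(*) AS pv FROM events GROUP BY page");

        Jedis jedis = new Jedis(redisHost, redisPort);
        try {
            for (Row row : counts.collect()) {
                // Hypothetical key layout: one hash field per page, incremented by the batch count.
                jedis.hincrBy("stats:pv", row.getString(0), row.getLong(1));
            }
        } finally {
            jedis.close();
        }
    }
}

This would be called from inside the foreachRDD shown above; counts.collect() runs on the driver and is assumed to stay small because it only holds per-batch aggregates.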

Problems and solutions:
1. Missing winutils.exe
Exception:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
    at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
    at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
    at org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:730)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:215)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
Solution:
① Download hadoop-common and build it following BUILDING.txt; on Windows the build depends on zlib and the Windows SDK, so the process is fairly involved.
Project: https://github.com/apache/hadoop-common
② Alternatively, search GitHub for winutils.exe and download a pre-built copy.
Example: https://github.com/srccodes/hadoop-common-2.2.0-bin
③ Set HADOOP_HOME and copy all the files into its bin directory; a full Hadoop installation is not required.
④ If HADOOP_HOME does not take effect, specify the directory as a JVM argument:
-Dhadoop.home.dir="D:\\Workspaces\\MyEclipse Professional 2014\\hadoop-common-2.2.0-bin"
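⑤ The same property can also be set programmatically, as long as it happens before the first HiveContext is created. A sketch; the path is only an example:
    // Must run before Hadoop's Shell class is first loaded (i.e. before creating the HiveContext),
    // so that winutils.exe can be found under <hadoop.home.dir>\bin.
    System.setProperty("hadoop.home.dir", "D:\\Workspaces\\MyEclipse Professional 2014\\hadoop-common-2.2.0-bin");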

2. The default hive.exec.scratchdir directory /tmp/hive does not map to a valid disk partition on Windows.
Exception:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
    at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
    ... 12 more
Solution:
① Tracing through the code shows that HiveConf picks up the relevant settings from SparkConf; starting the analysis from rootHDFSDirPath makes it much simpler.
------org.apache.hadoop.hive.ql.session.SessionState
  /**
   * Create the root scratch dir on hdfs (if it doesn't already exist) and make it writable
   * @param conf
   * @return
   * @throws IOException
   */
  private Path createRootHDFSDir(HiveConf conf) throws IOException {
    Path rootHDFSDirPath = new Path(HiveConf.getVar(conf, HiveConf.ConfVars.SCRATCHDIR));
    FsPermission writableHDFSDirPermission = new FsPermission((short)00733);
    FileSystem fs = rootHDFSDirPath.getFileSystem(conf);

------org.apache.hadoop.fs.FileSystem
  public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
  }

------org.apache.hadoop.fs.CommonConfigurationKeysPublic
 /** See <a href="{@docRoot}/../core-default.html">core-default.xml</a> */
  public static final String  FS_DEFAULT_NAME_KEY = "fs.defaultFS";
  /** Default value for FS_DEFAULT_NAME_KEY */
  public static final String  FS_DEFAULT_NAME_DEFAULT = "file:///";
② Set the JVM argument corresponding to hive.exec.scratchdir:
-Dhive.exec.scratchdir="C:\\Users\\Ouyang\\AppData\\Local\\Temp\\hive"
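③ Equivalently, since HiveConf also reads matching Java system properties during initialization, the directory can be set programmatically before the HiveContext is constructed. A sketch; the path is only an example:
    // Same effect as the -D flag above; must run before new HiveContext(...).
    System.setProperty("hive.exec.scratchdir", "C:\\Users\\Ouyang\\AppData\\Local\\Temp\\hive");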

3. Out of memory: OutOfMemoryError: PermGen space
Exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
    at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
Caused by: java.lang.OutOfMemoryError: PermGen space
Solution:
① Add the following JVM arguments:
-XX:PermSize=128M   
-XX:MaxPermSize=4096M

4. Incompatible Kafka version
Exception:
pom.xml
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.9.0.1</version>
        </dependency>
Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
    at scala.util.Either$RightProjection.flatMap(Either.scala:523)
    at org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
    at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
    at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
    at scala.util.Either$RightProjection.flatMap(Either.scala:523)
    at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
    at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:34)
Solution:
① Check the official documentation for the Kafka version this Spark Streaming release is compatible with.
Documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Kafka: Spark Streaming 1.6.1 is compatible with Kafka 0.8.2.1. See the Kafka Integration Guide for more details.
② Change the corresponding version in pom.xml back to 0.8.2.1 (as in the dependency list above); the Kafka installation on the servers can stay on the newer version.

5. Incompatible fasterxml.jackson version
Exception:
pom.xml
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.7.4</version>
        </dependency>
Exception in thread "main" java.lang.VerifyError: class com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final method withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.fasterxml.jackson.module.scala.ser.IteratorSerializerModule$class.$init$(IteratorSerializerModule.scala:70)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
    at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)
    at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
    at org.apache.spark.streaming.dstream.InputDStream.<init>(InputDStream.scala:78)
    at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.<init>(DirectKafkaInputDStream.scala:56)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:485)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
    at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:34)
Solution:
① Newer fasterxml.jackson releases changed the modifier of the method in question; check the release notes for that method to confirm the last compatible version.
② Change the corresponding version in pom.xml (2.4.4, as in the dependency list above, works).