Flume with a Twitter source, Kafka channel, and HDFS sink, then reading the Kafka topic with Spark Streaming


The Flume configuration file: kafka_twitter.conf

# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = kafka-channel
TwitterAgent.sinks = sink1

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = 7PPYKH38pXjxdTCMR2gW7idoZ
TwitterAgent.sources.Twitter.consumerSecret = JHaymz2hrb0E95AZBERRYDFPCLhewVdzCkVT1Ws1ZORh3uuOpJ
TwitterAgent.sources.Twitter.accessToken = 2853850382-G876Yy7oSiwFDL3KFiewSuZiIHqUS7BXQ5WOg2v
TwitterAgent.sources.Twitter.accessTokenSecret = Y1tb155NjjJUaM8TNgA9E71GFseYGfZ8VyVEOjDJJ0CsP
TwitterAgent.sources.Twitter.keywords = Trump
TwitterAgent.sources.Twitter.channels = kafka-channel

# Describing/Configuring the sink
TwitterAgent.sinks.sink1.type = hdfs
TwitterAgent.sinks.sink1.hdfs.path = hdfs://serveur-hadoop.hadoop.com:8020/user/ychen/kafka/%{topic}/%y-%m-%d
TwitterAgent.sinks.sink1.hdfs.rollInterval = 5
TwitterAgent.sinks.sink1.hdfs.rollSize = 0
TwitterAgent.sinks.sink1.hdfs.rollCount = 0
TwitterAgent.sinks.sink1.hdfs.fileType = DataStream
TwitterAgent.sinks.sink1.channel = kafka-channel

# Describing/Configuring the channel
TwitterAgent.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
TwitterAgent.channels.kafka-channel.capacity = 10000
TwitterAgent.channels.kafka-channel.transactionCapacity = 100
TwitterAgent.channels.kafka-channel.brokerList = serveur-hadoop.hadoop.com:9092
TwitterAgent.channels.kafka-channel.topic = twitter
TwitterAgent.channels.kafka-channel.zookeeperConnect = 147.135.135.51:2181
TwitterAgent.channels.kafka-channel.parseAsFlumeEvent = true
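With the configuration saved as kafka_twitter.conf, the agent is started in the usual way (the --conf directory here is an assumption; point it at your own Flume configuration directory):

flume-ng agent --conf conf --conf-file kafka_twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console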

The Spark Streaming script: kafka_counting.py

import sys
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_counting.py <zk> <topic>")
        exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    sc.setLogLevel('WARN')
    ssc = StreamingContext(sc, 5)  # 5-second batches

    # Get the ZooKeeper quorum and topic name from the command line
    zkQuorum, topic = sys.argv[1:]

    # Receiver-based stream: one receiver thread on the given topic
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

    # Each record is a (key, message) pair; the JSON tweet is the message body
    parsed = kvs.map(lambda x: json.loads(x[1]))
    parsed.saveAsTextFiles("twitter_test_x")
    parsed.count().map(lambda x: 'Tweets in this batch: %s' % x).pprint()

    ssc.start()
    ssc.awaitTermination()
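To submit it, the Kafka 0.8 receiver connector has to be on the classpath; something along these lines should work (the artifact coordinates are an assumption and must match your Spark and Scala versions; on Spark 1.x the artifact is spark-streaming-kafka_2.10 instead):

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 kafka_counting.py 147.135.135.51:2181 twitter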

Each file works fine on its own: running Flume by itself behaves normally, and Spark Streaming has no trouble reading a producer topic fed with manually typed messages. But whenever I pointed Spark Streaming at the twitter topic, it kept failing with UnicodeDecodeError: 'utf8' can not decode byte…

My first reaction was that the tweets must contain some strange characters, but the twitter files looked perfectly normal. Besides, pyspark reads the topic directly through KafkaUtils.createStream, so there is nowhere in pyspark to change the encoding. Flume showed nothing abnormal either; I agonized over it for a whole day without finding anything to change.

Later I read the Kafka channel section of the Flume documentation more carefully and studied the definition of the KafkaChannel; one of its properties deserves attention:

parseAsFlumeEvent (default: true): Set to true if a Flume source is writing to the channel and expects Avro datums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using.

Just to try something, I changed this to false, and to my surprise my pyspark program could then read the twitter topic in Kafka. But can anyone tell me why? (My guess in hindsight: with parseAsFlumeEvent = true, the channel stores each event in the topic as an Avro-serialized FlumeEvent, so a plain Kafka consumer such as Spark's receiver gets Avro binary rather than raw UTF-8 JSON, and pyspark's default UTF-8 decoding fails on it.)
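In case anyone wants to keep parseAsFlumeEvent = true instead: pyspark's createStream applies a UTF-8 decoder to message keys and values by default, which is exactly where the UnicodeDecodeError is raised. A rough, untested sketch that passes the value through as raw bytes (you would then still have to deserialize the AvroFlumeEvent yourself):

# Untested sketch: override pyspark's default UTF-8 value decoder so the
# Avro binary payload arrives intact instead of raising UnicodeDecodeError.
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},
                              valueDecoder=lambda v: v)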

With the channel fixed, the sink then ran into trouble, complaining that there was no timestamp. That one is easy to solve: add useLocalTimeStamp = true to the HDFS sink in the Flume configuration file.
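Concretely, that is one extra line in kafka_twitter.conf; it makes the sink use the local time when expanding the %y-%m-%d escapes in hdfs.path, instead of requiring a timestamp header on every event:

TwitterAgent.sinks.sink1.hdfs.useLocalTimeStamp = true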

One remaining question: reading the tweets uses parsed = kvs.map(lambda x: json.loads(x[1])), and at first I didn't understand why it has to be x[1]; plain json.loads(x) doesn't work. The reason is that createStream returns a DStream of (key, message) pairs: x[0] is the Kafka message key (typically None here) and x[1] is the message body containing the JSON, so json.loads(x) fails because x is a tuple rather than a string.
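A minimal illustration of the record shape (the literal tweet here is made up for the example):

import json

# Each record from KafkaUtils.createStream is a (key, message) tuple.
record = (None, '{"text": "hello"}')  # the key is typically None here
tweet = json.loads(record[1])         # parse the message body, not the tuple
print(tweet["text"])                  # -> hello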

PS: a tutorial on writing pyspark can be found here:

getting-started-with-spark-streaming-with-python-and-kafka/

