Learning Storm, Chapter 8 (2)


Create a new Maven project, kafkaLogProducer.

Use com.learningstorm as the groupId and kafkaLogProducer as the artifactId.

The pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.learningstorm</groupId>
  <artifactId>kafkaLogProducer</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>kafkaLogProducer</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.8.0</version>
      <exclusions>
        <exclusion>
          <groupId>com.sun.jdmk</groupId>
          <artifactId>jmxtools</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.sun.jmx</groupId>
          <artifactId>jmxri</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-slf4j-impl</artifactId>
      <version>2.0-beta9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-1.2-api</artifactId>
      <version>2.0-beta9</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>1.2.1</version>
        <executions>
          <execution>
            <goals>
              <goal>exec</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <executable>java</executable>
          <includeProjectDependencies>true</includeProjectDependencies>
          <includePluginDependencies>false</includePluginDependencies>
          <classpathScope>compile</classpathScope>
          <mainClass>com.learningstorm.kafkaLogProducer.KafkaProducer</mainClass>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

The KafkaProducer class:

package com.learningstorm.kafkaLogProducer;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaProducer {

    public static void main(String[] args) {
        // Build the configuration required for connecting to Kafka
        Properties props = new Properties();

        // List of Kafka brokers. The complete list of brokers is not required,
        // as the producer will auto-discover the rest of the brokers.
        props.put("metadata.broker.list", "localhost:9092");

        // Serializer used for sending data to Kafka. Since we are sending
        // strings, we use StringEncoder.
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        // We want acks from Kafka that messages were properly received.
        props.put("request.required.acks", "1");

        // Create the producer instance
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, String> producer = new Producer<String, String>(config);

        try {
            FileInputStream fstream = new FileInputStream("./src/main/resources/apache_test.log");
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String strLine;
            /* read the log file line by line */
            while ((strLine = br.readLine()) != null) {
                KeyedMessage<String, String> data =
                        new KeyedMessage<String, String>("apache_log", strLine);
                producer.send(data);
            }
            br.close();
            fstream.close();
        } catch (Exception e) {
            throw new RuntimeException("Error occurred while sending records to Kafka", e);
        } finally {
            // close the producer
            producer.close();
        }
    }
}

Running KafkaLogProducer

1. Start ZooKeeper:

hadoop@moon:/usr/local/cloud/zookeeper-3.4.6$ ./bin/zkServer.sh start &
[2] 11035
hadoop@moon:/usr/local/cloud/zookeeper-3.4.6$ JMX enabled by default
Using config: /usr/local/cloud/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

2. Start Kafka:

./bin/kafka-server-start.sh config/server.properties &

3. Create the topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic apache_log
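
You can verify that the topic exists with the --describe option of the same script:

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic apache_log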

4. Run the Kafka producer:

mvn compile exec:java

Note: this runs the producer without packaging it. When I instead used

mvn package

to build a jar and ran it with java -jar kafkaLogProducer-0.0.1-SNAPSHOT.jar, it failed with:
"no main manifest attribute in kafkaLogProducer-0.0.1-SNAPSHOT.jar". One suggested fix is to edit the manifest after packaging: open the generated jar, find the MANIFEST.MF file inside, and look for the Main-Class: line, which is empty. Append the name of your main class after it; for example, if your source file is Exec.java, which compiles to Exec.class, write Exec. Make sure to press Enter after the name so the line ends with a newline; only then can the JVM locate your main class when running the jar. I will try this later.
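
Rather than hand-editing the manifest, the cleaner fix is to let Maven do it. The snippet below is a minimal sketch (not from the book) using the maven-shade-plugin, which both sets Main-Class in the manifest and bundles the Kafka dependencies into the jar; note that a correct Main-Class alone would still fail at runtime with ClassNotFoundException because the dependencies are not on the classpath:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Writes Main-Class into MANIFEST.MF of the shaded jar -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.learningstorm.kafkaLogProducer.KafkaProducer</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

With this added under build/plugins in pom.xml, mvn package should produce a jar that java -jar can run directly.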

5. View the messages with the console consumer:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic apache_log --from-beginning
4.19.162.143 - - [4-03-2011:06:20:31 -0500] "GET / HTTP/1.1" 200 864 "http://www.adeveloper.com/resource.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"
4.19.162.152 - - [4-03-2011:06:20:31 -0500] "GET / HTTP/1.1" 200 864 "http://www.adeveloper.com/resource.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"
4.20.73.15 - - [4-03-2011:06:20:31 -0500] "GET / HTTP/1.1" 200 864 "http://www.adeveloper.com/resource.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"
4.20.73.32 - - [4-03-2011:06:20:31 -0500] "GET / HTTP/1.1" 200 864 "http://www.adeveloper.com/resource.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"

Log processing

Create a topology that reads data from Kafka with a KafkaSpout, processes the Apache server log lines, and stores the results in MySQL.

The ApacheLogSplitter class contains the logic to extract the different elements, such as ip, referrer, user-agent, and so on, from an Apache log line:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This class contains the logic to parse an Apache
 * logfile with regular expressions.
 */
public class ApacheLogSplitter {

    public Map<String, Object> logSplitter(String apacheLog) {
        String logEntryLine = apacheLog;
        // Regex pattern to fetch the different
        // properties from a log line.
        String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w-:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(logEntryLine);
        Map<String, Object> logMap = new HashMap<String, Object>();
        if (!matcher.matches() || 9 != matcher.groupCount()) {
            System.err.println("Bad log entry (or problem with RE?):");
            System.err.println(logEntryLine);
            return logMap;
        }
        // set the ip, dateTime, request, and so on into the map.
        logMap.put("ip", matcher.group(1));
        logMap.put("dateTime", matcher.group(4));
        logMap.put("request", matcher.group(5));
        logMap.put("response", matcher.group(6));
        logMap.put("bytesSent", matcher.group(7));
        logMap.put("referrer", matcher.group(8));
        logMap.put("useragent", matcher.group(9));
        return logMap;
    }
}
The input for the logSplitter(String apacheLog) method is as follows:

98.83.179.51 - - [18/May/2011:19:35:08 -0700] \"GET /css/main.css HTTP/1.1\" 200 1837 \"http://www.safesand.com/information.htm\" \"Mozilla/5.0 (Windows NT 6.0; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1\"

The output of the logSplitter(String apacheLog) method is as follows:

{response=200, referrer=http://www.safesand.com/information.htm, bytesSent=1837, useragent=Mozilla/5.0 (Windows NT 6.0; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1, dateTime=18/May/2011:19:35:08 -0700, request=GET /css/main.css HTTP/1.1, ip=98.83.179.51}
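
As a quick sanity check, the following minimal sketch (not part of the book's code) feeds the sample line above through logSplitter and prints the resulting map:

public class ApacheLogSplitterTest {
    public static void main(String[] args) {
        ApacheLogSplitter splitter = new ApacheLogSplitter();
        String line = "98.83.179.51 - - [18/May/2011:19:35:08 -0700] "
                + "\"GET /css/main.css HTTP/1.1\" 200 1837 "
                + "\"http://www.safesand.com/information.htm\" "
                + "\"Mozilla/5.0 (Windows NT 6.0; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1\"";
        // Prints the seven extracted fields, or an empty map on a parse failure.
        System.out.println(splitter.logSplitter(line));
    }
}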
The ApacheLogSplitterBolt class extends backtype.storm.topology.base.BaseBasicBolt. Its execute() method receives tuples (log lines) from the KafkaSpout and calls the logSplitter(String apacheLog) method of ApacheLogSplitter to process each line:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class ApacheLogSplitterBolt extends BaseBasicBolt {

    private static final long serialVersionUID = 1L;

    // Create the instance of the ApacheLogSplitter class.
    private static final ApacheLogSplitter apacheLogSplitter = new ApacheLogSplitter();

    private static final List<String> LOG_ELEMENTS = new ArrayList<String>();
    static {
        LOG_ELEMENTS.add("ip");
        LOG_ELEMENTS.add("dateTime");
        LOG_ELEMENTS.add("request");
        LOG_ELEMENTS.add("response");
        LOG_ELEMENTS.add("bytesSent");
        LOG_ELEMENTS.add("referrer");
        LOG_ELEMENTS.add("useragent");
    }

    public void execute(Tuple input, BasicOutputCollector collector) {
        // Get the Apache log line from the tuple
        String log = input.getString(0);
        if (StringUtils.isBlank(log)) {
            // Ignore blank lines
            return;
        }
        // Call the logSplitter(String apacheLog) method
        // of the ApacheLogSplitter class.
        Map<String, Object> logMap = apacheLogSplitter.logSplitter(log);
        List<Object> logdata = new ArrayList<Object>();
        for (String element : LOG_ELEMENTS) {
            logdata.add(logMap.get(element));
        }
        // emit the set of fields (ip, referrer, user-agent,
        // bytesSent, and so on)
        collector.emit(logdata);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Specify the names of the output fields.
        declarer.declare(new Fields("ip", "dateTime", "request", "response",
                "bytesSent", "referrer", "useragent"));
    }
}
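
Because this bolt extends BaseBasicBolt, Storm automatically anchors the emitted tuple to the input tuple and acks the input once execute() returns, so no explicit reliability handling is needed here.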

The output of the ApacheLogSplitterBolt class contains seven fields. These fields are ip, dateTime, request, response, bytesSent, referrer, and useragent.
Identifying the country name, operating system, and browser type

Below is the IpToCountryConverter class. It has a parameterized constructor that takes the path of the GeoLiteCity.dat file; this file must be present at the same location on every node, and it is the database we use to resolve a given IP address to a country name:

import com.maxmind.geoip.Location;
import com.maxmind.geoip.LookupService;

/**
 * This class contains the logic to identify
 * the country name from the IP address.
 */
public class IpToCountryConverter {

    private static LookupService cl = null;

    /**
     * A parameterized constructor which takes the
     * location of the GeoLiteCity.dat file as input.
     *
     * @param pathTOGeoLiteCityFile
     */
    public IpToCountryConverter(String pathTOGeoLiteCityFile) {
        try {
            // Pass the path variable (not a literal) to LookupService.
            cl = new LookupService(pathTOGeoLiteCityFile,
                    LookupService.GEOIP_MEMORY_CACHE);
        } catch (Exception exception) {
            throw new RuntimeException(
                    "Error occurred while initializing IpToCountryConverter class", exception);
        }
    }

    /**
     * This method takes an IP address as input and
     * converts it into a country name.
     *
     * @param ip
     * @return
     */
    public String ipToCountry(String ip) {
        Location location = cl.getLocation(ip);
        if (location == null) {
            return "NA";
        }
        if (location.countryName == null) {
            return "NA";
        }
        return location.countryName;
    }
}
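
A minimal usage sketch, assuming the GeoLiteCity.dat file has been placed under src/main/resources:

IpToCountryConverter converter =
        new IpToCountryConverter("./src/main/resources/GeoLiteCity.dat");
// Prints the country name for the IP, or "NA" when no location is found.
System.out.println(converter.ipToCountry("98.83.179.51"));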

The UserAgentTools class identifies the operating system and browser type from the useragent string.

The UserInformationGetterBolt class uses the UserAgentTools and IpToCountryConverter classes to identify the country name, operating system, and browser type:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/**
 * This class uses the IpToCountryConverter and
 * UserAgentTools classes to identify the country,
 * os, and browser from the log line.
 */
public class UserInformationGetterBolt extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private IpToCountryConverter ipToCountryConverter = null;
    private UserAgentTools userAgentTools = null;
    public OutputCollector collector;
    private String pathTOGeoLiteCityFile;

    public UserInformationGetterBolt(String pathTOGeoLiteCityFile) {
        // set the path of the GeoLiteCity.dat file.
        this.pathTOGeoLiteCityFile = pathTOGeoLiteCityFile;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ip", "dateTime", "request", "response",
                "bytesSent", "referrer", "useragent", "country", "browser", "os"));
    }

    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.ipToCountryConverter = new IpToCountryConverter(this.pathTOGeoLiteCityFile);
        this.userAgentTools = new UserAgentTools();
    }

    public void execute(Tuple input) {
        String ip = input.getStringByField("ip").toString();

        // Identify the country using the IP address
        Object country = ipToCountryConverter.ipToCountry(ip);

        // Identify the browser using the useragent
        Object browser = userAgentTools.getBrowser(
                input.getStringByField("useragent").toString())[1];

        // Identify the os using the useragent
        Object os = userAgentTools.getOS(
                input.getStringByField("useragent").toString())[1];

        collector.emit(new Values(input.getString(0), input.getString(1),
                input.getString(2), input.getString(3), input.getString(4),
                input.getString(5), input.getString(6), country, browser, os));
    }
}

The output of the UserInformationGetterBolt class contains ten fields:

ip, dateTime, request, response, bytesSent, referrer, useragent, country, browser, and os.

The KeywordGenerator class extracts the search keyword from the referrer URL:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This class takes the referrer URL as the input,
 * analyzes the URL, and returns the searched
 * keyword as the output.
 */
public class KeywordGenerator {

    public String getKeyword(String referer) {
        String[] temp;
        Pattern pat = Pattern.compile("[?&#]q=([^&]+)");
        Matcher m = pat.matcher(referer);
        if (m.find()) {
            String searchTerm = m.group(1);
            temp = searchTerm.split("\\+");
            searchTerm = temp[0];
            for (int i = 1; i < temp.length; i++) {
                searchTerm = searchTerm + " " + temp[i];
            }
            return searchTerm;
        } else {
            pat = Pattern.compile("[?&#]p=([^&]+)");
            m = pat.matcher(referer);
            if (m.find()) {
                String searchTerm = m.group(1);
                temp = searchTerm.split("\\+");
                searchTerm = temp[0];
                for (int i = 1; i < temp.length; i++) {
                    searchTerm = searchTerm + " " + temp[i];
                }
                return searchTerm;
            } else {
                // Also check for "query=" parameters.
                pat = Pattern.compile("[?&#]query=([^&]+)");
                m = pat.matcher(referer);
                if (m.find()) {
                    String searchTerm = m.group(1);
                    temp = searchTerm.split("\\+");
                    searchTerm = temp[0];
                    for (int i = 1; i < temp.length; i++) {
                        searchTerm = searchTerm + " " + temp[i];
                    }
                    return searchTerm;
                } else {
                    return "NA";
                }
            }
        }
    }
}

The input for the KeywordGenerator class is as follows:

https://in.search.yahoo.com/search;_ylt=AqH0NZe1hgPCzVap0PdKk7GuitIF?p=india+live+score&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-704

Then, the output of the KeywordGenerator class is as follows:

india live score
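
The same lookup in code, as a minimal sketch:

KeywordGenerator keywordGenerator = new KeywordGenerator();
// Extracts "india live score" from the p= query parameter.
System.out.println(keywordGenerator.getKeyword(
        "https://in.search.yahoo.com/search;_ylt=AqH0NZe1hgPCzVap0PdKk7GuitIF"
        + "?p=india+live+score&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-704"));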

The KeyWordIdentifierBolt class calls the KeywordGenerator class to extract the keyword:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/**
 * This class uses the KeywordGenerator class
 * to extract the keyword from the referrer URL.
 */
public class KeyWordIdentifierBolt extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private KeywordGenerator keywordGenerator = null;
    public OutputCollector collector;

    public KeyWordIdentifierBolt() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ip", "dateTime", "request", "response",
                "bytesSent", "referrer", "useragent", "country", "browser",
                "os", "keyword"));
    }

    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.keywordGenerator = new KeywordGenerator();
    }

    public void execute(Tuple input) {
        String referrer = input.getStringByField("referrer").toString();
        // Call the getKeyword(String referrer) method of the
        // KeywordGenerator class to extract the keyword.
        Object keyword = keywordGenerator.getKeyword(referrer);
        // emit all the fields emitted by the previous bolt,
        // plus the keyword
        collector.emit(new Values(input.getString(0), input.getString(1),
                input.getString(2), input.getString(3), input.getString(4),
                input.getString(5), input.getString(6), input.getString(7),
                input.getString(8), input.getString(9), keyword));
    }
}

The output of the KeyWordIdentifierBolt class contains 11 fields. These fields are ip, dateTime, request, response, bytesSent, referrer, useragent, country, browser, os, and keyword.
Create the MySQLConnection class, which returns a MySQL connection:

import java.sql.Connection;
import java.sql.DriverManager;

/**
 * This class returns a MySQL connection.
 */
public class MySQLConnection {

    private static Connection connect = null;

    /**
     * This method returns the MySQL connection.
     *
     * @param ip       IP address of the MySQL server
     * @param database name of the database
     * @param user     name of the user
     * @param password password of the given user
     * @return MySQL connection
     */
    public static Connection getMySQLConnection(String ip, String database,
            String user, String password) {
        try {
            // This will load the MySQL driver;
            // each DB has its own driver.
            Class.forName("com.mysql.jdbc.Driver");
            // Set up the connection with the DB.
            connect = DriverManager.getConnection("jdbc:mysql://" + ip + "/"
                    + database + "?" + "user=" + user + "&password=" + password);
            return connect;
        } catch (Exception e) {
            throw new RuntimeException(
                    "Error occurred while getting the MySQL connection", e);
        }
    }
}

Create the MySQLDump class. It has a parameterized constructor that takes the IP address, database name, username, and password of the MySQL server. It calls the getMySQLConnection(ip, database, user, password) method to obtain a MySQL connection, and its persistRecord(Tuple tuple) method persists the input tuples into MySQL:

import java.sql.Connection;
import java.sql.PreparedStatement;

import backtype.storm.tuple.Tuple;

/**
 * This class contains the logic to persist a record
 * into the MySQL database.
 */
public class MySQLDump {

    /** Name of the database you want to connect to */
    private String database;
    /** Name of the MySQL user */
    private String user;
    /** IP address of the MySQL server */
    private String ip;
    /** Password of the MySQL server */
    private String password;

    /** The MySQL connection */
    private Connection connect;

    private PreparedStatement preparedStatement = null;

    public MySQLDump(String ip, String database, String user, String password) {
        this.ip = ip;
        this.database = database;
        this.user = user;
        this.password = password;
        // Get the MySQL connection once the credentials are set.
        this.connect = MySQLConnection.getMySQLConnection(ip, database, user, password);
    }

    /**
     * Persist the input tuple.
     *
     * @param tuple
     */
    public void persistRecord(Tuple tuple) {
        try {
            // PreparedStatements can use variables and are more efficient
            preparedStatement = connect.prepareStatement(
                    "insert into apachelog values (default, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)");
            preparedStatement.setString(1, tuple.getStringByField("ip"));
            preparedStatement.setString(2, tuple.getStringByField("dateTime"));
            preparedStatement.setString(3, tuple.getStringByField("request"));
            preparedStatement.setString(4, tuple.getStringByField("response"));
            preparedStatement.setString(5, tuple.getStringByField("bytesSent"));
            preparedStatement.setString(6, tuple.getStringByField("referrer"));
            preparedStatement.setString(7, tuple.getStringByField("useragent"));
            preparedStatement.setString(8, tuple.getStringByField("country"));
            preparedStatement.setString(9, tuple.getStringByField("browser"));
            preparedStatement.setString(10, tuple.getStringByField("os"));
            preparedStatement.setString(11, tuple.getStringByField("keyword"));
            // Insert the record
            preparedStatement.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException("Error occurred while persisting records in MySQL", e);
        } finally {
            // close the prepared statement
            if (preparedStatement != null) {
                try {
                    preparedStatement.close();
                } catch (Exception exception) {
                    System.out.println("Error occurred while closing PreparedStatement:");
                }
            }
        }
    }

    public void close() {
        try {
            connect.close();
        } catch (Exception exception) {
            System.out.println("Error occurred while closing the connection");
        }
    }
}
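
Note that persistRecord(Tuple tuple) prepares and closes a new statement for every tuple while reusing a single connection per bolt instance. Preparing the insert statement once and reusing it, or batching inserts, would cut the per-tuple overhead at high tuple rates.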

Create the PersistenceBolt class. Its parameterized constructor also takes the same four arguments, and its execute() method calls the persistRecord(Tuple tuple) method of MySQLDump to persist each record into MySQL:

import java.util.Map;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.IBasicBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;

public class PersistenceBolt implements IBasicBolt {

    private MySQLDump mySQLDump = null;
    private static final long serialVersionUID = 1L;

    /** Name of the database you want to connect to */
    private String database;
    /** Name of the MySQL user */
    private String user;
    /** IP address of the MySQL server */
    private String ip;
    /** Password of the MySQL server */
    private String password;

    public PersistenceBolt(String ip, String database, String user, String password) {
        this.ip = ip;
        this.database = database;
        this.user = user;
        this.password = password;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }

    public void prepare(Map stormConf, TopologyContext context) {
        // create the instance of the MySQLDump class.
        mySQLDump = new MySQLDump(ip, database, user, password);
    }

    /**
     * This method calls the persistRecord(input) method
     * of the MySQLDump class to persist records into MySQL.
     */
    public void execute(Tuple input, BasicOutputCollector collector) {
        System.out.println("Input tuple : " + input);
        mySQLDump.persistRecord(input);
    }

    public void cleanup() {
        // Close the connection
        mySQLDump.close();
    }
}

Defining the topology and the Kafka spout:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class LogProcessingTopology {

    public static void main(String[] args) throws Exception {
        // zookeeper hosts for the Kafka cluster
        ZkHosts zkHosts = new ZkHosts("localhost:2181");

        // Create the KafkaSpout configuration.
        // The second argument is the topic name,
        // the third argument is the zookeeper root for Kafka,
        // and the fourth argument is the consumer group id.
        SpoutConfig kafkaConfig = new SpoutConfig(zkHosts, "apache_log", "", "id");

        // Specify that the Kafka messages are strings
        kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        // We want to consume all the messages in the topic from the
        // beginning every time we run the topology, to help in
        // debugging. In production, this property should be false.
        kafkaConfig.forceFromStart = true;

        // Now we create the topology
        TopologyBuilder builder = new TopologyBuilder();

        // set the kafka spout class
        builder.setSpout("KafkaSpout", new KafkaSpout(kafkaConfig), 1);

        // set the LogSplitter, IpToCountry, Keyword,
        // and PersistenceBolt bolt classes.
        builder.setBolt("LogSplitter", new ApacheLogSplitterBolt(), 1)
                .globalGrouping("KafkaSpout");
        builder.setBolt("IpToCountry",
                new UserInformationGetterBolt("./src/main/resources/GeoLiteCity.dat"), 1)
                .globalGrouping("LogSplitter");
        builder.setBolt("Keyword", new KeyWordIdentifierBolt(), 1)
                .globalGrouping("IpToCountry");
        builder.setBolt("PersistenceBolt",
                new PersistenceBolt("localhost", "apachelog", "root", "root"), 1)
                .globalGrouping("Keyword");

        if (args != null && args.length > 0) {
            // Run the topology on a remote cluster.
            Config conf = new Config();
            conf.setNumWorkers(4);
            try {
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } catch (AlreadyAliveException alreadyAliveException) {
                System.out.println(alreadyAliveException);
            } catch (InvalidTopologyException invalidTopologyException) {
                System.out.println(invalidTopologyException);
            }
        } else {
            // create an instance of the LocalCluster class
            // for executing the topology in local mode.
            LocalCluster cluster = new LocalCluster();
            Config conf = new Config();
            // Submit the topology for execution
            cluster.submitTopology("KafkaTopology", conf, builder.createTopology());
            try {
                // Wait for some time before exiting
                System.out.println("********************** Waiting to consume from kafka");
                Thread.sleep(10000);
            } catch (Exception exception) {
                System.out.println("****************** Thread interrupted exception : " + exception);
            }
            // kill the KafkaTopology
            cluster.killTopology("KafkaTopology");
            // shut down the storm test cluster
            cluster.shutdown();
        }
    }
}
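
Note that every bolt in this topology is wired with globalGrouping and a parallelism hint of 1, so all tuples funnel through a single task per bolt. That keeps the example deterministic and easy to debug, but for real load you would use shuffleGrouping (or fieldsGrouping on a key such as ip) with higher parallelism hints to spread the work across workers.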

Deploying the topology

1. Create the MySQL database and table:

create database apachelog;
use apachelog;
create table apachelog(
    id INT NOT NULL AUTO_INCREMENT,
    ip VARCHAR(100) NOT NULL,
    dateTime VARCHAR(200) NOT NULL,
    request VARCHAR(100) NOT NULL,
    response VARCHAR(200) NOT NULL,
    bytesSent VARCHAR(200) NOT NULL,
    referrer VARCHAR(500) NOT NULL,
    useragent VARCHAR(500) NOT NULL,
    country VARCHAR(200) NOT NULL,
    browser VARCHAR(200) NOT NULL,
    os VARCHAR(200) NOT NULL,
    keyword VARCHAR(200) NOT NULL,
    PRIMARY KEY (id)
);

2. Use the kafkaLogProducer project to send data to Kafka:

cd /opt/sts-bundle/workspace/kafkaLogProducer
mvn compile exec:java

3. Build the project: go to the project root directory and run:

cd /opt/sts-bundle/workspace/stormlogprocessing
mvn clean install -DskipTests

4. Start the topology:

java -cp target/stormlogprocessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar:$STORM_HOME/storm-core-0.9.0.1.jar:$STORM_HOME/lib/* \
  com.learningstorm.stormlogprocessing.LogProcessingTopology \
  /path/to/GeoLiteCity.dat localhost apachelog root 123

5. Check the results in MySQL:

select * from apachelog limit 2;