A Hadoop data pipeline to analyze application performance


1.  Introduction

In recent years, Hadoop has been under the spotlight for its flexible and scalable architecture for storing and processing big data on commodity machines. One of its common use cases is analyzing application log files, as the size of the log files generated by applications keeps increasing (volume) and the log files are often unstructured (variety).

In this project, we have built a data pipeline to analyze application performance, based on application performance data (appperfdata) extracted from log files and database performance data (db2perfdata) extracted from the DBAU database. XXX is used as the sample application to analyze, but the pipeline can be tailored to other applications as well.


In this sample use case, appperfdata is the duration of RESTful API calls. For example, from the appperfdata record below, we can tell how many milliseconds a RESTful API took to execute; in this case it took 283 milliseconds to complete the API "/XXX/webapp/maskingservice/needMasking/4964". By ordering the records by API duration, we can see which RESTful APIs are performing poorly and then optimize them accordingly.

2012-12-14-06-01        06:01:24.743    283    /XXX/webapp/maskingservice/needMasking/4964

On the other hand, from the db2perfdata record below, we can see how many select, read, update, insert, and delete operations were performed at a given time (currently collected at minute level).
2012-12-14-06-01,3038,281910,383,365,0
By correlating appperfdata and db2perfdata on the timestamp ('2012-12-14-06-01' in the example above), we can check whether poor API performance is caused by excessive database load.


2. System Design
2.1 Overview
A set of Hadoop ecosystem tools, as shown in Figure 2.1, is leveraged to build the data pipeline, which consists of four phases:
a)   Data Collection: Flume is used to collect both log files and db2perfdata and push them into Hadoop.
b)   Data Storage: HDFS is used to archive log files and HBase is used to store db2perfdata.
c)   Data Processing: Pig is used to do ETL on the log files, so that the unstructured logs can be converted into meaningful structured data. In addition, Oozie is used to trigger the Pig jobs based on time and data availability.
d)   Data Analysis/Reporting: By using Hive, we are able to issue SQL queries for analysis or reporting. One highlight is that we can join application performance metrics extracted from log files (stored in HDFS) with db2perfdata (stored in HBase).


Figure 2.1 Architecture of the data pipeline

2.2  Data Collection
2.2.1  Flume

Flume is one of the most commonly used tools for pushing data into Hadoop. It is configured in client-collector mode: two clients are triggered daily by an Autosys job, and their corresponding collectors run on a Hadoop VM as daemons. As shown in Figure 2.2, the data collection pipeline consists of the following parts (a configuration sketch follows the list):

  • Flume client source: one Flume client cats the daily Alcazar log file, while the other invokes the show_db2perfdata script provided by DBAU to get minute-level db2perfdata.
  • Flume client sink: both sinks are the built-in Avro sink.
  • Flume collector source: both sources are the built-in Avro source, which means clients and collectors communicate via the Avro protocol.
  • Flume collector sink: while the log file collector leverages the built-in HDFS sink, the db2perfdata collector uses the built-in HBase sink. However, since our sources have a custom format, a custom HBase serializer is needed.
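
As a rough illustration, the agent pair for the log file path could be configured along the lines below (Flume 1.x properties); the agent names, host name, port, paths, and the serializer class are illustrative, not the actual values used in the project:

# client agent: cat the daily log and forward it to the collector over Avro
client.sources = logSrc
client.channels = memCh
client.sinks = avroOut
client.sources.logSrc.type = exec
client.sources.logSrc.command = cat /path/to/alcazar/daily.log
client.sources.logSrc.channels = memCh
client.channels.memCh.type = memory
client.sinks.avroOut.type = avro
client.sinks.avroOut.hostname = hadoop-vm
client.sinks.avroOut.port = 4141
client.sinks.avroOut.channel = memCh

# collector agent: receive Avro events and archive them to HDFS by date
collector.sources = avroIn
collector.channels = memCh
collector.sinks = hdfsOut
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.channels = memCh
collector.channels.memCh.type = memory
collector.sinks.hdfsOut.type = hdfs
collector.sinks.hdfsOut.channel = memCh
collector.sinks.hdfsOut.hdfs.path = hdfs://namenode/path/to/alcazar/%Y/%m/%d
collector.sinks.hdfsOut.hdfs.fileType = DataStream
collector.sinks.hdfsOut.hdfs.useLocalTimeStamp = true

# the db2perfdata collector is analogous, but ends in an HBase sink, e.g.:
#   ...hbaseOut.type = hbase
#   ...hbaseOut.table = alcazarDbPerf
#   ...hbaseOut.columnFamily = f1
#   ...hbaseOut.serializer = com.example.flume.Db2PerfHbaseEventSerializer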
2.3 Data Storage
2.3.1  HDFS

Owing to the low cost and scalability of HDFS, it is well suited for archiving large volumes of raw log files. We use the path pattern "${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}" for the log archiving.

2.3.2  Hbase
HBase, a column-oriented key/value store, adds real-time read/write capability on top of HDFS. It is known to be good at storing time-series and sparse data, which makes it well suited for storing db2perfdata in this project: db2perfdata is both time-series (minute-level, even second-level) and sparse (at minute or second granularity there will be a large number of zero values, and it costs HBase nothing to 'store' them). Because of its schemaless nature, we can create an HBase table by specifying only a table name and a column family name:
create 'alcazarDbPerf', 'f1'
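
Conceptually, each minute of db2perfdata becomes one row keyed by its timestamp, with one column per counter. For the sample db2perfdata line in Section 1, the equivalent HBase shell commands would look roughly as follows (only the first two counters are shown; in the pipeline the Flume HBase serializer writes them automatically):

put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Sel', '3038'
put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Rd', '281910'
scan 'alcazarDbPerf', {LIMIT => 1}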

2.4  Data Processing
2.4.1  Oozie

As described in Section 2.2, to automate data collection we schedule a daily Autosys job to kick off the Flume clients. Similarly, to automate data processing, Oozie is used to schedule the Pig jobs.


Figure 2.3 Oozie Pipeline

As shown in Figure 2.3, the Oozie pipeline consists of three jobs:
a)  Coordinator job: it is both time-driven and data-driven. On one hand, it is configured to run daily at a certain time. On the other hand, if the log file directory "${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}" does not exist when the job starts, the job is blocked and Oozie keeps polling the directory, restarting the job once the data is available. A minimal coordinator definition is sketched after this list.
b)  Pig workflow job: it is launched by the coordinator job. It takes its input from the log file archive directory and writes the ETL output to a staging area.
c)  Fs workflow job: it is triggered after the Pig workflow job completes. It moves the ETL output from the staging area to a pre-defined archive directory.
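
The coordinator definition could look roughly like the sketch below; the start/end dates, dataset name, and application path are placeholders rather than the actual values used in the project:

<coordinator-app name="alcazar-etl-coord" frequency="${coord:days(1)}"
                 start="2012-12-01T06:00Z" end="2013-12-01T06:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="alcazarLogs" frequency="${coord:days(1)}"
             initial-instance="2012-12-01T06:00Z" timezone="UTC">
      <uri-template>${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- an empty done-flag means the directory itself signals data availability -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="dailyLogs" dataset="alcazarLogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- the workflow chains the Pig job and the fs job -->
      <app-path>${nameNode}/apps/alcazar-etl/workflow.xml</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('dailyLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>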

2.4.2  Pig
Pig is commonly used to build data processing pipelines on Hadoop. As shown in Figure 2.4, a Pig script is developed to perform four phases of processing, so that structured appperfdata can be extracted from the unstructured log files (a sketch of the script follows the list):
a)  Filtering: it filters out irrelevant log lines based on string matching. To some degree, it is similar to the grep utility.
b)  Field Extraction: it applies regular expressions to extract meaningful fields, such as date, time, url, etc. Two sets of fields are extracted, one for API start logs and the other for API end logs.
c)  Joining: it joins the two sets of fields on url and session id.
d)  Conversion: it is performed by a custom UDF (user-defined function), which calculates the duration and converts the date/time into a special format.
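
Since the raw log format is not reproduced here, the sketch below assumes an illustrative line layout with explicit API START/END markers and a session id; the regular expression, the parameters $inputDir and $stagingDir, and the UDF class com.example.logetl.ComputeDuration are all hypothetical:

-- Assumed raw log line layout (illustrative), one line per API start/end event:
--   2012-12-14 06:01:24.460 INFO API START sessionId=abc123 url=/XXX/webapp/maskingservice/needMasking/4964
raw = LOAD '$inputDir' USING TextLoader() AS (line:chararray);

-- a) Filtering: keep only API start/end lines, similar to grep
api_lines = FILTER raw BY (line MATCHES '.*API (START|END).*');

-- b) Field extraction with a regular expression matching the assumed layout
events = FOREACH api_lines GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,
    '^(\\S+) (\\S+) \\S+ API (START|END) sessionId=(\\S+) url=(\\S+)$'))
    AS (d:chararray, t:chararray, kind:chararray, session:chararray, url:chararray);
starts = FILTER events BY kind == 'START';
ends   = FILTER events BY kind == 'END';

-- c) Join the start and end events on url and session id
joined = JOIN starts BY (url, session), ends BY (url, session);

-- d) A custom UDF (hypothetical) computes the duration and reformats the date/time
DEFINE ComputeDuration com.example.logetl.ComputeDuration();
result = FOREACH joined GENERATE starts::d AS startDate, starts::t AS startTime,
         ComputeDuration(starts::t, ends::t) AS duration, starts::url AS url;

STORE result INTO '$stagingDir';

Note that the default PigStorage output is tab-delimited, which lines up with the tab-delimited Hive table defined in Section 2.5.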



2.5 Data Analysis/Reporting
2.5.1  Hive

Hive is known as a schema-on-read, SQL-capable system on Hadoop. Schema-on-read means a schema only needs to be specified when the data is read, so the data being queried can be stored in any format. In this project, appperfdata is stored as tab-delimited text, while db2perfdata is stored in an HBase-specific format.
For appperfdata, the DDL to create the table is:
CREATE EXTERNAL TABLE alcazarPerf (startDate STRING, startTime STRING, duration INT, url STRING )  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'     
LOCATION '/user/junz/logAnalysis/alcazar/logETLArchive';
For db2perfdata, the DDL to create the table is:
CREATE EXTERNAL TABLE hbase_alcazarDbPerf(time string, sel_count int, rd_count int, ins_count int, upd_count int, del_count int)                                                                                                                                    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'                                                                  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "f1:Sel,f1:Rd,f1:Ins,f1:Upd,f1:Del") TBLPROPERTIES("hbase.table.name" = "alcazarDbPerf");
After creating the tables in Hive, we are able to use SQL queries for analysis or reporting. For example, to order the application performance data by API duration, issue the query below:
select * from alcazarPerf order by duration desc;
To join appperfdata and db2perfdata, issue the query below:
select * from alcazarperf join hbase_alcazarDbPerf on alcazarperf.startDate = hbase_alcazarDbPerf.time;
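
Building on the join above, a slightly more targeted query can surface only the slow API calls together with the database counters recorded for the same minute; the 1000 ms threshold is purely illustrative:

select a.startDate, a.url, a.duration, d.sel_count, d.rd_count, d.ins_count, d.upd_count, d.del_count
from alcazarPerf a join hbase_alcazarDbPerf d on (a.startDate = d.time)
where a.duration > 1000
order by a.duration desc;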

