A Hadoop data pipeline to analyze application performance
1. Introduction
In recent years, Hadoop has been under the spotlight for its flexible and scalable architecture to store and process big data on commodity machines. One of its common use cases is to analyze application log files, as the size of log files generated by applications keeps increasing (volume) and log files are often unstructured (variety).
In this project, we built a data pipeline to analyze application performance, based on application performance data (appperfdata) extracted from log files and database performance data (db2perfdata) extracted from the DBAU database. XXX is used as the sample application to analyze, but the pipeline can be tailored to other applications as well.
In this sample use case, appperfdata is the duration of RESTful API calls. For example, from the appperfdata below, we can tell how many milliseconds a RESTful API took to execute. In this example, it took 283 milliseconds to complete the API “/XXX/webapp/maskingservice/needMasking/4964”. Therefore, by ordering the records by API duration, we can see which RESTful APIs perform poorly and then optimize them accordingly.
2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964
On the other hand, from the db2perfdata below, we can tell how many select, read, update, insert, and delete operations were performed at a certain time (currently it is collected at the minute level).
2012-12-14-06-01,3038,281910,383,365,0
By correlating appperfdata and db2perfdata on the timestamp (‘2012-12-14-06-01’ in the example above), we can verify whether poor API performance is caused by overwhelming database load.
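The correlation can be sketched in a few lines: parse each record into fields and join on the shared minute-level key. A minimal illustration in Python (field names are assumptions; the db2perfdata column order here follows the Hive DDL in section 2.5.1 and is also an assumption):

```python
# Hedged sketch: parse the two sample records above and correlate them
# on the minute-level timestamp. Field names and column order are assumed.
APP_LINE = "2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964"
DB_LINE = "2012-12-14-06-01,3038,281910,383,365,0"

def parse_app(line):
    # minute-level timestamp, wall-clock time, duration in ms, url
    ts, clock, duration, url = line.split(" ", 3)
    return {"ts": ts, "time": clock, "duration": int(duration), "url": url}

def parse_db(line):
    fields = line.split(",")
    keys = ("ts", "sel", "rd", "ins", "upd", "del")
    rec = dict(zip(keys, fields))
    for k in keys[1:]:
        rec[k] = int(rec[k])
    return rec

app, db = parse_app(APP_LINE), parse_db(DB_LINE)
if app["ts"] == db["ts"]:  # correlate on the shared minute-level key
    print(app["url"], app["duration"], db["sel"])
```

The same join, at scale, is what the Hive query in section 2.5.1 performs.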
2. System Design
2.1 Overview
A set of Hadoop ecosystem tools, as shown in Figure 2.1, are leveraged to build the data pipeline, which consists of four phases:
a) Data Collection: Flume is used to collect both log files and db2perfdata and push them into Hadoop.
b) Data Storage: HDFS is used to archive log files and HBase is used to store db2perfdata.
c) Data Processing: Pig is used to do ETL on the log files, so that the unstructured logs can be converted into meaningful structured data. In addition, Oozie is used to trigger Pig jobs based on time and data availability.
d) Data Analysis/Reporting: By using Hive, we are able to issue SQL queries for analysis or reporting. One highlight is that we are able to join application performance metrics extracted from log files (stored in HDFS) with db2perfdata (stored in HBase).
Figure 2.1 Architecture of the data pipeline
2.2 Data Collection
2.2.1 Flume
Flume is one of the most commonly used tools to push data into Hadoop. It is configured in client-collector mode: two clients are triggered daily by an Autosys job, and their corresponding collectors run as daemons on a Hadoop VM. As shown in Figure 2.2, the data collection pipeline consists of the following parts:
- Flume client source: one Flume client “cat”s the daily Alcazar log file, while the other invokes the show_db2perfdata script provided by DBAU to get minute-level db2perfdata.
- Flume client sink: both sinks are the built-in Avro sink.
- Flume collector source: both sources are the built-in Avro source, which means clients and collectors communicate via the Avro protocol.
- Flume collector sink: the log file collector leverages the built-in hdfs-sink, while the db2perfdata collector utilizes the built-in hbase-sink. However, since our sources have a custom format, a custom HBase serializer is needed.
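As a concrete illustration, the collector side of the log-file flow might be configured as follows. This is a hedged sketch: the agent and component names, port, and channel settings are assumptions; only the Avro source and HDFS sink types come from the text.

```properties
# Collector agent for the log-file flow (names and port are assumptions)
collector1.sources = avroSrc
collector1.channels = memCh
collector1.sinks = hdfsSink

collector1.sources.avroSrc.type = avro
collector1.sources.avroSrc.bind = 0.0.0.0
collector1.sources.avroSrc.port = 4141
collector1.sources.avroSrc.channels = memCh

collector1.channels.memCh.type = memory

collector1.sinks.hdfsSink.type = hdfs
collector1.sinks.hdfsSink.channel = memCh
# Flume expands %Y/%m/%d from event timestamps, matching the
# year/month/day archive layout described in 2.3.1
collector1.sinks.hdfsSink.hdfs.path = ${nameNode}${alcazar_path}/%Y/%m/%d
collector1.sinks.hdfsSink.hdfs.fileType = DataStream
```

The db2perfdata collector would look similar, with the hdfs sink replaced by an hbase sink referencing the custom serializer class.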
2.3 Data Storage
2.3.1 HDFS
Owing to the cheapness and scalability of HDFS, it is well suited for archiving large numbers of raw log files. We use the path pattern “${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}” for log archiving.
2.3.2 HBase
HBase, a column-oriented key/value store, adds real-time read/write capability on top of HDFS. It is known to be good at storing time-series and sparse data, which makes it well suited for db2perfdata in this project: the data is time-series (minute-level, even second-level) and sparse (at minute or second granularity there will be a large number of zero values, and it costs HBase nothing to ‘store’ them). Because of its schemaless nature, we can create an HBase table by specifying only the table name and a column family name:
create 'alcazarDbPerf', 'f1'
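For instance, one minute of db2perfdata can be stored as a single row keyed by the minute-level timestamp, with one counter per column. This is a hedged illustration: the column qualifier names are assumptions. Note that zero-valued counters can simply be omitted, which is exactly why the sparse storage is free.

```
put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:sel_count', '3038'
put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:rd_count', '281910'
scan 'alcazarDbPerf', {LIMIT => 1}
```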
2.4 Data Processing
2.4.1 Oozie
As described in 2.2, to automate data collection, we schedule a daily Autosys job to kick off the Flume clients. Similarly, to automate data processing, Oozie is used to schedule Pig jobs.
As shown in Figure 2.3, the Oozie pipeline consists of three jobs:
a) Coordinator job: it is both time-driven and data-driven. On one hand, it is configured to run daily at a certain time. On the other hand, if the log file directory “${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}” does not exist when the job starts, the job blocks, and Oozie polls the directory and restarts the job once the data is available.
b) Pig workflow job: it is launched by the coordinator job. It takes its input from the log file archive directory and writes the ETL output to a staging area.
c) Fs workflow job: it is triggered after the Pig workflow job completes. It moves the ETL output from the staging area to a pre-defined archive directory.
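A hedged sketch of such a coordinator definition follows; the application name, frequency expressions, and property names are assumptions, while the dataset URI template follows the pattern quoted above.

```xml
<coordinator-app name="alcazar-etl-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- The daily log archive directory; the job blocks until it exists -->
    <dataset name="alcazarLogs" frequency="${coord:days(1)}"
             initial-instance="${startTime}" timezone="UTC">
      <uri-template>${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="alcazarLogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```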
2.4.2 Pig
Pig is commonly used to build data processing pipelines on Hadoop. As shown in Figure 2.4, a Pig script is developed to perform four phases of processing, so that structured appperfdata can be extracted from the unstructured log files:
a) Filtering: it filters out irrelevant logs based on particular string matching. To some degree, it is similar to the grep utility.
b) Field Extraction: it applies regular expressions to extract meaningful fields, such as date, time, url, etc. Two sets of fields are extracted, one for API start logs, the other for API end logs.
c) Joining: it joins the two sets of fields on url and session id.
d) Conversion: it is performed by a custom UDF (user-defined function), which calculates the duration and converts the date-time into a specific format.
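The four phases above might look roughly like this in Pig Latin. This is a hedged sketch: the regexes are placeholders for the real patterns (which are not shown here), and the field layout, jar name, and UDF name are assumptions.

```pig
REGISTER alcazar-udfs.jar;  -- jar holding the custom duration UDF (name assumed)
raw    = LOAD '$input' USING TextLoader() AS (line:chararray);
-- a) Filtering: grep-like string matching to drop irrelevant log lines
apiLog = FILTER raw BY line MATCHES '.*webapp.*';
-- b) Field extraction with regular expressions, one set per log type
starts = FOREACH apiLog GENERATE
           FLATTEN(REGEX_EXTRACT_ALL(line, '<start-line-regex>'))
           AS (sDate:chararray, sTime:chararray, sessionId:chararray, url:chararray);
ends   = FOREACH apiLog GENERATE
           FLATTEN(REGEX_EXTRACT_ALL(line, '<end-line-regex>'))
           AS (eDate:chararray, eTime:chararray, sessionId:chararray, url:chararray);
-- c) Joining start and end records on url and session id
joined = JOIN starts BY (url, sessionId), ends BY (url, sessionId);
-- d) Conversion: the custom UDF computes the duration in milliseconds
result = FOREACH joined GENERATE
           starts::sDate AS startDate, starts::sTime AS startTime,
           com.example.DurationMs(starts::sTime, ends::eTime) AS duration,
           starts::url AS url;
STORE result INTO '$staging' USING PigStorage('\t');
```

The tab-delimited output matches the row format declared in the Hive DDL in section 2.5.1.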
2.5 Data Analysis/Reporting
2.5.1 Hive
Hive is known as a schema-on-read, SQL-capable system on Hadoop. Schema-on-read means we only need to specify a schema when reading the data, so the data being queried can be stored in any format. In this project, appperfdata is stored in a structured text format, while db2perfdata is stored in HBase's own format.
For appperfdata, the DDL to create the table is:
CREATE EXTERNAL TABLE alcazarPerf (startDate STRING, startTime STRING, duration INT, url STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
For db2perfdata, the DDL to create the table is:
CREATE EXTERNAL TABLE hbase_alcazarDbPerf(time string, sel_count int, rd_count int, ins_count int, upd_count int, del_count int)
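The db2perfdata DDL above is abbreviated: an HBase-backed external table also needs the HBase storage handler and a column mapping. A fuller version might look as follows; the qualifier names under family 'f1' are assumptions.

```sql
CREATE EXTERNAL TABLE hbase_alcazarDbPerf(
  time string, sel_count int, rd_count int,
  ins_count int, upd_count int, del_count int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,f1:sel_count,f1:rd_count,f1:ins_count,f1:upd_count,f1:del_count")
TBLPROPERTIES ("hbase.table.name" = "alcazarDbPerf");
```

The `:key` entry maps the HBase row key (the minute-level timestamp) to the `time` column, which is what makes the join with appperfdata possible.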
After creating the tables in Hive, we are able to use SQL queries for analysis or reporting. For example, to order application performance data by API duration, issue the query below:
select * from alcazarPerf order by duration desc;
To join appperfdata and db2perfdata, issue the query below:
select * from alcazarperf join hbase_alcazarDbPerf on alcazarperf.startDate = hbase_alcazarDbPerf.time;
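Building on that join, a report query might surface the minutes in which slow API calls coincide with heavy database load; the 1000 ms threshold here is an arbitrary assumption.

```sql
SELECT p.startDate, p.url, p.duration, d.sel_count, d.rd_count
FROM alcazarPerf p
JOIN hbase_alcazarDbPerf d ON p.startDate = d.time
WHERE p.duration > 1000
ORDER BY p.duration DESC;
```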