Breakout detection in the wild
Nowadays, BigData is leveraged in every sphere of business: decision making for new products, gauging user engagement, making recommendations for products, health care, data center efficiency and more.
A common form of BigData is time series data. As the cost of collecting and mining large data sets keeps falling, it has become increasingly common for companies, including Twitter, to collect millions of metrics on a daily basis [1, 2, 3].
Exogenic and/or endogenic factors often give rise to breakouts in a time series. Breakouts can potentially have ramifications on the user experience and/or on a business’ bottom line. For example, in the context of cloud infrastructure, breakouts in time series data of system metrics, which may happen due to hardware issues, could impact the availability and performance of a service.
Given the real-time nature of Twitter, and that high performance is key for delivering the best experience to our users, early detection of breakouts is of paramount importance. Breakout detection has also been used to detect change in user engagement during popular live events such as the Oscars, Super Bowl and World Cup.
A breakout is typically characterized by two steady states and an intermediate transition period. Broadly speaking, breakouts have two flavors:
- Mean shift: A sudden jump in the time series corresponds to a mean shift. A sudden jump in CPU utilization from 40% to 60% would exemplify a mean shift.
- Ramp up: A gradual increase in the value of the metric from one steady state to another constitutes a ramp up. A gradual increase in CPU utilization from 40% to 60% would exemplify a ramp up.
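To make the two flavors concrete, here is a minimal base R sketch that generates one synthetic series of each kind. The series length, noise level, and transition points are made-up values for illustration only, not Twitter data.

```r
# Synthetic illustration of the two breakout flavors (all values are made up).
set.seed(42)                        # reproducible noise
n <- 200                            # hypothetical series length
noise <- rnorm(n, mean = 0, sd = 1)

# Mean shift: the level jumps from 40 to 60 at observation 101.
mean_shift <- c(rep(40, 100), rep(60, 100)) + noise

# Ramp up: the level stays at 40, climbs linearly to 60 over 50 points,
# then stays at 60.
ramp_level <- c(rep(40, 75), seq(40, 60, length.out = 50), rep(60, 75))
ramp_up <- ramp_level + noise

# Compare the two steady states on either side of the transition.
round(c(before = median(mean_shift[1:100]), after = median(mean_shift[101:200])))
```

Plotting both vectors with plot(mean_shift, type = "l") and plot(ramp_up, type = "l") shows the same two steady states joined by an abrupt and a gradual transition, respectively.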
The figure below illustrates multiple mean shifts in real data.
Given the ever-growing number of metrics being collected, it’s imperative to automatically detect breakouts. Although a large body of research already exists on breakout detection, existing techniques are not suitable for detecting breakouts in cloud data. This can be ascribed to the fact that existing techniques are not robust in the presence of anomalies (which are not uncommon in cloud data).
Today, we’re excited to announce the release of BreakoutDetection, an open-source R package that makes breakout detection simple and fast. With its release, we hope that the community can benefit from the package as we have at Twitter and improve it over time.
Our main motivation behind creating the package has been to develop a technique for detecting breakouts that is robust, from a statistical standpoint, in the presence of anomalies. The BreakoutDetection package can be used in a wide variety of contexts: for example, detecting breakouts in user engagement after an A/B test, detecting behavioral change, or tackling problems in econometrics, financial engineering, and the political and social sciences.
How the package works
The underlying algorithm, referred to as E-Divisive with Medians (EDM), employs energy statistics to detect divergence in mean. Note that EDM can also be used to detect change in distribution in a given time series. EDM uses robust statistical metrics, viz., the median, and estimates the statistical significance of a breakout through a permutation test.
In addition, EDM is non-parametric. This is important since the distribution of production data seldom (if at all) follows the commonly assumed normal distribution or any other widely accepted model. Our experience has been that time series often contain more than one breakout. To this end, the package can also be used to detect multiple breakouts in a given time series.
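The two ideas described in this section, a median-based divergence measure and a permutation test for significance, can be sketched in base R for the simplest case of a single breakout. This is a simplified illustration and not the package's implementation: the function names (median_divergence, best_split, breakout_sketch), the scaling factor, and the defaults are all assumptions, and the real algorithm handles multiple breakouts and is far more efficient.

```r
# Simplified, single-breakout sketch in base R. NOT the package's implementation:
# function names, scaling and defaults below are assumptions for illustration.

# Energy-style divergence between samples a and b, using medians of pairwise
# absolute differences instead of means so that a few anomalies have little effect.
median_divergence <- function(a, b) {
  between  <- median(abs(outer(a, b, "-")))
  within_a <- median(as.numeric(dist(a)))
  within_b <- median(as.numeric(dist(b)))
  2 * between - within_a - within_b
}

# Scan every admissible split point and keep the one with the largest
# scaled divergence.
best_split <- function(x, min_size = 5) {
  n <- length(x)
  taus <- min_size:(n - min_size)          # last index of the left segment
  stats <- vapply(taus, function(tau) {
    left  <- x[1:tau]
    right <- x[(tau + 1):n]
    (tau * (n - tau) / n) * median_divergence(left, right)
  }, numeric(1))
  list(loc = taus[which.max(stats)], stat = max(stats))
}

# Permutation test: shuffling destroys any ordering, so the shuffled statistics
# approximate what the best split looks like when there is no breakout.
breakout_sketch <- function(x, min_size = 5, nperm = 99) {
  observed <- best_split(x, min_size)
  shuffled <- replicate(nperm, best_split(sample(x), min_size)$stat)
  pval <- (1 + sum(shuffled >= observed$stat)) / (nperm + 1)
  c(observed, list(pval = pval))
}

set.seed(1)
x <- c(rnorm(30, mean = 40), rnorm(30, mean = 60))  # mean shift after observation 30
x[10] <- 100                                        # one injected anomaly
result <- breakout_sketch(x)
result$loc   # estimated last index before the breakout
result$pval  # small values suggest the breakout is statistically significant
```

Because every quantity is a median of pairwise distances, the single injected anomaly barely moves the statistic. No distributional assumption is made anywhere, which is the sense in which an approach of this kind is non-parametric.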
How to get started
Install the R package using the following commands on the R console:
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)
The function breakout detects one or more statistically significant breakouts in the input time series. Its documentation, which can be viewed with the following command, details the input arguments and the output.
help(breakout)
A simple example
To get started, we recommend using the example dataset that comes with the package. Execute the following commands:
data(Scribe)
res = breakout(Scribe, min.size=24, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot
The above yields the following plot:
From the above plot, we observe that the input time series experiences breakouts and also has quite a few anomalies. The two red vertical lines denote the locations of the breakouts detected by the EDM algorithm. Unlike the existing approaches mentioned earlier, EDM is robust in the presence of anomalies. The change in mean in the time series can be better viewed with the following annotated plot:
The horizontal lines in the annotated plot above correspond to the approximate (i.e., filtering out the effect of anomalies) mean for each window.
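A per-window level of this kind can be approximated by taking the median of each segment between breakout locations. The sketch below uses synthetic data and a hypothetical helper, segment_medians, with made-up breakout positions. How the package exposes the detected locations is described in help(breakout), and the annotated plot may have been produced differently.

```r
# Base R sketch; `locs` holds hypothetical breakout positions, each being the
# last index of the segment that precedes a breakout.
segment_medians <- function(x, locs) {
  bounds <- unique(c(0, sort(locs), length(x)))
  segment <- cut(seq_along(x), breaks = bounds, labels = FALSE)
  tapply(x, segment, median)
}

set.seed(7)
x <- c(rnorm(50, mean = 40), rnorm(50, mean = 60))  # synthetic mean shift
x[c(10, 70)] <- 150                                 # two injected anomalies

med <- segment_medians(x, locs = 50)
avg <- tapply(x, rep(1:2, each = 50), mean)
rbind(median = med, mean = avg)  # medians stay near 40 and 60, means are pulled upward
```

Using the median rather than the mean is what filters out the effect of the anomalies: a single extreme point shifts a segment's mean noticeably but leaves its median almost unchanged.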
Acknowledgements
We thank James Tsiamis and Scott Wong for their support, and Nicholas James, the primary researcher behind this work.