Week 9 Lecture Notes
Source: Internet · Published: Java web interview questions · Editor: 程序博客网 · Time: 2024/05/16 17:04
from https://www.coursera.org/learn/machine-learning
ML:Anomaly Detection
Problem Motivation
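Just like in other learning problems, we are given a dataset $x^{(1)}, x^{(2)}, \dots, x^{(m)}$.
We are then given a new example, $x_{test}$, and we want to know whether this new example is abnormal/anomalous.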
We define a "model" p(x) that tells us the probability the example is not anomalous. We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not.
A very common application of anomaly detection is detecting fraud:
- x(i) = features of user i's activities.
- Model p(x) from the data.
- Identify unusual users by checking which have p(x)<ϵ.
If our anomaly detector is flagging too many anomalous examples, then we need to decrease our threshold ϵ.
Gaussian Distribution
The Gaussian Distribution is a familiar bell-shaped curve that can be described by a function $\mathcal{N}(\mu, \sigma^2)$.
Let x∈ℝ. If the probability distribution of x is Gaussian with mean μ and variance σ², then $x \sim \mathcal{N}(\mu, \sigma^2)$.
The little ∼ or 'tilde' can be read as "distributed as."
The Gaussian Distribution is parameterized by a mean and a variance.
Mu, or μ, describes the center of the curve, called the mean. The width of the curve is described by sigma, or σ, called the standard deviation.
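The full function is as follows:
$$p(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
We can estimate the parameter μ from a given dataset by simply taking the average of all the examples:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$$
We can estimate the other parameter, σ², with our familiar squared error formula:
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$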
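The parameter estimates and the density described above can be sketched in a few lines of Python. This is only an illustration: the function names `estimate_gaussian` and `gaussian_pdf` and the sample values are assumptions made for this example, not part of the course material.

```python
import math

def estimate_gaussian(values):
    """Return (mu, sigma2) using the 1/m estimates of the mean and variance."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((v - mu) ** 2 for v in values) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Illustrative one-dimensional dataset (made up for this sketch).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = estimate_gaussian(samples)
density_at_mean = gaussian_pdf(mu, mu, sigma2)
print(mu, sigma2, density_at_mean)
```

The density is largest at x = μ and falls off symmetrically on both sides, at a rate controlled by σ.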
Algorithm
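Given a training set of examples, $\{x^{(1)}, \dots, x^{(m)}\}$, where each example is a vector $x \in \mathbb{R}^n$, we model
$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$
In statistics, this is called an "independence assumption" on the values of the features inside training example x.
More compactly, the above expression can be written as follows:
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$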
The algorithm
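Choose features $x_i$ that you think might be indicative of anomalous examples.
Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:
Calculate $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$
Calculate $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$
Given a new example x, compute p(x):
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
Anomaly if p(x)<ϵ
A vectorized version of the calculation for μ is $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, where each $x^{(i)}$ is a full feature vector. σ² can be vectorized in the same way.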
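The steps of the algorithm can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names, the tiny two-feature training set, and the threshold value `epsilon = 1e-3` are all assumptions chosen for the example.

```python
import math

def fit_parameters(X):
    """Per-feature mean and variance (1/m estimates) for a list of feature rows."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def p(x, mu, sigma2):
    """Product of univariate Gaussian densities, one factor per feature."""
    prob = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        prob *= math.exp(-((xj - mj) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
    return prob

# Illustrative training set of normal examples (two features each).
X_train = [[1.0, 10.0], [1.2, 9.5], [0.9, 10.5], [1.1, 10.2], [0.8, 9.8]]
mu, sigma2 = fit_parameters(X_train)

epsilon = 1e-3  # hypothetical threshold; in practice it is tuned on labeled data

def is_anomaly(x):
    """Flag x as anomalous when its modeled probability falls below epsilon."""
    return p(x, mu, sigma2) < epsilon

print(is_anomaly([1.0, 10.0]), is_anomaly([3.0, 10.0]))
```

An example close to the training data gets a high p(x), while one whose first feature is far from its mean gets a vanishingly small p(x) and is flagged.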
Developing and Evaluating an Anomaly Detection System
To evaluate our learning algorithm, we take some labeled data, categorized into anomalous and non-anomalous examples ( y = 0 if normal, y = 1 if anomalous).
Among that data, take a large proportion of good, non-anomalous data for the training set on which to train p(x).
Then, take a smaller proportion of mixed anomalous and non-anomalous examples (you will usually have many more non-anomalous examples) for your cross-validation and test sets.
For example, we may have a set where 0.2% of the data is anomalous. We take 60% of the examples, all of which are good (y=0), for the training set. We then take 20% of the examples for the cross-validation set (including half of the anomalous examples, i.e. 0.1% of the data) and the remaining 20% for the test set (including the other half of the anomalous examples).
In other words, we split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets.
Algorithm evaluation:
Fit model p(x) on training set
On a cross validation/test example x, predict:
If p(x) < ϵ (anomaly), then y=1
If p(x) ≥ ϵ (normal), then y=0
Possible evaluation metrics (see "Machine Learning System Design" section):
- True positive, false positive, false negative, true negative.
- Precision/recall
- F1 score
Note that we use the cross-validation set to choose parameter ϵ.
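One way to choose ϵ on the cross-validation set is to scan candidate thresholds and keep the one with the best F1 score. The sketch below assumes that p(x) has already been computed for every cross-validation example; the function names, the number of scan steps, and the toy values are assumptions made for this illustration.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks an anomaly."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def select_threshold(p_cv, y_cv, steps=1000):
    """Scan thresholds between min and max of p_cv; return (best_eps, best_f1)."""
    lo, hi = min(p_cv), max(p_cv)
    best_eps, best_f1 = 0.0, 0.0
    for k in range(1, steps + 1):
        eps = lo + (hi - lo) * k / steps
        preds = [1 if pv < eps else 0 for pv in p_cv]
        score = f1_score(y_cv, preds)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Toy cross-validation set: p(x) values and labels (made up for this sketch).
p_cv = [0.30, 0.25, 0.28, 0.001, 0.31, 0.002, 0.27]
y_cv = [0, 0, 0, 1, 0, 1, 0]
best_eps, best_f1 = select_threshold(p_cv, y_cv)
print(best_eps, best_f1)
```

F1 is used here rather than accuracy because the classes are heavily skewed: a detector that never flags anything would still score high accuracy.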
Anomaly Detection vs. Supervised Learning
When do we use anomaly detection and when do we use supervised learning?
Use anomaly detection when...
- We have a very small number of positive examples (y=1 ... 0-20 examples is common) and a large number of negative (y=0) examples.
- We have many different "types" of anomalies and it is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.
Use supervised learning when...
- We have a large number of both positive and negative examples. In other words, the training set is more evenly divided into classes.
- We have enough positive examples for the algorithm to get a sense of what new positive examples look like. The future positive examples are likely similar to the ones in the training set.
Choosing What Features to Use
The features will greatly affect how well your anomaly detection algorithm works.
We can check that our features are gaussian by plotting a histogram of our data and checking for the bell-shaped curve.
Some transforms we can try on an example feature x that does not have the bell-shaped curve are:
- log(x)
- log(x+1)
- log(x+c) for some constant c
- √x
- x^(1/3)
We can play with each of these to try and achieve the gaussian shape in our data.
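As a rough numerical complement to the histogram check, the sample skewness of a feature can be compared before and after each transform (a value near 0 suggests a more symmetric, bell-like shape). This is only a sketch: the synthetic log-normal data, the seed, and the `skewness` helper are assumptions made for the example.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by variance^(3/2)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    third = sum((v - mean) ** 3 for v in values) / m
    return third / var ** 1.5

random.seed(0)
# Synthetic right-skewed, strictly positive feature (assumed for illustration).
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

transformed = {
    "log(x)": [math.log(v) for v in raw],
    "log(x+1)": [math.log(v + 1.0) for v in raw],
    "sqrt(x)": [math.sqrt(v) for v in raw],
    "x^(1/3)": [v ** (1.0 / 3.0) for v in raw],
}

skew_raw = skewness(raw)
skews = {name: skewness(vals) for name, vals in transformed.items()}
print("raw", skew_raw)
for name, value in skews.items():
    print(name, value)
```

For this particular synthetic feature, log(x) is the natural choice because the data was generated as the exponential of a Gaussian; on real data, the best transform has to be found by trying several.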
There is an error analysis procedure for anomaly detection that is very similar to the one in supervised learning.
Our goal is for p(x) to be large for normal examples and small for anomalous examples.
One common problem is when p(x) is similar for both types of examples. In this case, you need to examine the anomalous examples that are giving high probability in detail and try to figure out new features that will better distinguish the data.
In general, choose features that might take on unusually large or small values in the event of an anomaly.
Multivariate Gaussian Distribution (Optional)
The multivariate gaussian distribution is an extension of anomaly detection and may (or may not) catch more anomalies.
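Instead of modeling $p(x_1), p(x_2), \dots$ separately, we will model p(x) all in one go. Our parameters will now be $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$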
The important effect is that we can model oblong gaussian contours, allowing us to better fit data that might not fit into the normal circular contours.
Varying Σ changes the shape, width, and orientation of the contours. Changing μ will move the center of the distribution.
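To make the effect of Σ concrete, here is a small sketch of the multivariate Gaussian density restricted to two features, so that the determinant and inverse of Σ can be written out by hand. The function name `mvn_pdf_2d` and the example covariance values are assumptions for this illustration.

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    """Density of a 2-D Gaussian; Sigma is a 2x2 positive-definite matrix."""
    a, b = Sigma[0]
    c, d = Sigma[1]
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    # For n = 2, (2*pi)^(n/2) * |Sigma|^(1/2) = 2*pi*sqrt(det).
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

mu = [0.0, 0.0]
Sigma_corr = [[1.0, 0.8], [0.8, 1.0]]      # positively correlated features
p_along = mvn_pdf_2d([1.0, 1.0], mu, Sigma_corr)     # follows the correlation
p_against = mvn_pdf_2d([1.0, -1.0], mu, Sigma_corr)  # goes against it
p_identity = mvn_pdf_2d([0.0, 0.0], mu, [[1.0, 0.0], [0.0, 1.0]])
print(p_along, p_against, p_identity)
```

Both test points are equally far from μ, yet the one that goes against the correlation gets a much lower density. This is the oblong contour at work.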
Check also:
- The Multivariate Gaussian Distribution, http://cs229.stanford.edu/section/gaussians.pdf, Chuong B. Do, October 10, 2008.
Anomaly Detection using the Multivariate Gaussian Distribution (Optional)
When doing anomaly detection with multivariate gaussian distribution, we compute μ and Σ normally. We then compute p(x) using the new formula in the previous section and flag an anomaly if p(x) < ϵ.
The original model for p(x) corresponds to a multivariate Gaussian where the contours of $p(x; \mu, \Sigma)$ are axis-aligned, i.e. where Σ is a diagonal matrix.
The multivariate Gaussian model can automatically capture correlations between different features of x.
However, the original model maintains some advantages: it is computationally cheaper (there is no matrix to invert, which is costly for a large number of features), and it performs well even with a small training set (in the multivariate Gaussian model, the number of training examples must be greater than the number of features for Σ to be invertible).
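The sketch below fits μ and Σ for the multivariate model, assuming the usual 1/m estimates (the sample mean and the averaged outer product of the centered examples). The data and the name `fit_multivariate` are made up for the illustration.

```python
import math

def fit_multivariate(X):
    """Return (mu, Sigma): sample mean and 1/m covariance matrix of the rows of X."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    Sigma = [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in X) / m
              for b in range(n)]
             for a in range(n)]
    return mu, Sigma

# Two strongly correlated features (illustrative values).
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]
mu, Sigma = fit_multivariate(X)

# The diagonal holds the per-feature variances used by the original model;
# the off-diagonal entry is what lets the multivariate model see correlation.
correlation = Sigma[0][1] / math.sqrt(Sigma[0][0] * Sigma[1][1])
print(mu, Sigma, correlation)
```

Setting the off-diagonal entries of Σ to zero recovers the original model, which treats the two features as independent and would miss this correlation.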
ML:Recommender Systems
Problem Formulation
Recommendation is currently a very popular application of machine learning.
Say we are trying to recommend movies to customers. We can use the following definitions:
- $n_u$ = number of users
- $n_m$ = number of movies
- $r(i,j) = 1$ if user j has rated movie i
- $y^{(i,j)}$ = rating given by user j to movie i (defined only if $r(i,j) = 1$)
Content Based Recommendations
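We can introduce two features, $x_1$ and $x_2$, which represent how much romance or how much action a movie may have (on a scale of 0 to 1).
One approach is that we could do linear regression for every single user. For each user j, learn a parameter $\theta^{(j)} \in \mathbb{R}^3$. Predict user j as rating movie i with $(\theta^{(j)})^T x^{(i)}$ stars.
θ(j) = parameter vector for user j
x(i) = feature vector for movie i
For user j, movie i, predicted rating: $(\theta^{(j)})^T x^{(i)}$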
m(j)= number of movies rated by user j
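To learn $\theta^{(j)}$, we do the following:
$$\min_{\theta^{(j)}} \frac{1}{2}\sum_{i:r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$
This is our familiar linear regression. The base of the first summation is choosing all i such that $r(i,j) = 1$.
To get the parameters for all our users, we do the following:
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$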
We can apply our linear regression gradient descent update using the above cost function.
The only real difference is that we eliminate the constant $\frac{1}{m}$.
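A minimal sketch of this per-user linear regression, trained with batch gradient descent, is given below. The movie features, the ratings, and the hyperparameters (`alpha`, `iters`, and `lam = 0.0` for this tiny example) are assumptions chosen for illustration; each feature vector starts with the bias feature x0 = 1, and the bias parameter is not regularized.

```python
def predict(theta, x):
    """Predicted rating: inner product of user parameters and movie features."""
    return sum(t * v for t, v in zip(theta, x))

def fit_user(features, ratings, lam=0.0, alpha=0.1, iters=5000):
    """Learn theta for one user from the movies that user has rated.

    features: dict movie_id -> [1.0, x1, x2] (x0 = 1 is the bias feature)
    ratings:  dict movie_id -> rating, only for movies with r(i,j) = 1
    """
    n = len(next(iter(features.values())))
    theta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for i, y in ratings.items():
            err = predict(theta, features[i]) - y
            for k in range(n):
                grad[k] += err * features[i][k]
        for k in range(1, n):          # theta_0 is not regularized
            grad[k] += lam * theta[k]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical movies: [bias, romance, action].
features = {
    0: [1.0, 0.9, 0.0],
    1: [1.0, 1.0, 0.1],
    2: [1.0, 0.1, 1.0],
    3: [1.0, 0.0, 0.9],
    4: [1.0, 0.95, 0.05],   # not rated by this user
}
ratings = {0: 5.0, 1: 5.0, 2: 0.0, 3: 0.0}   # this user likes romance

theta = fit_user(features, ratings)
predicted = predict(theta, features[4])
print(theta, predicted)
```

The unrated movie is romance-heavy, so the learned parameters predict a rating close to the ones this user gave to similar movies.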
Collaborative Filtering
It can be very difficult to find features such as "amount of romance" or "amount of action" in a movie. To figure this out, we can use feature finders.
We can let the users tell us how much they like the different genres, providing their parameter vector immediately for us.
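To infer the features from given parameters, we use the squared error function with regularization over all the users:
$$\min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n} \left(x_k^{(i)}\right)^2$$
You can also start from a random guess for theta, use it to estimate the features, use those features to re-estimate theta, and repeat. This back-and-forth will actually converge to a good set of features.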
Collaborative Filtering Algorithm
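To speed things up, we can minimize over our features and our parameters simultaneously:
$$J(x, \theta) = \frac{1}{2}\sum_{(i,j):r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n} \left(x_k^{(i)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$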
It looks very complicated, but we've only combined the cost function for theta and the cost function for x.
Because the algorithm can learn them itself, the bias units where x0=1 have been removed; therefore $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$.
These are the steps in the algorithm:
- Initialize $x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}$ to small random values. This serves to break symmetry and ensures that the algorithm learns features $x^{(1)},\dots,x^{(n_m)}$ that are different from each other.
- Minimize $J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j=1,\dots,n_u$, $i=1,\dots,n_m$:
$$x_k^{(i)} := x_k^{(i)} - \alpha\left(\sum_{j:r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)\theta_k^{(j)} + \lambda x_k^{(i)}\right)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)x_k^{(i)} + \lambda \theta_k^{(j)}\right)$$
- For a user with parameters θ and a movie with (learned) features x, predict a star rating of $\theta^T x$.
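The three steps can be sketched as plain batch gradient descent that updates features and parameters together. This is an illustrative sketch only: the ratings matrix, the mask `R`, the number of features, and the hyperparameters `lam`, `alpha` and `iters` are all assumptions made for the example.

```python
import random

def cofi_cost(Y, R, X, Theta, lam):
    """Regularized squared-error cost J, summed over rated entries only."""
    err = 0.0
    for i, x in enumerate(X):
        for j, theta in enumerate(Theta):
            if R[i][j] == 1:
                err += (sum(t * v for t, v in zip(theta, x)) - Y[i][j]) ** 2
    reg = sum(v * v for x in X for v in x) + sum(t * t for th in Theta for t in th)
    return 0.5 * err + 0.5 * lam * reg

def collaborative_filtering(Y, R, n_features=2, lam=0.1, alpha=0.01, iters=3000, seed=0):
    """Learn movie features X and user parameters Theta simultaneously."""
    rng = random.Random(seed)
    n_m, n_u = len(Y), len(Y[0])
    # Step 1: small random initialization to break symmetry.
    X = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_m)]
    Theta = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_u)]
    # Step 2: gradient descent on both sets of variables at once.
    for _ in range(iters):
        X_grad = [[lam * v for v in x] for x in X]
        Theta_grad = [[lam * t for t in theta] for theta in Theta]
        for i in range(n_m):
            for j in range(n_u):
                if R[i][j] == 1:
                    err = sum(t * v for t, v in zip(Theta[j], X[i])) - Y[i][j]
                    for k in range(n_features):
                        X_grad[i][k] += err * Theta[j][k]
                        Theta_grad[j][k] += err * X[i][k]
        X = [[v - alpha * g for v, g in zip(x, gx)] for x, gx in zip(X, X_grad)]
        Theta = [[t - alpha * g for t, g in zip(th, gt)] for th, gt in zip(Theta, Theta_grad)]
    return X, Theta

# Rows are movies, columns are users; R[i][j] = 0 marks an unrated entry.
Y = [[5, 5, 0, 0], [5, 0, 0, 0], [0, 0, 5, 5], [0, 0, 4, 5]]
R = [[1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]

X, Theta = collaborative_filtering(Y, R)
final_cost = cofi_cost(Y, R, X, Theta, 0.1)
# Step 3: predict the unrated entry (movie 1, user 1) as theta^T x.
prediction = sum(t * v for t, v in zip(Theta[1], X[1]))
print(final_cost, prediction)
```

Both gradients are computed from the current values before either X or Θ is changed, which is what makes the update simultaneous.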
Vectorization: Low Rank Matrix Factorization
Given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: $Y = X\Theta^T$.
Predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x. Specifically, we are looking for a small value of $\|x^{(i)} - x^{(j)}\|$.
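Both ideas fit in a few lines. In this sketch the feature matrix `X`, the parameter matrix `Theta`, and the helper names are hypothetical values and names chosen for the illustration.

```python
import math

def predict_all(X, Theta):
    """Return the n_m x n_u matrix X Theta^T of predicted ratings."""
    return [[sum(v * t for v, t in zip(x, theta)) for theta in Theta] for x in X]

def most_similar(X, i):
    """Index of the movie whose feature vector is closest to that of movie i."""
    best_j, best_dist = None, float("inf")
    for j, xj in enumerate(X):
        if j == i:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], xj)))
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j

# Hypothetical learned features (rows: movies) and parameters (rows: users).
X = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9]]
Theta = [[5.0, 0.0], [0.0, 5.0]]

predictions = predict_all(X, Theta)
closest_to_first = most_similar(X, 0)
print(predictions, closest_to_first)
```

Entry (i, j) of the result is the predicted rating of movie i by user j, matching the layout of Y used in these notes.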
Implementation Detail: Mean Normalization
If the rating system for movies from the previous lectures is used, then new users (who have watched no movies) will be assigned new movies incorrectly. Specifically, they will be assigned θ with all components equal to zero due to the minimization of the regularization term. That is, we assume that the new user will rate all movies 0, which does not seem intuitively correct.
We rectify this problem by normalizing the data relative to the mean. First, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user.
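We can now define a vector
$$\mu = [\mu_1, \mu_2, \dots, \mu_{n_m}]$$
such that
$$\mu_i = \frac{\sum_{j:r(i,j)=1} Y_{i,j}}{\sum_{j} r(i,j)}$$
Which is effectively the mean of the previous ratings for the ith movie (where only movies that have been watched by users are counted). We can now normalize the data by subtracting μ, the vector of mean ratings, from the actual ratings of each user (each column of matrix Y), i.e. $Y'_{i,j} = Y_{i,j} - \mu_i$ for every rated entry: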
As an example, consider the following matrix Y and mean ratings μ:
The resulting Y′ vector is:
Now we must slightly modify the linear regression prediction to include the mean normalization term: $(\theta^{(j)})^T x^{(i)} + \mu_i$
Now, for a new user, the initial predicted values will be equal to the μ term instead of simply being initialized to zero, which is more accurate.
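Mean normalization can be sketched as follows. The ratings matrix and mask below are illustrative values chosen for this example, and the choice to store unrated entries as a 0.0 placeholder in the normalized matrix is an assumption of the sketch.

```python
def mean_normalize(Y, R):
    """Return (mu, Y_norm): per-movie mean over rated entries, and centered ratings."""
    mu, Y_norm = [], []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r == 1]
        mean = sum(rated) / len(rated) if rated else 0.0
        mu.append(mean)
        # Unrated entries stay at a 0.0 placeholder; the mask R still marks them.
        Y_norm.append([y - mean if r == 1 else 0.0 for y, r in zip(y_row, r_row)])
    return mu, Y_norm

# Rows are movies, columns are users; R[i][j] = 0 marks an unrated entry.
Y = [[5, 5, 0, 0], [4, 0, 0, 0], [0, 0, 5, 4], [0, 0, 5, 0]]
R = [[1, 1, 1, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
mu, Y_norm = mean_normalize(Y, R)

# A brand-new user ends up with theta = 0, so the prediction is just mu_i.
theta_new = [0.0, 0.0]
x_movie = [0.7, 0.3]  # hypothetical learned features of movie 0
prediction_new_user = sum(t * v for t, v in zip(theta_new, x_movie)) + mu[0]
print(mu, Y_norm, prediction_new_user)
```

The model is then trained on the normalized matrix, and μ is added back at prediction time.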