Data Science: Machine Learning, Week 1


About this course

  • This course covers the basic ideas behind machine learning/prediction
    • Study design - training vs. test sets
    • Conceptual issues - out of sample error, ROC curves
    • Practical implementation - the caret package
  • What this course depends on
    • The Data Scientist's Toolbox
    • R Programming
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research
    • Regression models

Who predicts?

  • Local governments -> pension payments
  • Google -> whether you will click on an ad
  • Amazon -> what movies you will watch
  • Insurance companies -> what your risk of death is
  • Johns Hopkins -> who will succeed in their programs

Why predict? Glory!

http://www.zimbio.com/photos/Chris+Volinsky


Why predict? Riches!

http://www.heritagehealthprize.com/c/hhp


Why predict? For sport!

http://www.kaggle.com/


Why predict? To save lives!

http://www.oncotypedx.com/en-US/Home


A useful (if a bit advanced) book

The Elements of Statistical Learning


A useful package

http://caret.r-forge.r-project.org/


Machine learning (more advanced material)

https://www.coursera.org/course/ml


Even more resources

  • List of machine learning resources on Quora
  • List of machine learning resources from Science
  • Advanced notes from MIT open courseware
  • Advanced notes from CMU
  • Kaggle - machine learning competitions

The central dogma of prediction


What can go wrong

http://www.sciencemag.org/content/343/6176/1203.full.pdf


Components of a predictor

question -> input data -> features -> algorithm -> parameters -> evaluation

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Start with a general question

Can I automatically detect which emails are SPAM and which are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

http://rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Dear Jeff,

Can you send me your address so I can send you the invitation?

Thanks,

Ben


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Dear Jeff,

Can you send me your address so I can send you the invitation?

Thanks,

Ben

Frequency of 'you' $= 2/17 \approx 0.118$
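
As a hedged illustration (not code from the lecture), the same kind of word-frequency feature could be computed in base R; the crude tokenization and case-insensitive matching below are assumptions.

email <- "Dear Jeff, Can you send me your address so I can send you the invitation? Thanks, Ben"
words <- unlist(strsplit(tolower(email), "[^a-z']+"))  # crude tokenization (assumption)
words <- words[words != ""]                            # drop any empty tokens
mean(words == "you")                                   # 2 of 17 words, about 0.118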


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
library(kernlab)
data(spam)
head(spam)
  make address  all num3d  our over remove internet order mail receive will people report addresses
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64   0.00   0.00      0.00
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79   0.65   0.21      0.14
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45   0.12   0.00      1.75
4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00 0.00   0.00   0.00      0.00
  free business email  you credit your font num000 money hp hpl george num650 lab labs telnet
1 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00  0.00  0   0      0      0   0    0      0
2 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43  0.43  0   0      0      0   0    0      0
3 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16  0.06  0   0      0      0   0    0      0
4 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
5 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
6 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00  0.00  0   0      0      0   0    0      0
  num857 data num415 num85 technology num1999 parts pm direct cs meeting original project   re  edu
1      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2      0    0      0     0          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3      0    0      0     0          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar
1     0          0          0.00            0.000                 0           0.778      0.000
2     0          0          0.00            0.132                 0           0.372      0.180
3     0          0          0.01            0.143                 0           0.276      0.184
4     0          0          0.00            0.137                 0           0.137      0.000
5     0          0          0.00            0.135                 0           0.135      0.000
6     0          0          0.00            0.223                 0           0.000      0.000
  charHash capitalAve capitalLong capitalTotal type
1    0.000      3.756          61          278 spam
2    0.048      5.114         101         1028 spam
3    0.010      9.821         485         2259 spam
4    0.000      3.537          40          191 spam
5    0.000      3.537          40          191 spam
6    0.000      3.000          15           54 spam

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
plot(density(spam$your[spam$type=="nonspam"]),
     col="blue", main="", xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]), col="red")
[Plot: density of the frequency of 'your' in nonspam (blue) and spam (red) emails]

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Our algorithm

  • Find a cutoff value $C$.
  • If the frequency of 'your' is $> C$, predict "spam"; otherwise predict "nonspam" (a sketch of one way to pick $C$ follows).
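
The lectures simply use $C = 0.5$ on the next slide. As a hedged sketch of one way a cutoff could be chosen, the snippet below scans a grid of candidate cutoffs and keeps the one with the best in-sample accuracy; the grid and the accuracy criterion are assumptions, not part of the course material.

library(kernlab); data(spam)
cutoffs <- seq(0, 2, by = 0.05)                      # candidate values of C (assumed grid)
acc <- sapply(cutoffs, function(C) {
  pred <- ifelse(spam$your > C, "spam", "nonspam")   # the rule: frequency of 'your' > C
  mean(pred == spam$type)                            # in-sample accuracy at this cutoff
})
cutoffs[which.max(acc)]                              # cutoff with the highest in-sample accuracy

Note that picking $C$ this way uses the full data set, so the resulting accuracy is an in-sample estimate (see the overfitting discussion later in these notes).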

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
plot(density(spam$your[spam$type=="nonspam"]),
     col="blue", main="", xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]), col="red")
abline(v=0.5, col="black")
[Plot: the same density plot with the cutoff $C = 0.5$ marked by a vertical black line]

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
table(prediction, spam$type)/length(spam$type)
prediction nonspam   spam
   nonspam  0.4590 0.1017
   spam     0.1469 0.2923

Accuracy $\approx 0.459 + 0.292 = 0.751$
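
The same number can be read directly off the table: the diagonal entries are the correctly classified fractions, so summing them gives the accuracy (assuming the prediction vector from the chunk above is still in the workspace).

tab <- table(prediction, spam$type)/length(spam$type)
sum(diag(tab))                                       # 0.459 + 0.292, about 0.751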

Relative order of importance

question > data > features > algorithms

An important point

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John Tukey

Garbage in = Garbage out

question -> input data -> features -> algorithm -> parameters -> evaluation
  1. May be easy (movie ratings -> new movie ratings)
  2. May be harder (gene expression data -> disease)
  3. Depends on what is a "good prediction".
  4. Often more data > better models
  5. The most important step!

Features matter!

question -> input data -> features -> algorithm -> parameters -> evaluation

Properties of good features

  • Lead to data compression
  • Retain relevant information
  • Are created based on expert application knowledge

Common mistakes

  • Trying to automate feature selection
  • Not paying attention to data-specific quirks
  • Throwing away information unnecessarily

May be automated with care

question -> input data -> features -> algorithm -> parameters -> evaluation

http://arxiv.org/pdf/1112.6209v5.pdf


Algorithms matter less than you'd think

question -> input data -> features -> algorithm -> parameters -> evaluation

http://arxiv.org/pdf/math/0606441.pdf


Issues to consider

http://strata.oreilly.com/2013/09/gaining-access-to-the-best-machine-learning-methods.html


Prediction is about accuracy tradeoffs

  • Interpretability versus accuracy
  • Speed versus accuracy
  • Simplicity versus accuracy
  • Scalability versus accuracy

Interpretability matters

http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf


Scalability matters

http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/

http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html


In sample versus out of sample

In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.

Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.

Key ideas

  1. Out of sample error is what you care about
  2. In sample error $<$ out of sample error
  3. The reason is overfitting
    • Matching your algorithm to the data you have

In sample versus out of sample errors

library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1], size=10), ]
spamLabel <- (smallSpam$type=="spam")*1 + 1
plot(smallSpam$capitalAve, col=spamLabel)
[Plot: capitalAve for the 10 sampled emails, colored by type (spam in red, nonspam in black)]

Prediction rule 1

  • capitalAve $>$ 2.7 = "spam"
  • capitalAve $<$ 2.40 = "nonspam"
  • capitalAve between 2.40 and 2.45 = "spam"
  • capitalAve between 2.45 and 2.7 = "nonspam"

Apply Rule 1 to smallSpam

rule1 <- function(x){
  prediction <- rep(NA, length(x))
  prediction[x > 2.7] <- "spam"
  prediction[x < 2.40] <- "nonspam"
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
table(rule1(smallSpam$capitalAve), smallSpam$type)
          nonspam spam
  nonspam       5    0
  spam          0    5

Prediction rule 2

  • capitalAve $>$ 2.80 = "spam"
  • capitalAve $\leq$ 2.80 = "nonspam"

Apply Rule 2 to smallSpam

rule2 <- function(x){
  prediction <- rep(NA, length(x))
  prediction[x > 2.8] <- "spam"
  prediction[x <= 2.8] <- "nonspam"
  return(prediction)
}
table(rule2(smallSpam$capitalAve), smallSpam$type)
          nonspam spam
  nonspam       5    1
  spam          0    4

Apply to complete spam data

table(rule1(spam$capitalAve),spam$type)
          nonspam spam
  nonspam    2141  588
  spam        647 1225
table(rule2(spam$capitalAve),spam$type)
          nonspam spam
  nonspam    2224  642
  spam        564 1171
mean(rule1(spam$capitalAve)==spam$type)
[1] 0.7316
mean(rule2(spam$capitalAve)==spam$type)
[1] 0.7379

Look at accuracy

sum(rule1(spam$capitalAve)==spam$type)
[1] 3366
sum(rule2(spam$capitalAve)==spam$type)
[1] 3395

What's going on?

Overfitting
  • Data have two parts
    • Signal
    • Noise
  • The goal of a predictor is to find signal
  • You can always design a perfect in-sample predictor
  • You capture both signal + noise when you do that
  • Predictor won't perform as well on new samples

http://en.wikipedia.org/wiki/Overfitting

Prediction study design

  1. Define your error rate
  2. Split data into:
    • Training, Testing, Validation (optional)
  3. On the training set pick features
    • Use cross-validation
  4. On the training set pick prediction function
    • Use cross-validation
  5. If no validation
    • Apply 1x to the test set (i.e., use it only once; see the workflow sketch after this list)
  6. If validation
    • Apply to test set and refine
    • Apply 1x to validation
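
A hedged sketch of this design using the caret package mentioned above: split once into training and testing, do all feature and model selection with cross-validation on the training set only, then apply the final model to the test set exactly once. The 60/40 split, the glm method, and the two chosen features are illustrative assumptions.

library(caret); library(kernlab); data(spam)
set.seed(123)
inTrain  <- createDataPartition(y = spam$type, p = 0.6, list = FALSE)  # stratified 60/40 split
training <- spam[ inTrain, ]
testing  <- spam[-inTrain, ]                          # set aside until the very end
ctrl <- trainControl(method = "cv", number = 10)      # 10-fold cross-validation on training only
fit  <- train(type ~ your + capitalAve, data = training,
              method = "glm", trControl = ctrl)       # pick the prediction function on training
confusionMatrix(predict(fit, testing), testing$type)  # apply 1x to the test set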

Know the benchmarks

http://www.heritagehealthprize.com/c/hhp/leaderboard

A benchmark is a baseline against which you can compare your predictor.


Study design

http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf

Note: the "probe" set here is an intermediate hold-out data set.


Used by the professionals

http://www.kaggle.com/


Avoid small sample sizes

  • Suppose you are predicting a binary outcome
    • Diseased/healthy
    • Click on ad/not click on ad
  • One classifier is flipping a coin
  • Probability of perfect classification is approximately:
    • $\left(\frac{1}{2}\right)^{sample \; size}$
    • $n = 1$: flipping a coin gives a 50% chance of 100% accuracy
    • $n = 2$: flipping a coin gives a 25% chance of 100% accuracy
    • $n = 10$: flipping a coin gives about a 0.1% chance of 100% accuracy (see the quick check after this list)
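
A quick check of those numbers (assuming each guess is correct with probability exactly 1/2):

n <- c(1, 2, 10)
0.5^n     # 0.50, 0.25, and about 0.001 (roughly 0.1%)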

Rules of thumb for prediction study design

  • If you have a large sample size (a split sketch follows this list)
    • 60% training
    • 20% test
    • 20% validation
  • If you have a medium sample size
    • 60% training
    • 40% testing
  • If you have a small sample size
    • Do cross validation
    • Report caveat of small sample size
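
A hedged sketch of the 60/20/20 rule of thumb, using caret's createDataPartition twice (first peel off 60% for training, then split the remaining 40% in half); the use of the spam data here is only for illustration.

library(caret); library(kernlab); data(spam)
set.seed(32323)
inTrain   <- createDataPartition(spam$type, p = 0.6, list = FALSE)
training  <- spam[ inTrain, ]
remaining <- spam[-inTrain, ]
inTest     <- createDataPartition(remaining$type, p = 0.5, list = FALSE)  # half of the remaining 40%
testing    <- remaining[ inTest, ]
validation <- remaining[-inTest, ]
c(train = nrow(training), test = nrow(testing), validation = nrow(validation))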

Some principles to remember

  • Set the test/validation set aside and don't look at it
  • In general randomly sample training and test
  • Your data sets must reflect structure of the problem
    • If predictions evolve with time, split train/test in time chunks (called backtesting in finance)
  • All subsets should reflect as much diversity as possible
    • Random assignment does this
    • You can also try to balance by features - but this is tricky

Basic terms

In general, Positive = identified and negative = rejected. Therefore:

True positive = correctly identified

False positive = incorrectly identified

True negative = correctly rejected

False negative = incorrectly rejected

Medical testing example:

True positive = Sick people correctly diagnosed as sick

False positive = Healthy people incorrectly identified as sick

True negative = Healthy people correctly identified as healthy

False negative = Sick people incorrectly identified as healthy.

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
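
As a hedged illustration with the spam data (treating "spam" as the positive class and re-using rule2 from the earlier slides), the four counts and the two standard rates could be computed like this:

tab <- table(predicted = rule2(spam$capitalAve), truth = spam$type)
TP <- tab["spam", "spam"];       FP <- tab["spam", "nonspam"]
TN <- tab["nonspam", "nonspam"]; FN <- tab["nonspam", "spam"]
TP / (TP + FN)   # sensitivity: true spam called spam, about 0.65 here
TN / (TN + FP)   # specificity: true nonspam called nonspam, about 0.80 here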


Key quantities

http://en.wikipedia.org/wiki/Sensitivity_and_specificity

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Key quantities as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Screening tests

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


General population

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


General population as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


At risk subpopulation

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


At risk subpopulation as fraction

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Key public health issue

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/



For continuous data

Mean squared error (MSE):

$$\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2$$

Root mean squared error (RMSE):

$$\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}$$
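
A small worked computation of both quantities (the prediction and truth vectors are made-up toy numbers):

prediction <- c(1.2, 3.4, 2.1, 5.0)
truth      <- c(1.0, 3.0, 2.5, 4.0)
mse  <- mean((prediction - truth)^2)   # mean squared error: 0.34
rmse <- sqrt(mse)                      # root MSE, same units as the outcome: about 0.58
c(MSE = mse, RMSE = rmse)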


Common error measures

  1. Mean squared error (or root mean squared error)
    • Continuous data, sensitive to outliers
  2. Median absolute deviation
    • Continuous data, often more robust
  3. Sensitivity (recall)
    • If you want few missed positives
  4. Specificity
    • If you want few negatives called positives
  5. Accuracy
    • Weights false positives/negatives equally
  6. Concordance
    • One example is kappa
  7. Predictive value of a positive (precision)
    • When you are screening and prevalence is low

Why a curve?

  • In binary classification you are predicting one of two categories
    • Alive/dead
    • Click on ad/don't click
  • But your predictions are often quantitative
    • Probability of being alive
    • Prediction on a scale from 1 to 10
  • The cutoff you choose gives different results: in the SPAM example, emails with frequency of 'your' above 0.5 were called spam, and a different cutoff (say 0.3) would change which emails get flagged (a cutoff-sweep sketch follows this list)
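
A hedged sketch of that idea on the spam data: sweep the cutoff on the frequency of 'your' and track how the true positive rate (sensitivity) and false positive rate (1 - specificity) move together. Plotting FPR against TPR over all cutoffs traces out an ROC curve; the particular grid of cutoffs is an assumption.

library(kernlab); data(spam)
cutoffs <- seq(0, 4, by = 0.1)
roc <- t(sapply(cutoffs, function(C) {
  pred <- spam$your > C                       # TRUE = call it spam at this cutoff
  c(cutoff = C,
    TPR = mean(pred[spam$type == "spam"]),    # true positive rate (sensitivity)
    FPR = mean(pred[spam$type == "nonspam"])) # false positive rate (1 - specificity)
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "l", xlab = "FPR", ylab = "TPR")  # the ROC curve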

ROC curves

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


An example

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


Area under the curve

  • AUC = 0.5: random guessing
  • AUC = 1: perfect classifier
  • In general, an AUC above 0.8 is considered "good"

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


What is good?

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


Study design

http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf


Key idea

  1. Accuracy on the training set (resubstitution accuracy) is optimistic
  2. A better estimate comes from an independent set (test set accuracy)
  3. But we can't use the test set when building the model or it becomes part of the training set
  4. So we estimate the test set accuracy with the training set.

Cross-validation

Approach:

  1. Use the training set

  2. Split it into training/test sets

  3. Build a model on the training set

  4. Evaluate on the test set

  5. Repeat and average the estimated errors (see the k-fold sketch below)

Used for:

  1. Picking variables to include in a model

  2. Picking the type of prediction function to use

  3. Picking the parameters in the prediction function

  4. Comparing different predictors
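
A minimal sketch of k-fold cross-validation written by hand (caret automates this, but the explicit loop shows the idea); the 10 folds, the logistic-regression model, and the single feature are assumptions for illustration.

library(kernlab); data(spam)
set.seed(32323)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(spam)))      # randomly assign each email to a fold
errs <- sapply(1:k, function(i) {
  train <- spam[folds != i, ]                           # build the model on the other k-1 folds
  test  <- spam[folds == i, ]                           # evaluate on the held-out fold
  fit   <- glm(type ~ your, data = train, family = binomial)   # P(spam) via logistic regression
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, "spam", "nonspam")
  mean(pred != test$type)                               # misclassification rate on this fold
})
mean(errs)                                              # average the estimated errors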


Random subsampling


K-fold


Leave one out


Considerations

  • For time series data, the data must be used in "chunks"
  • For k-fold cross validation
    • Larger k = less bias, more variance
    • Smaller k = more bias, less variance
  • Random sampling must be done without replacement
  • Random sampling with replacement is the bootstrap
    • Underestimates the error
    • Can be corrected, but it is complicated (0.632 Bootstrap)
  • If you cross-validate to pick predictors, you must still estimate errors on independent data.

A successful predictor

fivethirtyeight.com


Polling data

http://www.gallup.com/


Weighting the data

http://www.fivethirtyeight.com/2010/06/pollster-ratings-v40-methodology.html


Key idea

To predict X use data related to X

Key idea

To predict player performance use data about player performance


Key idea

To predict movie preferences use data about movie preferences


Key idea

To predict hospitalizations use data about hospitalizations


Not a hard rule

To predict flu outbreaks use Google searches

http://www.google.org/flutrends/


Looser connection = harder prediction


Data properties matter


Unrelated data is the most common mistake

http://www.nejm.org/doi/full/10.1056/NEJMon1211064

