Data Science: Machine Learning, Week 1


About this course

  • This course covers the basic ideas behind machine learning/prediction
    • Study design - training vs. test sets
    • Conceptual issues - out of sample error, ROC curves
    • Practical implementation - the caret package
  • What this course depends on
    • The Data Scientist's Toolbox
    • R Programming
  • What would be useful
    • Exploratory analysis
    • Reporting Data and Reproducible Research
    • Regression models

Who predicts?

  • Local governments -> pension payments
  • Google -> whether you will click on an ad
  • Amazon -> what movies you will watch
  • Insurance companies -> what your risk of death is
  • Johns Hopkins -> who will succeed in their programs

Why predict? Glory!

http://www.zimbio.com/photos/Chris+Volinsky


Why predict? Riches!

http://www.heritagehealthprize.com/c/hhp


Why predict? For sport!

http://www.kaggle.com/


Why predict? To save lives!

http://www.oncotypedx.com/en-US/Home


A useful (if a bit advanced) book

The Elements of Statistical Learning


A useful package

http://caret.r-forge.r-project.org/


Machine learning (more advanced material)

https://www.coursera.org/course/ml


Even more resources

  • List of machine learning resources on Quora
  • List of machine learning resources from Science
  • Advanced notes from MIT open courseware
  • Advanced notes from CMU
  • Kaggle - machine learning competitions

The central dogma of prediction


What can go wrong

http://www.sciencemag.org/content/343/6176/1203.full.pdf


Components of a predictor

question -> input data -> features -> algorithm -> parameters -> evaluation

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Start with a general question

Can I automatically detect which emails are SPAM and which are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

http://rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Dear Jeff,

Can you send me your address so I can send you the invitation?

Thanks,

Ben


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Dear Jeff,

Can you send me your address so I can send you the invitation?

Thanks,

Ben

Frequency of 'you' $= 2/17 \approx 0.118$
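
As a hedged illustration (not code from the lecture), the same kind of word-frequency feature could be computed in base R; the crude tokenization and case-insensitive matching below are assumptions.

email <- "Dear Jeff, Can you send me your address so I can send you the invitation? Thanks, Ben"
words <- unlist(strsplit(tolower(email), "[^a-z']+"))  # crude tokenization (assumption)
words <- words[words != ""]                            # drop any empty tokens
mean(words == "you")                                   # 2 of 17 words, about 0.118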


SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
library(kernlab)
data(spam)
head(spam)
  make address  all num3d  our over remove internet order mail receive will people report addresses
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64   0.00   0.00      0.00
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79   0.65   0.21      0.14
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45   0.12   0.00      1.75
4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31   0.31   0.00      0.00
6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00 0.00   0.00   0.00      0.00
  free business email  you credit your font num000 money hp hpl george num650 lab labs telnet
1 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00  0.00  0   0      0      0   0    0      0
2 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43  0.43  0   0      0      0   0    0      0
3 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16  0.06  0   0      0      0   0    0      0
4 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
5 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00  0.00  0   0      0      0   0    0      0
6 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00  0.00  0   0      0      0   0    0      0
  num857 data num415 num85 technology num1999 parts pm direct cs meeting original project   re  edu
1      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2      0    0      0     0          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3      0    0      0     0          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6      0    0      0     0          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket charExclamation charDollar
1     0          0          0.00            0.000                 0           0.778      0.000
2     0          0          0.00            0.132                 0           0.372      0.180
3     0          0          0.01            0.143                 0           0.276      0.184
4     0          0          0.00            0.137                 0           0.137      0.000
5     0          0          0.00            0.135                 0           0.135      0.000
6     0          0          0.00            0.223                 0           0.000      0.000
  charHash capitalAve capitalLong capitalTotal type
1    0.000      3.756          61          278 spam
2    0.048      5.114         101         1028 spam
3    0.010      9.821         485         2259 spam
4    0.000      3.537          40          191 spam
5    0.000      3.537          40          191 spam
6    0.000      3.000          15           54 spam

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
plot(density(spam$your[spam$type=="nonspam"]),
     col="blue", main="", xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]), col="red")
[Plot: density of the frequency of 'your' in nonspam (blue) and spam (red) emails]

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation

Our algorithm

  • Find a cutoff value $C$.
  • If the frequency of 'your' is $> C$, predict "spam"; otherwise predict "nonspam" (a sketch of one way to pick $C$ follows).
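
The lectures simply use $C = 0.5$ on the next slide. As a hedged sketch of one way a cutoff could be chosen, the snippet below scans a grid of candidate cutoffs and keeps the one with the best in-sample accuracy; the grid and the accuracy criterion are assumptions, not part of the course material.

library(kernlab); data(spam)
cutoffs <- seq(0, 2, by = 0.05)                      # candidate values of C (assumed grid)
acc <- sapply(cutoffs, function(C) {
  pred <- ifelse(spam$your > C, "spam", "nonspam")   # the rule: frequency of 'your' > C
  mean(pred == spam$type)                            # in-sample accuracy at this cutoff
})
cutoffs[which.max(acc)]                              # cutoff with the highest in-sample accuracy

Note that picking $C$ this way uses the full data set, so the resulting accuracy is an in-sample estimate (see the overfitting discussion later in these notes).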

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
plot(density(spam$your[spam$type=="nonspam"]),
     col="blue", main="", xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]), col="red")
abline(v=0.5, col="black")
[Plot: the same density plot with the cutoff $C = 0.5$ marked by a vertical black line]

SPAM Example

question -> input data -> features -> algorithm -> parameters -> evaluation
prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
table(prediction, spam$type)/length(spam$type)
prediction nonspam   spam
   nonspam  0.4590 0.1017
   spam     0.1469 0.2923

Accuracy $\approx 0.459 + 0.292 = 0.751$
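
The same number can be read directly off the table: the diagonal entries are the correctly classified fractions, so summing them gives the accuracy (assuming the prediction vector from the chunk above is still in the workspace).

tab <- table(prediction, spam$type)/length(spam$type)
sum(diag(tab))                                       # 0.459 + 0.292, about 0.751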

Relative order of importance

question > data > features > algorithms

An important point

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John Tukey

Garbage in = Garbage out

question -> input data -> features -> algorithm -> parameters -> evaluation
  1. May be easy (movie ratings -> new movie ratings)
  2. May be harder (gene expression data -> disease)
  3. Depends on what is a "good prediction".
  4. Often more data > better models
  5. The most important step!

Features matter!

question -> input data -> features -> algorithm -> parameters -> evaluation

Properties of good features

  • Lead to data compression
  • Retain relevant information
  • Are created based on expert application knowledge

Common mistakes

  • Trying to automate feature selection
  • Not paying attention to data-specific quirks
  • Throwing away information unnecessarily

May be automated with care

question -> input data -> features -> algorithm -> parameters -> evaluation

http://arxiv.org/pdf/1112.6209v5.pdf


Algorithms matter less than you'd think

question -> input data -> features -> algorithm -> parameters -> evaluation

http://arxiv.org/pdf/math/0606441.pdf


Issues to consider

http://strata.oreilly.com/2013/09/gaining-access-to-the-best-machine-learning-methods.html


Prediction is about accuracy tradeoffs

  • Interpretability versus accuracy
  • Speed versus accuracy
  • Simplicity versus accuracy
  • Scalability versus accuracy

Interpretability matters

http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf


Scalability matters

http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/

http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html


In sample versus out of sample

In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.

Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.

Key ideas

  1. Out of sample error is what you care about
  2. In sample error $<$ out of sample error
  3. The reason is overfitting
    • Matching your algorithm to the data you have

In sample versus out of sample errors

library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1], size=10), ]
spamLabel <- (smallSpam$type=="spam")*1 + 1
plot(smallSpam$capitalAve, col=spamLabel)
[Plot: capitalAve for the 10 sampled emails, colored by type (spam in red, nonspam in black)]

Prediction rule 1

  • capitalAve $>$ 2.7 = "spam"
  • capitalAve $<$ 2.40 = "nonspam"
  • capitalAve between 2.40 and 2.45 = "spam"
  • capitalAve between 2.45 and 2.7 = "nonspam"

Apply Rule 1 to smallSpam

rule1 <- function(x){
  prediction <- rep(NA, length(x))
  prediction[x > 2.7] <- "spam"
  prediction[x < 2.40] <- "nonspam"
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
table(rule1(smallSpam$capitalAve), smallSpam$type)
          nonspam spam
  nonspam       5    0
  spam          0    5

Prediction rule 2

  • capitalAve $>$ 2.80 = "spam"
  • capitalAve $\leq$ 2.80 = "nonspam"

Apply Rule 2 to smallSpam

rule2 <- function(x){
  prediction <- rep(NA, length(x))
  prediction[x > 2.8] <- "spam"
  prediction[x <= 2.8] <- "nonspam"
  return(prediction)
}
table(rule2(smallSpam$capitalAve), smallSpam$type)
          nonspam spam
  nonspam       5    1
  spam          0    4

Apply to complete spam data

table(rule1(spam$capitalAve),spam$type)
          nonspam spam
  nonspam    2141  588
  spam        647 1225
table(rule2(spam$capitalAve),spam$type)
          nonspam spam
  nonspam    2224  642
  spam        564 1171
mean(rule1(spam$capitalAve)==spam$type)
[1] 0.7316
mean(rule2(spam$capitalAve)==spam$type)
[1] 0.7379

Look at accuracy

sum(rule1(spam$capitalAve)==spam$type)
[1] 3366
sum(rule2(spam$capitalAve)==spam$type)
[1] 3395

What's going on?

Overfitting
  • Data have two parts
    • Signal
    • Noise
  • The goal of a predictor is to find signal
  • You can always design a perfect in-sample predictor
  • You capture both signal + noise when you do that
  • Predictor won't perform as well on new samples

http://en.wikipedia.org/wiki/Overfitting

Prediction study design

  1. Define your error rate
  2. Split data into:
    • Training, Testing, Validation (optional)
  3. On the training set pick features
    • Use cross-validation
  4. On the training set pick prediction function
    • Use cross-validation
  5. If no validation
    • Apply 1x to the test set (i.e., use it only once; see the workflow sketch after this list)
  6. If validation
    • Apply to test set and refine
    • Apply 1x to validation
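
A hedged sketch of this design using the caret package mentioned above: split once into training and testing, do all feature and model selection with cross-validation on the training set only, then apply the final model to the test set exactly once. The 60/40 split, the glm method, and the two chosen features are illustrative assumptions.

library(caret); library(kernlab); data(spam)
set.seed(123)
inTrain  <- createDataPartition(y = spam$type, p = 0.6, list = FALSE)  # stratified 60/40 split
training <- spam[ inTrain, ]
testing  <- spam[-inTrain, ]                          # set aside until the very end
ctrl <- trainControl(method = "cv", number = 10)      # 10-fold cross-validation on training only
fit  <- train(type ~ your + capitalAve, data = training,
              method = "glm", trControl = ctrl)       # pick the prediction function on training
confusionMatrix(predict(fit, testing), testing$type)  # apply 1x to the test set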

Know the benchmarks

http://www.heritagehealthprize.com/c/hhp/leaderboard

A benchmark is a baseline against which you can compare your predictor.


Study design

http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf

Note: the "probe" set here is an intermediate hold-out data set.


Used by the professionals

http://www.kaggle.com/


Avoid small sample sizes

  • Suppose you are predicting a binary outcome
    • Diseased/healthy
    • Click on ad/not click on ad
  • One classifier is flipping a coin
  • Probability of perfect classification is approximately:
    • $\left(\frac{1}{2}\right)^{sample \; size}$
    • $n = 1$: flipping a coin gives a 50% chance of 100% accuracy
    • $n = 2$: flipping a coin gives a 25% chance of 100% accuracy
    • $n = 10$: flipping a coin gives about a 0.1% chance of 100% accuracy (see the quick check after this list)
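
A quick check of those numbers (assuming each guess is correct with probability exactly 1/2):

n <- c(1, 2, 10)
0.5^n     # 0.50, 0.25, and about 0.001 (roughly 0.1%)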

Rules of thumb for prediction study design

  • If you have a large sample size (a split sketch follows this list)
    • 60% training
    • 20% test
    • 20% validation
  • If you have a medium sample size
    • 60% training
    • 40% testing
  • If you have a small sample size
    • Do cross validation
    • Report caveat of small sample size
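
A hedged sketch of the 60/20/20 rule of thumb, using caret's createDataPartition twice (first peel off 60% for training, then split the remaining 40% in half); the use of the spam data here is only for illustration.

library(caret); library(kernlab); data(spam)
set.seed(32323)
inTrain   <- createDataPartition(spam$type, p = 0.6, list = FALSE)
training  <- spam[ inTrain, ]
remaining <- spam[-inTrain, ]
inTest     <- createDataPartition(remaining$type, p = 0.5, list = FALSE)  # half of the remaining 40%
testing    <- remaining[ inTest, ]
validation <- remaining[-inTest, ]
c(train = nrow(training), test = nrow(testing), validation = nrow(validation))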

Some principles to remember

  • Set the test/validation set aside and don't look at it
  • In general randomly sample training and test
  • Your data sets must reflect structure of the problem
    • If predictions evolve with time, split train/test in time chunks (called backtesting in finance)
  • All subsets should reflect as much diversity as possible
    • Random assignment does this
    • You can also try to balance by features - but this is tricky

Basic terms

In general, Positive = identified and negative = rejected. Therefore:

True positive = correctly identified

False positive = incorrectly identified

True negative = correctly rejected

False negative = incorrectly rejected

Medical testing example:

True positive = Sick people correctly diagnosed as sick

False positive = Healthy people incorrectly identified as sick

True negative = Healthy people correctly identified as healthy

False negative = Sick people incorrectly identified as healthy.

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
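
As a hedged illustration with the spam data (treating "spam" as the positive class and re-using rule2 from the earlier slides), the four counts and the two standard rates could be computed like this:

tab <- table(predicted = rule2(spam$capitalAve), truth = spam$type)
TP <- tab["spam", "spam"];       FP <- tab["spam", "nonspam"]
TN <- tab["nonspam", "nonspam"]; FN <- tab["nonspam", "spam"]
TP / (TP + FN)   # sensitivity: true spam called spam, about 0.65 here
TN / (TN + FP)   # specificity: true nonspam called nonspam, about 0.80 here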


Key quantities

http://en.wikipedia.org/wiki/Sensitivity_and_specificity

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Key quantities as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Screening tests

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


General population

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


General population as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


At risk subpopulation

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


At risk subpopulation as fraction

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/


Key public health issue

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/



For continuous data

Mean squared error (MSE):

$$\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2$$

Root mean squared error (RMSE):

$$\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}$$
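
A small worked computation of both quantities (the prediction and truth vectors are made-up toy numbers):

prediction <- c(1.2, 3.4, 2.1, 5.0)
truth      <- c(1.0, 3.0, 2.5, 4.0)
mse  <- mean((prediction - truth)^2)   # mean squared error: 0.34
rmse <- sqrt(mse)                      # root MSE, same units as the outcome: about 0.58
c(MSE = mse, RMSE = rmse)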


Common error measures

  1. Mean squared error (or root mean squared error)
    • Continuous data, sensitive to outliers
  2. Median absolute deviation
    • Continuous data, often more robust
  3. Sensitivity (recall)
    • If you want few missed positives
  4. Specificity
    • If you want few negatives called positives
  5. Accuracy
    • Weights false positives/negatives equally
  6. Concordance
    • One example is kappa
  7. Predictive value of a positive (precision)
    • When you are screening and prevalence is low

Why a curve?

  • In binary classification you are predicting one of two categories
    • Alive/dead
    • Click on ad/don't click
  • But your predictions are often quantitative
    • Probability of being alive
    • Prediction on a scale from 1 to 10
  • The cutoff you choose gives different results: in the SPAM example, emails with frequency of 'your' above 0.5 were called spam, and a different cutoff (say 0.3) would change which emails get flagged (a cutoff-sweep sketch follows this list)
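
A hedged sketch of that idea on the spam data: sweep the cutoff on the frequency of 'your' and track how the true positive rate (sensitivity) and false positive rate (1 - specificity) move together. Plotting FPR against TPR over all cutoffs traces out an ROC curve; the particular grid of cutoffs is an assumption.

library(kernlab); data(spam)
cutoffs <- seq(0, 4, by = 0.1)
roc <- t(sapply(cutoffs, function(C) {
  pred <- spam$your > C                       # TRUE = call it spam at this cutoff
  c(cutoff = C,
    TPR = mean(pred[spam$type == "spam"]),    # true positive rate (sensitivity)
    FPR = mean(pred[spam$type == "nonspam"])) # false positive rate (1 - specificity)
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "l", xlab = "FPR", ylab = "TPR")  # the ROC curve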

ROC curves

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


An example

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


Area under the curve

  • AUC = 0.5: random guessing
  • AUC = 1: perfect classifier
  • In general, an AUC above 0.8 is considered "good"

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


What is good?

http://en.wikipedia.org/wiki/Receiver_operating_characteristic


Study design

http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf


Key idea

  1. Accuracy on the training set (resubstitution accuracy) is optimistic
  2. A better estimate comes from an independent set (test set accuracy)
  3. But we can't use the test set when building the model or it becomes part of the training set
  4. So we estimate the test set accuracy with the training set.

Cross-validation

Approach:

  1. Use the training set

  2. Split it into training/test sets

  3. Build a model on the training set

  4. Evaluate on the test set

  5. Repeat and average the estimated errors (see the k-fold sketch below)

Used for:

  1. Picking variables to include in a model

  2. Picking the type of prediction function to use

  3. Picking the parameters in the prediction function

  4. Comparing different predictors
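
A minimal sketch of k-fold cross-validation written by hand (caret automates this, but the explicit loop shows the idea); the 10 folds, the logistic-regression model, and the single feature are assumptions for illustration.

library(kernlab); data(spam)
set.seed(32323)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(spam)))      # randomly assign each email to a fold
errs <- sapply(1:k, function(i) {
  train <- spam[folds != i, ]                           # build the model on the other k-1 folds
  test  <- spam[folds == i, ]                           # evaluate on the held-out fold
  fit   <- glm(type ~ your, data = train, family = binomial)   # P(spam) via logistic regression
  pred  <- ifelse(predict(fit, test, type = "response") > 0.5, "spam", "nonspam")
  mean(pred != test$type)                               # misclassification rate on this fold
})
mean(errs)                                              # average the estimated errors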


Random subsampling


K-fold


Leave one out


Considerations

  • For time series data, the data must be used in "chunks"
  • For k-fold cross validation
    • Larger k = less bias, more variance
    • Smaller k = more bias, less variance
  • Random sampling must be done without replacement
  • Random sampling with replacement is the bootstrap
    • Underestimates the error
    • Can be corrected, but it is complicated (0.632 Bootstrap)
  • If you cross-validate to pick predictors, you must still estimate errors on independent data.

A successful predictor

fivethirtyeight.com


Polling data

http://www.gallup.com/


Weighting the data

http://www.fivethirtyeight.com/2010/06/pollster-ratings-v40-methodology.html


Key idea

To predict X use data related to X

Key idea

To predict player performance use data about player performance


Key idea

To predict movie preferences use data about movie preferences


Key idea

To predict hospitalizations use data about hospitalizations


Not a hard rule

To predict flu outbreaks use Google searches

http://www.google.org/flutrends/


Looser connection = harder prediction


Data properties matter


Unrelated data is the most common mistake

http://www.nejm.org/doi/full/10.1056/NEJMon1211064

