Data Science: Machine Learning, Week 1
About this course
- This course covers the basic ideas behind machine learning/prediction
- Study design - training vs. test sets
- Conceptual issues - out of sample error, ROC curves
- Practical implementation - the caret package
- What this course depends on
- The Data Scientist's Toolbox
- R Programming
- What would be useful
- Exploratory analysis
- Reporting Data and Reproducible Research
- Regression models
Who predicts?
- Local governments -> pension payments
- Google -> whether you will click on an ad
- Amazon -> what movies you will watch
- Insurance companies -> what your risk of death is
- Johns Hopkins -> who will succeed in their programs
Why predict? Glory!
http://www.zimbio.com/photos/Chris+Volinsky
Why predict? Riches!
http://www.heritagehealthprize.com/c/hhp
Why predict? For sport!
http://www.kaggle.com/
Why predict? To save lives!
http://www.oncotypedx.com/en-US/Home
A useful (if a bit advanced) book
The elements of statistical learning
A useful package
http://caret.r-forge.r-project.org/
Machine learning (more advanced material)
https://www.coursera.org/course/ml
Even more resources
- List of machine learning resources on Quora
- List of machine learning resources from Science
- Advanced notes from MIT open courseware
- Advanced notes from CMU
- Kaggle - machine learning competitions
The central dogma of prediction
What can go wrong
http://www.sciencemag.org/content/343/6176/1203.full.pdf
Components of a predictor
question -> input data -> features -> algorithm -> parameters -> evaluation
SPAM Example
Start with a general question
Can I automatically detect emails that are SPAM and those that are not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
SPAM Example
http://rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html
SPAM Example
Dear Jeff,
Can you send me your address so I can send you the invitation?
Thanks,
Ben
Frequency of 'you' $= 2/17 \approx 0.118$
SPAM Example
Excerpt of the kernlab spam data (first 6 rows; the full data set has 58 columns — 57 word/character-frequency features plus type):

  make address  all num3d  our ... capitalAve capitalLong capitalTotal type
1 0.00   0.64 0.64     0 0.32 ...      3.756          61          278 spam
2 0.21   0.28 0.50     0 0.14 ...      5.114         101         1028 spam
3 0.06   0.00 0.71     0 1.23 ...      9.821         485         2259 spam
4 0.00   0.00 0.00     0 0.63 ...      3.537          40          191 spam
5 0.00   0.00 0.00     0 0.63 ...      3.537          40          191 spam
6 0.00   0.00 0.00     0 1.85 ...      3.000          15           54 spam
SPAM Example
Our algorithm
- Find a value $C$.
- If the frequency of 'your' $> C$, predict "spam"; otherwise predict "nonspam".
SPAM Example
Result (rows: prediction; columns: true type):

prediction  nonspam    spam
  nonspam    0.4590  0.1017
  spam       0.1469  0.2923
Accuracy$ \approx 0.459 + 0.292 = 0.751$
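The word-frequency feature and the single-cutoff rule above are simple to compute. The course implements everything in R with the caret package; the sketch below uses Python purely for illustration, and the cutoff value is an arbitrary choice for this one email, not a fitted $C$.

```python
def word_frequency(text, word):
    """Fraction of whitespace-separated tokens equal to `word` (case-insensitive)."""
    tokens = text.lower().split()
    return tokens.count(word.lower()) / len(tokens)

def classify(text, word="you", cutoff=0.1):
    """Predict 'spam' when the word's frequency exceeds the cutoff."""
    return "spam" if word_frequency(text, word) > cutoff else "nonspam"

email = ("Dear Jeff, Can you send me your address "
         "so I can send you the invitation? Thanks, Ben")

print(word_frequency(email, "you"))  # 2/17, about 0.118, as on the slide
```

With cutoff 0.1 this rule flags Ben's friendly email as spam, which is exactly why the cutoff (and the feature) must be chosen on training data and evaluated out of sample.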
Relative order of importance
question > data > features > algorithms
An important point
"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." — John Tukey
Garbage in = Garbage out
- May be easy (movie ratings -> new movie ratings)
- May be harder (gene expression data -> disease)
- Depends on what is a "good prediction".
- Often more data > better models
- The most important step!
Features matter!
Properties of good features
- Lead to data compression
- Retain relevant information
- Are created based on expert application knowledge
Common mistakes
- Trying to automate feature selection
- Not paying attention to data-specific quirks
- Throwing away information unnecessarily
May be automated with care
http://arxiv.org/pdf/1112.6209v5.pdf
Algorithms matter less than you'd think
http://arxiv.org/pdf/math/0606441.pdf
Issues to consider
http://strata.oreilly.com/2013/09/gaining-access-to-the-best-machine-learning-methods.html
Prediction is about accuracy tradeoffs
- Interpretability versus accuracy
- Speed versus accuracy
- Simplicity versus accuracy
- Scalability versus accuracy
Interpretability matters
http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf
Scalability matters
http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
In sample versus out of sample
In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.
Key ideas
- Out of sample error is what you care about
- In sample error $<$ out of sample error
- The reason is overfitting
- Matching your algorithm to the data you have
In sample versus out of sample errors
Prediction rule 1
- capitalAve $>$ 2.7 = "spam"
- capitalAve $<$ 2.40 = "nonspam"
- capitalAve between 2.40 and 2.45 = "spam"
- capitalAve between 2.45 and 2.7 = "nonspam"
Apply Rule 1 to smallSpam
Result (rows: prediction; columns: true type):

         nonspam spam
nonspam        5    0
spam           0    5
Prediction rule 2
- capitalAve $>$ 2.40 = "spam"
- capitalAve $\leq$ 2.40 = "nonspam"
Apply Rule 2 to smallSpam
Result (rows: prediction; columns: true type):

         nonspam spam
nonspam        5    1
spam           0    4
Apply to complete spam data
Rule 1 (rows: prediction; columns: true type):

         nonspam spam
nonspam     2141  588
spam         647 1225

Rule 2:

         nonspam spam
nonspam     2224  642
spam         564 1171

Accuracy, Rule 1: 0.7316
Accuracy, Rule 2: 0.7379
Look at accuracy
Number correct, Rule 1: 3366
Number correct, Rule 2: 3395
What's going on?
Overfitting
- Data have two parts
- Signal
- Noise
- The goal of a predictor is to find signal
- You can always design a perfect in-sample predictor
- You capture both signal + noise when you do that
- Predictor won't perform as well on new samples
http://en.wikipedia.org/wiki/Overfitting
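Rules 1 and 2 from the previous slides show this concretely: the carved-up rule is perfect in sample but loses on new data. The Python sketch below (the course itself uses R) applies both rules to hypothetical (capitalAve, label) pairs — the values are invented for illustration and are not the actual smallSpam data.

```python
def rule1(cap_ave):
    """Overfit rule: thresholds carved to fit the training sample exactly."""
    if cap_ave > 2.7:
        return "spam"
    if cap_ave < 2.40:
        return "nonspam"
    if cap_ave <= 2.45:     # 2.40 <= capitalAve <= 2.45
        return "spam"
    return "nonspam"        # 2.45 < capitalAve <= 2.7

def rule2(cap_ave):
    """Simple rule: a single cutoff."""
    return "spam" if cap_ave > 2.40 else "nonspam"

def accuracy(rule, data):
    return sum(rule(x) == label for x, label in data) / len(data)

# Hypothetical (capitalAve, label) pairs standing in for a small training
# sample and for fresh data; all values invented for illustration.
train = [(1.0, "nonspam"), (1.5, "nonspam"), (2.0, "nonspam"),
         (2.2, "nonspam"), (2.35, "nonspam"),
         (2.40, "spam"), (3.0, "spam"), (3.5, "spam"),
         (4.2, "spam"), (5.0, "spam")]
test = [(2.5, "spam"), (2.6, "spam"), (2.55, "spam"), (3.1, "spam"),
        (2.9, "spam"), (4.0, "spam"), (2.2, "spam"),
        (1.2, "nonspam"), (2.0, "nonspam"), (2.3, "nonspam")]

print(accuracy(rule1, train), accuracy(rule2, train))  # 1.0 0.9
print(accuracy(rule1, test), accuracy(rule2, test))    # 0.6 0.9
```

Rule 1's extra intervals capture noise in the training sample (the lone spam at 2.40), so it fits perfectly in sample but misclassifies new spams in the 2.45–2.7 range that the simple cutoff handles correctly.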
Prediction study design
- Define your error rate
- Split data into:
- Training, Testing, Validation (optional)
- On the training set pick features
- Use cross-validation
- On the training set pick prediction function
- Use cross-validation
- If no validation
- Apply 1x to test set (i.e., the test set is used exactly once)
- If validation
- Apply to test set and refine
- Apply 1x to validation
Know the benchmarks
http://www.heritagehealthprize.com/c/hhp/leaderboard
A benchmark is a baseline against which to compare your performance.
Study design
http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf
Used by the professionals
http://www.kaggle.com/
Avoid small sample sizes
- Suppose you are predicting a binary outcome
- Diseased/healthy
- Click on ad/not click on ad
- One classifier is flipping a coin
- Probability of perfect classification is approximately:
- $\left(\frac{1}{2}\right)^{sample \; size}$
- $n = 1$ flipping coin 50% chance of 100% accuracy
- $n = 2$ flipping coin 25% chance of 100% accuracy
- $n = 10$ flipping coin 0.10% chance of 100% accuracy
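The list above is just $(1/2)^n$: with $n = 10$, $(1/2)^{10} = 1/1024 \approx 0.0977\%$, which the slide rounds to 0.10%. A one-liner to check:

```python
def p_perfect_by_chance(n):
    """Chance a coin-flip 'classifier' gets all n binary labels right."""
    return 0.5 ** n

for n in (1, 2, 10):
    print(n, p_perfect_by_chance(n))
# n=1: 0.5 (50%), n=2: 0.25 (25%), n=10: 0.0009765625 (~0.10%)
```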
Rules of thumb for prediction study design
- If you have a large sample size
- 60% training
- 20% test
- 20% validation
- If you have a medium sample size
- 60% training
- 40% testing
- If you have a small sample size
- Do cross validation
- Report caveat of small sample size
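The 60/20/20 split can be sketched in a few lines. The course would use R (e.g. caret's createDataPartition); the Python function below, its name, and the seed are illustrative.

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=1):
    """Shuffle indices 0..n-1 and split them into train/test/validation lists
    (random sampling without replacement)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

tr, te, va = split_indices(100)
print(len(tr), len(te), len(va))  # 60 20 20
```

Every index lands in exactly one set, which is the point: the test and validation sets stay untouched while you build the model.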
Some principles to remember
- Set the test/validation set aside and don't look at it
- In general randomly sample training and test
- Your data sets must reflect structure of the problem
- If predictions evolve with time, split train/test in time chunks (called backtesting in finance)
- All subsets should reflect as much diversity as possible
- Random assignment does this
- You can also try to balance by features - but this is tricky
Basic terms
In general, Positive = identified and negative = rejected. Therefore:
True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected
Medical testing example:
True positive = Sick people correctly diagnosed as sick
False positive= Healthy people incorrectly identified as sick
True negative = Healthy people correctly identified as healthy
False negative = Sick people incorrectly identified as healthy.
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
Key quantities
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
Key quantities as fractions
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
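In terms of the four confusion-matrix counts (TP, FP, TN, FN), these quantities can be computed directly. A Python sketch with hypothetical counts (the course presents these as fractions of the 2x2 table):

```python
def key_quantities(tp, fp, tn, fn):
    """Sensitivity, specificity, predictive values, and accuracy
    from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # P(test + | disease)
        "specificity": tn / (tn + fp),   # P(test - | no disease)
        "ppv":         tp / (tp + fp),   # P(disease | test +)
        "npv":         tn / (tn + fn),   # P(no disease | test -)
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts, for illustration only
q = key_quantities(tp=90, fp=10, tn=80, fn=20)
print(q["sensitivity"], q["specificity"], q["ppv"])
```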
Screening tests
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
General population
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
General population as fractions
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
At risk subpopulation
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
At risk subpopulation as fraction
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
Key public health issue
http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/
Key public health issue
For continuous data
Mean squared error (MSE):
$$\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2$$
Root mean squared error (RMSE):
$$\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}$$
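A direct translation of the two formulas (Python used for illustration; the inputs are made-up numbers):

```python
import math

def mse(pred, truth):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def rmse(pred, truth):
    """Root mean squared error, in the same units as the outcome."""
    return math.sqrt(mse(pred, truth))

pred = [2.0, 3.0, 5.0]
truth = [1.0, 3.0, 7.0]
print(mse(pred, truth), rmse(pred, truth))  # 5/3 and sqrt(5/3)
```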
Common error measures
- Mean squared error (or root mean squared error)
- Continuous data, sensitive to outliers
- Median absolute deviation
- Continuous data, often more robust
- Sensitivity (recall)
- If you want few missed positives
- Specificity
- If you want few negatives called positives
- Accuracy
- Weights false positives/negatives equally
- Concordance
- One example is kappa
- Predictive value of a positive (precision)
- When you are screening and prevalence is low
Why a curve?
- In binary classification you are predicting one of two categories
- Alive/dead
- Click on ad/don't click
- But your predictions are often quantitative
- Probability of being alive
- Prediction on a scale from 1 to 10
- The cutoff you choose gives different results. For example, with a cutoff of 0.5, an email scoring 0.4 is classified as HAM; lowering the cutoff to 0.3 flags the same email as spam.
ROC curves
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
An example
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Area under the curve
- AUC = 0.5: random guessing
- AUC = 1: perfect classifier
- In general, an AUC above 0.8 is considered "good"
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
What is good?
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
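Each cutoff gives one (false positive rate, true positive rate) point; sweeping the cutoff traces the ROC curve. The area under it equals the probability that a randomly chosen positive case outscores a randomly chosen negative case (ties counting half), which gives a simple way to compute AUC. A Python sketch with hypothetical scores:

```python
def auc(scores_pos, scores_neg):
    """AUC = P(random positive outscores random negative), ties count half.
    Equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities, for illustration
pos = [0.9, 0.8, 0.6, 0.55]   # scores of true positives
neg = [0.7, 0.4, 0.3, 0.2]    # scores of true negatives
print(auc(pos, neg))          # -> 0.875
```

Indistinguishable score distributions give AUC = 0.5, matching the "random guessing" bullet above.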
Study design
http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf
Key idea
- Accuracy on the training set (resubstitution accuracy) is optimistic
- A better estimate comes from an independent set (test set accuracy)
- But we can't use the test set when building the model or it becomes part of the training set
- So we estimate the test set accuracy with the training set.
Cross-validation
Approach:
Use the training set
Split it into training/test sets
Build a model on the training set
Evaluate on the test set
Repeat and average the estimated errors
Used for:
Picking variables to include in a model
Picking the type of prediction function to use
Picking the parameters in the prediction function
Comparing different predictors
Random subsampling
K-fold
Leave one out
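The approach above can be sketched as a small k-fold routine. This is a Python illustration, not the caret implementation; folds are kept contiguous for determinism here, though in practice you would shuffle the indices first (random sampling without replacement).

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, fit, error):
    """Average held-out error: train on k-1 folds, evaluate on the k-th."""
    errors = []
    for fold in kfold_indices(len(data), k):
        held_out = set(fold)
        test = [data[i] for i in fold]
        train = [d for i, d in enumerate(data) if i not in held_out]
        errors.append(error(fit(train), test))
    return sum(errors) / k

# Toy example: the "model" is just the training mean, the error is MSE.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fit = lambda train: sum(train) / len(train)
error = lambda m, test: sum((m - t) ** 2 for t in test) / len(test)
print(cross_validate(data, 3, fit, error))  # -> 6.25
```

Setting k = len(data) gives leave-one-out; drawing the folds by repeated random splits instead gives random subsampling.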
Considerations
- For time series data, the data must be used in "chunks"
- For k-fold cross validation
- Larger k = less bias, more variance
- Smaller k = more bias, less variance
- Random sampling must be done without replacement
- Random sampling with replacement is the bootstrap
- Underestimates of the error
- Can be corrected, but it is complicated (0.632 Bootstrap)
- If you cross-validate to pick predictors, you must estimate errors on independent data.
A successful predictor
fivethirtyeight.com
Polling data
http://www.gallup.com/
Weighting the data
http://www.fivethirtyeight.com/2010/06/pollster-ratings-v40-methodology.html
Key idea
To predict X use data related to X
Key idea
To predict player performance use data about player performance
Key idea
To predict movie preferences use data about movie preferences
Key idea
To predict hospitalizations use data about hospitalizations
Not a hard rule
To predict flu outbreaks use Google searches
http://www.google.org/flutrends/
Looser connection = harder prediction
Data properties matter
Unrelated data is the most common mistake
http://www.nejm.org/doi/full/10.1056/NEJMon1211064