Data Mining (I)

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 19:22

Learning notes on the course by Dr. Bo Yuan (THU), 《Data: Theory and Algorithm》, Part I

  • Definition: Data mining is the process of automatically extracting interesting and useful hidden patterns from data that is usually massive, incomplete, and noisy.
    It is not a fully automatic process.
    From data to intelligence.
    Data, information, knowledge, decision support
  • Classification
    Algorithms: Decision Tree, KNN, Neural Networks, SVM
    Overfitting
    Cross-validation: training data, test data
    Confusion matrix: TP (True Positive), FP (False Positive), FN (False Negative), TN (True Negative), TPR (True Positive Rate), TNR (True Negative Rate), Accuracy
    TP + FP + FN + TN = number of samples
    TPR = TP / (TP + FN), TNR = TN / (TN + FP), Accuracy = (TP + TN) / (TP + FP + FN + TN)
    ROC: Receiver Operating Characteristic
    AUC: Area Under the ROC Curve (an AUC near 1 is good; 0.5 corresponds to random guessing)
    Cost-sensitive learning
    Lift analysis
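
The confusion-matrix quantities and the AUC listed above can be computed by hand on a toy example. The following is a minimal sketch, not taken from the course: the labels, scores, 0.5 threshold, and function names are all made up for illustration, and it assumes binary labels with both classes present. The AUC is computed as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Return (TPR, TNR, accuracy); assumes both classes occur in the data."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return tpr, tnr, accuracy


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and classifier scores (higher = more positive).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold 0.5 is an arbitrary choice

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
tpr, tnr, accuracy = rates(tp, fp, fn, tn)
print(tp, fp, fn, tn, tpr, tnr, accuracy, auc(y_true, scores))
```

Moving the threshold changes TP, FP, FN, and TN (and so one point on the ROC curve), while the AUC depends only on how the scores rank the samples.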

  • Clustering
    Difference: clustering is unsupervised learning; classification is supervised learning.
    Association Rule
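
To make the unsupervised/supervised difference concrete, here is a minimal sketch of a clustering routine that receives only the data points and no labels. The notes do not name a specific clustering algorithm; k-means on one-dimensional data is used here purely as an illustrative assumption, and the points, starting centroids, and function name are made up.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data: no labels are given, only the points."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            for x in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical data with two obvious groups; the initial centroids are arbitrary.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centroids = kmeans_1d(points, centroids=[0.0, 6.0])
print(labels, centroids)
```

A classifier would instead be fitted on pairs of (point, known label); here the group memberships are discovered from the data alone.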

  • Regression
    Underfitting
    Overfitting
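
Underfitting and overfitting can be seen by comparing training error with test error. The sketch below is an illustration under stated assumptions rather than the course's own example: the data come from a made-up line y = 2x + 1 with fixed alternating noise on the training targets and noise-free test targets, and the three models (a constant predictor, a least-squares line, and a nearest-neighbour "memorizer") are chosen only to show the three behaviours.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predicts the mean target (tends to underfit)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Too flexible: returns the target of the nearest training point (overfits noise)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical data: y = 2x + 1 with alternating +/-0.5 noise on the training set.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.5, 2.5, 5.5, 6.5, 9.5, 10.5]
x_test = [0.4, 1.4, 2.4, 3.4, 4.4]
y_test = [2 * x + 1 for x in x_test]

fitters = {"constant": fit_constant, "line": fit_line, "memorizer": fit_memorizer}
errors = {}
for name, fit in fitters.items():
    model = fit(x_train, y_train)
    errors[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(name, errors[name])
```

The constant model has large error on both sets (underfitting), the memorizer has zero training error but a clearly worse test error than the line (overfitting), and the line does well on both.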

  • Data Preprocessing
    Garbage in, garbage out
    Cloud Computing
    Parallel Computing