【ML--07】机器学习知识点及其算法实现sklearn

来源：互联网发布：网络交易监管系统编辑：程序博客网时间：2024/06/03 13:10

以下10种算法是现在最流行的机器学习算法（含python代码），几乎可以解决绝大部分的问题。

1.线性回归 Linear Regression

线性回归是利用连续性变量来估计实际数值（例如房价，呼叫次数和总销售额等）。我们通过线性回归算法找出自变量和因变量间的最佳线性关系，图形上可以确定一条最佳直线。这条最佳直线就是回归线。这个回归关系可以用Y=aX+b 表示。

Python 代码：

#Import Library#Import other necessary libraries like pandas, numpy...from sklearn import linear_model#Load Train and Test datasets#Identify feature and response variable(s) and values must be numeric and numpy arraysx_train=input_variables_values_training_datasetsy_train=target_variables_values_training_datasetsx_test=input_variables_values_test_datasets# Create linear regression objectlinear = linear_model.LinearRegression()# Train the model using the training sets and check scorelinear.fit(x_train, y_train)linear.score(x_train, y_train)#Equation coefficient and Interceptprint('Coefficient: \n', linear.coef_)print('Intercept: \n', linear.intercept_)#Predict Outputpredicted= linear.predict(x_test)

2.逻辑回归 Logistic Regression

逻辑回归其实是一个分类算法而不是回归算法。通常是利用已知的自变量来预测一个离散型因变量的值（像二进制值0/1，是/否，真/假）。简单来说，它就是通过拟合一个逻辑函数来预测一个事件发生的概率。所以它预测的是一个概率值，它的输出值应该在0到1之间。

Python 代码：

#Import Libraryfrom sklearn.linear_model import LogisticRegression#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create logistic regression objectmodel = LogisticRegression()# Train the model using the training sets and check scoremodel.fit(X, y)model.score(X, y)#Equation coefficient and Interceptprint('Coefficient: \n', model.coef_)print('Intercept: \n', model.intercept_)#Predict Outputpredicted= model.predict(x_test)

3.决策树 Decision Tree

既可以运用于类别变量（categorical variables）也可以作用于连续变量。这个算法可以让我们把一个总体分为两个或多个群组。分组根据能够区分总体的最重要的特征变量/自变量进行。

Python 代码：

#Import Library#Import other necessary libraries like pandas, numpy...from sklearn import tree#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create tree objectmodel = tree.DecisionTreeClassifier(criterion='gini') # for classification, here you can change the algorithm as gini or entropy (information gain) by default it is gini# model = tree.DecisionTreeRegressor() for regression# Train the model using the training sets and check scoremodel.fit(X, y)model.score(X, y)#Predict Outputpredicted= model.predict(x_test)

4.支持向量机 SVM

给定一组训练样本，每个标记为属于两类，一个SVM训练算法建立了一个模型，分配新的实例为一类或其他类，使其成为非概率二元线性分类。一个SVM模型的例子，如在空间中的点，映射，使得所述不同的类别的例子是由一个明显的差距是尽可能宽划分的表示。新的实施例则映射到相同的空间中，并预测基于它们落在所述间隙侧上属于一个类别。

Python 代码：

#Import Libraryfrom sklearn import svm#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create SVM classification objectmodel = svm.svc() # there is various option associated with it, this is simple for classification. You can refer link, for mo# re detail.# Train the model using the training sets and check scoremodel.fit(X, y)model.score(X, y)#Predict Outputpredicted= model.predict(x_test)

5.朴素贝叶斯 Naive Bayes

朴素贝叶斯的思想基础是这样的：对于给出的待分类项，求解在此项出现的条件下各个类别出现的概率，哪个最大，就认为此待分类项属于哪个类别。

Python 代码：

#Import Libraryfrom sklearn.naive_bayes import GaussianNB#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create SVM classification object model = GaussianNB() # there is other distribution for multinomial classes like Bernoulli Naive Bayes, Refer link# Train the model using the training sets and check scoremodel.fit(X, y)#Predict Outputpredicted= model.predict(x_test)

6.K邻近算法 KNN

这个算法既可以解决分类问题，也可以用于回归问题，但工业上用于分类的情况更多。 KNN先记录所有已知数据，再利用一个距离函数，找出已知数据中距离未知事件最近的K组数据，最后按照这K组数据里最常见的类别预测该事件。

Python 代码：

#Import Libraryfrom sklearn.neighbors import KNeighborsClassifier#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create KNeighbors classifier object modelKNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5# Train the model using the training sets and check scoremodel.fit(X, y)#Predict Outputpredicted= model.predict(x_test)

7.K-均值算法 K-means

首先从n个数据对象任意选择 k 个对象作为初始聚类中心；而对于所剩下其它对象，则根据它们与这些聚类中心的相似度（距离），分别将它们分配给与其最相似的（聚类中心所代表的）聚类；然后再计算每个所获新聚类的聚类中心（该聚类中所有对象的均值）；不断重复这一过程直到标准测度函数开始收敛为止。

Python 代码：

#Import Libraryfrom sklearn.cluster import KMeans#Assumed you have, X (attributes) for training data set and x_test(attributes) of test_dataset# Create KNeighbors classifier object modelk_means = KMeans(n_clusters=3, random_state=0)# Train the model using the training sets and check scoremodel.fit(X)#Predict Outputpredicted= model.predict(x_test)

8.随机森林 Random Forest

随机森林是对决策树集合的特有名称。随机森林里我们有多个决策树（所以叫“森林”）。为了给一个新的观察值分类，根据它的特征，每一个决策树都会给出一个分类。随机森林算法选出投票最多的分类作为分类结果。

Python 代码：

#Import Libraryfrom sklearn.ensemble import RandomForestClassifier#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create Random Forest objectmodel= RandomForestClassifier()# Train the model using the training sets and check scoremodel.fit(X, y)#Predict Outputpredicted= model.predict(x_test)

9.降维算法 Dimensionality Reduction Algorithms

我们手上的数据有非常多的特征。虽然这听起来有利于建立更强大精准的模型，但它们有时候反倒也是建模中的一大难题。怎样才能从1000或2000个变量里找到最重要的变量呢？这种情况下降维算法及其他算法，如决策树，随机森林，PCA，因子分析，相关矩阵，和缺省值比例等，就能帮我们解决难题。

Python 代码：

#Import Libraryfrom sklearn import decomposition#Assumed you have training and test data set as train and test# Create PCA obeject pca= decomposition.PCA(n_components=k) #default value of k =min(n_sample, n_features)# For Factor analysis#fa= decomposition.FactorAnalysis()# Reduced the dimension of training dataset using PCAtrain_reduced = pca.fit_transform(train)#Reduced the dimension of test datasettest_reduced = pca.transform(test)

10.Gradient Boost／Adaboost算法

GBM和AdaBoost都是在有大量数据时提高预测准确度的boosting算法。Boosting是一种集成学习方法。它通过有序结合多个较弱的分类器/估测器的估计结果来提高预测准确度。这些boosting算法在DataCastke 、Kaggle等数据科学竞赛中有出色发挥。

Python 代码：

#Import Libraryfrom sklearn.ensemble import GradientBoostingClassifier#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset# Create Gradient Boosting Classifier objectmodel= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)# Train the model using the training sets and check scoremodel.fit(X, y)#Predict Outputpredicted= model.predict(x_test)

阅读全文

1 0