Machine Learning Basics: Wikipedia translations on isotonic regression, random forests, and Pipeline processing, with simple sklearn examples

Isotonic regression

In numerical analysis, isotonic regression (IR) involves finding a weighted
least-squares fit x in R^n to a vector a in R^n, with weights vector w in R^n,
subject to a set of non-contradictory constraints of the kind xi >= xj
(i.e., the components of x must respect the given order).
Such constraints define a partial or total order and can be represented as a
directed graph G = (N, E) (directed graph; N: nodes, E: the relations between nodes),
where N is the set of variables involved and E is the set of pairs (i, j),
one for each constraint xi >= xj. Thus, the IR problem corresponds to the following quadratic program (QP):
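Written out using the quantities defined above, the QP is:

\min_{x} \; \sum_{i=1}^{n} w_i (x_i - a_i)^2 \quad \text{subject to} \quad x_i \ge x_j \ \text{for all } (i, j) \in E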

Implementation code:

import numpy as np
from sklearn.utils import check_random_state
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

n = 100
x = np.arange(n)
rs = check_random_state(0)
# Noisy, roughly increasing data
y = rs.randint(-50, 50, size=(n,)) + 50. * np.log(1 + np.arange(n))

# Isotonic fit
ir = IsotonicRegression()
y_ = ir.fit_transform(x, y)

# Linear fit for comparison (LinearRegression expects a 2-D X, hence np.newaxis)
lr = LinearRegression()
lr.fit(x[:, np.newaxis], y)

# Vertical segments connecting each observation to its isotonic fit
segments = [[[i, y[i]], [i, y_[i]]] for i in range(n)]
lc = LineCollection(segments, zorder=0)
lc.set_array(np.ones(len(y)))
lc.set_linewidths(0.5 * np.ones(n))

fig = plt.figure()
plt.plot(x, y, "r.", markersize=12)
plt.plot(x, y_, "g.-", markersize=12)
plt.plot(x, lr.predict(x[:, np.newaxis]), "b-")
plt.gca().add_collection(lc)
plt.legend(("Data", "Isotonic Fit", "Linear Fit"), loc="lower right")
plt.title("Isotonic regression")
plt.show()

Random forest
Random forests (random decision forests) are an ensemble learning method
for classification, regression and other tasks. They operate by
constructing a multitude of decision trees at training time and outputting
the class that is the mode of the classes (classification) or the mean
prediction (regression) of the individual trees. Random decision forests
correct for decision trees' habit of overfitting to their training set.
(i.e., random forests correct the decision tree's tendency to overfit the training data)
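As a quick illustration of the voting described above (my own sketch, not part of the original post), using sklearn's RandomForestClassifier on the built-in iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# n_estimators is the number of trees; the predicted class is the mode
# (majority vote) of the individual trees' predictions.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))        # majority vote over the trees
print(clf.predict_proba(X[:5]))  # class probabilities averaged over the trees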

Decision tree

Decision tree learning uses a decision tree as a predictive model which
maps observations about an item to conclusions about the item's target
value. It is one of the predictive modeling approaches used in statistics,
data mining and machine learning. Tree models where the target variable
can take a finite set of values are called classification trees; in these
tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees
where the target variable can take continuous values (typically real
numbers) are called regression trees.
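A brief sketch of the two kinds of trees in sklearn (added here for illustration, not part of the original translation):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target takes a finite set of class labels
clf = DecisionTreeClassifier(max_depth=3)
clf.fit([[0], [1], [2], [3]], ["a", "a", "b", "b"])
print(clf.predict([[2.5]]))   # predicts a class label

# Regression tree: the target takes continuous (real) values
reg = DecisionTreeRegressor(max_depth=3)
reg.fit([[0], [1], [2], [3]], [0.0, 0.1, 0.9, 1.0])
print(reg.predict([[2.5]]))   # prediction is the mean target value in the reached leaf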

In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree
describes data but not decisions; rather, the resulting classification tree
can be an input for decision making. This page deals with decision trees in
data mining.

Bootstrap aggregating (bagging)

Given a standard training set D of size n, bagging generates m new training
sets Di, each of size n', by sampling from D uniformly and with replacement.
By sampling with replacement, some observations may be repeated in each Di;
if n' = n, then for large n each Di is expected to contain the fraction
(1 - 1/e) (about 63.2%) of the unique examples of D, the rest being
duplicates. This kind of sample is known as a bootstrap sample. The m models
are fitted using the above m bootstrap samples and combined by averaging the
output (for regression) or voting (for classification).
(i.e., fit a model on each resampled set, then combine the models by voting or averaging)
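The (1 - 1/e) ≈ 63.2% figure is easy to check numerically; a small sketch with numpy (my addition):

import numpy as np

rng = np.random.RandomState(0)
n = 100000

# One bootstrap sample: draw n indices from D uniformly, with replacement
bootstrap_idx = rng.randint(0, n, size=n)
unique_fraction = len(np.unique(bootstrap_idx)) / float(n)

print(unique_fraction)   # roughly 0.632
print(1 - 1 / np.e)      # 0.63212...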

From bagging to random forests

The above procedure describes the original bagging algorithm for trees.
Random forests differ in only one way from this general scheme: they use a
modified tree learning algorithm that selects, at each candidate split in
the learning process, a random subset of the features. This process is
sometimes called "feature bagging".
(i.e., randomly select a subset of the features at each split, on top of bagging)
The reason for doing this is the correlation of the trees in an ordinary
bootstrap sample: if one or a few features are very strong predictors for
the response variable (target output), these features will be selected in
many of the B trees, causing them to become correlated. An analysis of how
bagging and random subspace projection contribute to accuracy gains under
different conditions is given by Ho.
(random forests: combine random sampling of the examples with random selection of feature subsets)
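In scikit-learn this feature subsampling is controlled by the max_features parameter of the forest estimators; a hedged sketch of how it is set (my addition, not from the post):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True resamples the rows (bagging); max_features limits each
# split to a random subset of the columns ("feature bagging").
clf = RandomForestClassifier(n_estimators=100,
                             max_features="sqrt",   # consider ~sqrt(n_features) per split
                             bootstrap=True,
                             random_state=0)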

The basic idea of a regression tree is to use a decision tree to partition the data set
(the partition is based only on the features), fit a regression on each resulting subset,
and average the regression results to obtain the solution.
This extends naturally to the random forest setting.

sklearn.ensemble::RandomForestRegressor
The n_estimators parameter specifies the number of trees used in the random forest.

numpy.random::shuffle shuffles an array in place.

sklearn.preprocessing::Imputer (imputation of missing values)

Below is a random forest program comparing results with and without imputation:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer

rng = np.random.RandomState(0)

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Score on the full, untouched data set
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

# Mark 75% of the samples as containing a missing value
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=np.bool),
                             np.ones(n_missing_samples, dtype=np.bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Score after simply dropping the samples with missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Score after imputing the missing values (encoded here as 0) with the column mean
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0, strategy="mean", axis=0)),
                      ("forest", RandomForestRegressor(random_state=0, n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)




The conclusion here is that using imputation generally gives better results.

matplotlib.pyplot::figure
The figsize argument specifies the width and height of the figure.

plt.axes([.2, .2, .7, .7])
Specifies the rectangular region of the figure occupied by the axes, as fractions of the figure size ([left, bottom, width, height]).

plt.clf()
clear the current figure.

np.logspace(start, end, num = 50)
Returns num values spaced evenly on a logarithmic scale: 10 raised to each value of the corresponding linspace(start, end, num) sequence.
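For example (added for clarity):

import numpy as np

print(np.logspace(-4, 4, 3))   # [1.e-04 1.e+00 1.e+04], i.e. 10 ** linspace(-4, 4, 3)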

Logistic regression has a penalty term similar to the one in support vector machines, penalizing the norm of the parameter vector.
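Concretely, the L2-penalized objective that sklearn's LogisticRegression minimizes is roughly (following the scikit-learn user guide; shown here only to make the analogy explicit):

\min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^T w + c)\right)\right)

where C is the inverse regularization strength, playing the same role as the C parameter of an SVM (smaller C means a stronger penalty on the weights).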

Solving with a grid search:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

# Pipeline: PCA dimensionality reduction followed by logistic regression
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Plot the explained variance of each PCA component
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis("tight")
plt.xlabel("n_components")
plt.ylabel("explained_variance_")

# Grid search over the number of PCA components and the logistic C parameter
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components, logistic__C=Cs))
estimator.fit(X_digits, y_digits)

# Mark the chosen number of components on the explained-variance plot
plt.axvline(estimator.best_estimator_.named_steps["pca"].n_components,
            linestyle=":", label="n_components chosen")
plt.legend(prop=dict(size=12))
plt.show()