Python Machine Learning: Chapter 2


2 Training Machine Learning Algorithms for Classification

Classification algorithms covered: the perceptron and adaptive linear neurons (Adaline).

The topics that we will cover in this chapter are as follows:

• Building an intuition for machine learning algorithms
• Using pandas, NumPy, and matplotlib to read in, process, and visualize data
• Implementing linear classification algorithms in Python

Artificial neurons

The chapter opens with a short history of machine learning and explains how Rosenblatt's perceptron rule works.
Rosenblatt's perceptron is built on the MCP (McCulloch-Pitts) neuron model. Skimming a few pages, the chapter also gives a brief refresher on the necessary linear algebra, so it is well suited to beginners.

Implementing a perceptron learning algorithm in Python

Implementation Code

import numpy as np

class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    errors_ : list
        Number of misclassifications in every epoch.
    """
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
        """
        self.w_ = np.zeros(1 + X.shape[1])  # weights after fitting (extra slot for the threshold)
        self.errors_ = []                   # number of misclassifications per epoch
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi  # update the feature weights
                self.w_[0] += update        # update the threshold (bias) weight
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]  # dot product plus bias

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)

Given a learning rate eta and a number of iterations n_iter, the code above initializes our perceptron. The fit method initializes the weight vector to an (m+1)-dimensional zero vector, where m is the number of features; the extra entry is for the zero-weight, i.e. the threshold.
After the weights are initialized, fit loops over every individual training sample and updates the weights. The predict method returns the predicted class label.
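Restating the rule the book derives (nothing beyond what the code implements): each inner-loop update in fit applies the perceptron learning rule

$w_j := w_j + \Delta w_j, \qquad \Delta w_j = \eta \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)}$

where $\hat{y}^{(i)}$ is the label returned by predict for sample $x^{(i)}$. Whenever the prediction is correct, the update is zero and the weights stay unchanged.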

Training a perceptron model on the Iris dataset

We consider only two features, sepal length and petal length, and only two flower species: Iris-setosa and Iris-versicolor.

First, use pandas to load the Iris dataset directly:

>>> import pandas as pd
>>> # Load the data directly from the UCI Machine Learning Repository into a
>>> # DataFrame (one of pandas' two main data types, the other being Series).
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                  'machine-learning-databases/iris/iris.data', header=None)
>>> df.tail()  # print the last five rows

Extract the first 100 class labels (corresponding to the 50 Iris-setosa and 50 Iris-versicolor flowers).
Convert the class labels to 1 (Versicolor) and -1 (Setosa) and store them in a vector y, where the values attribute of a pandas DataFrame yields the corresponding NumPy representation.
Then extract the first 100 training samples into a feature matrix X and visualize them as a two-dimensional scatter plot:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> y = df.iloc[0:100, 4].values              # first 100 class labels
>>> y = np.where(y == 'Iris-setosa', -1, 1)   # -1 for Iris-setosa, 1 otherwise
>>> X = df.iloc[0:100, [0, 2]].values         # first 100 training samples (sepal length, petal length)
>>> plt.scatter(X[:50, 0], X[:50, 1],         # start plotting
...             color='red', marker='o', label='setosa')
>>> plt.scatter(X[50:100, 0], X[50:100, 1],
...             color='blue', marker='x', label='versicolor')
>>> plt.xlabel('sepal length')
>>> plt.ylabel('petal length')
>>> plt.legend(loc='upper left')
>>> plt.show()

Training:

>>> ppn = Perceptron(eta=0.1, n_iter=10)  # initialize with a learning rate and number of epochs
>>> ppn.fit(X, y)                         # fit on the features and labels
>>> plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_,
...          marker='o')  # plot the misclassification errors per epoch to check that the
...                       # algorithm converges and finds a decision boundary
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Number of misclassifications')
>>> plt.show()

Visualizing the decision surface:

First, a small helper function:

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

Now plot the regions:

>>> plot_decision_regions(X, y, classifier=ppn)
>>> plt.xlabel('sepal length [cm]')
>>> plt.ylabel('petal length [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()

Adaptive linear neurons and the convergence of learning

Another type of single-layer neural network:
the ADAptive LInear NEuron (Adaline).

Unlike Rosenblatt's perceptron, the weights are updated based on a linear activation function rather than a unit step function. In Adaline, the linear activation function is simply the identity function of the net input:

$\phi(w^T x) = w^T x$

The linear activation function is used to learn the weights, while a quantizer, which works just like the unit step function, is then used to predict the class labels (illustrated with a figure in the book).

Compared with the perceptron algorithm, the difference is that the output of the linear activation function is a continuous value rather than a binary class label.
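A minimal sketch of that distinction (the weight vector w and the sample xi below are made up purely for illustration, with w[0] playing the role of the threshold weight as in the Perceptron class above):

import numpy as np

w = np.array([-0.5, 0.4, 0.3])   # hypothetical weights; w[0] is the threshold weight
xi = np.array([5.1, 1.4])        # hypothetical sample (sepal length, petal length)

z = np.dot(xi, w[1:]) + w[0]     # net input
activation = z                   # Adaline: continuous value, used to learn the weights
label = 1 if z >= 0.0 else -1    # quantizer (unit step), used only for the final class label
print(activation, label)         # 1.96 and 1 for these made-up numbers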

Minimizing cost functions with gradient descent

A key ingredient of supervised machine learning is defining an objective function that is optimized during the learning process. This objective function is usually a cost function; for Adaline, we define the cost function J as the sum of squared errors (SSE) between the outputs and the true class labels:

$J(w) = \frac{1}{2}\sum_i \left(y^{(i)} - \phi(z^{(i)})\right)^2$

The factor $\frac{1}{2}$ is added only to make the derivative more convenient. Compared with the unit step function, the advantage of the linear activation function is that the cost function becomes differentiable and convex, so we can use a simple yet powerful optimization algorithm, gradient descent, to find the weights that minimize the cost function.

As the book's figure illustrates, the principle of gradient descent can be described as climbing down a hill until a local or global minimum is reached. In each iteration we take a step away from the gradient, with a step size determined by the learning rate and the slope of the gradient.

Using gradient descent, we update the weights with the following rule:

$w := w + \Delta w$

The weight change $\Delta w$ is defined as the negative gradient multiplied by the learning rate $\eta$:

$\Delta w = -\eta \nabla J(w)$

Computing the gradient of the cost function (skipped in the original notes):
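Filling in the step the notes skip: differentiating the SSE cost function with respect to a weight $w_j$ gives

$\frac{\partial J}{\partial w_j} = -\sum_i \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)}$

so the update for each weight becomes

$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)}$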

Adaline's learning rule looks very similar to the perceptron's, but its output $\phi(z^{(i)})$ is a real number rather than an integer class label. In addition, the weight update is computed from all samples in the training set, instead of updating after each individual sample, which is why this approach is also referred to as "batch" gradient descent.

Implementing an Adaptive Linear Neuron in Python

Since Adaline is very similar to the perceptron, we reuse the perceptron implementation and only change the fit method so that the weights are updated by gradient descent:

class AdalineGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost function value in every epoch.
    """
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # matrix-vector multiplication: the gradient is computed
            # over the whole training set at once
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)

Choosing a good learning rate:

>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
>>> ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
>>> ax[0].plot(range(1, len(ada1.cost_) + 1),
...            np.log10(ada1.cost_), marker='o')
>>> ax[0].set_xlabel('Epochs')
>>> ax[0].set_ylabel('log(Sum-squared-error)')
>>> ax[0].set_title('Adaline - Learning rate 0.01')
>>> ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
>>> ax[1].plot(range(1, len(ada2.cost_) + 1),
...            ada2.cost_, marker='o')
>>> ax[1].set_xlabel('Epochs')
>>> ax[1].set_ylabel('Sum-squared-error')
>>> ax[1].set_title('Adaline - Learning rate 0.0001')
>>> plt.show()

A feature scaling method called standardization gives each feature column a mean of 0 and a standard deviation of 1. For example, to standardize the j-th feature we simply subtract the sample mean $\mu_j$ and divide by the standard deviation $\sigma_j$:

$x'_j = \frac{x_j - \mu_j}{\sigma_j}$

Standardization is easy to implement with NumPy's mean and std methods:

>>> X_std = np.copy(X)
>>> X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
>>> X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

After standardization, train Adaline again with a learning rate of 0.01 and check whether it converges:

>>> ada = AdalineGD(n_iter=15, eta=0.01)
>>> ada.fit(X_std, y)
>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Sum-squared-error')
>>> plt.show()

As the plots show, Adaline now converges. Note, however, that the SSE remains non-zero even though all samples are classified correctly.
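A quick way to check this claim interactively (a sketch that assumes the ada, X_std, and y objects defined above are still in scope):

>>> (ada.predict(X_std) == y).all()   # expected True: every training sample is classified correctly
>>> ada.cost_[-1]                     # expected to still be greater than zero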

Large scale machine learning and stochastic gradient descent

The approach above re-evaluates the model only after computing the gradient over the entire training set and then repeats, which wastes time and memory as the dataset grows large. So we introduce stochastic gradient descent, also called iterative or on-line gradient descent.

Instead, the weights are updated incrementally for each individual training sample:

$\Delta w = \eta\left(y^{(i)} - \phi(z^{(i)})\right)x^{(i)}$

Although stochastic gradient descent can be viewed as an approximation of gradient descent, it typically converges faster simply because the weights are updated more frequently. Since each gradient is computed from a single training sample, the error surface is noisier than with batch gradient descent, which can also make it easier to escape shallow local minima. To obtain good results, the training data must be presented in random order, so we shuffle it every epoch.
Another advantage is that it can be used for on-line learning: the system adapts immediately to changes, and the training data can be discarded after the model has been updated, which saves memory.

We modify the code once more to support on-line learning:

from numpy.random import seed

class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Average cost per epoch.
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent cycles.
    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.
    """
    def __init__(self, eta=0.01, n_iter=10,
                 shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        if random_state:
            seed(random_state)

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        # random permutation of the indices 0..len(y)-1, used to shuffle the
        # feature matrix and the class label vector in unison
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.w_ = np.zeros(1 + m)
        self.w_initialized = True

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)

Plot the results:

>>> ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
>>> ada.fit(X_std, y)
>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Stochastic Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Average Cost')
>>> plt.show()

The average cost goes down pretty quickly, and the final decision boundary after 15 epochs looks similar to that of the batch gradient descent Adaline. If we want to update the model, for example in an on-line learning scenario with streaming data, we can simply call the partial_fit method on individual samples, for instance `ada.partial_fit(X_std[0, :], y[0])`.
