Statistical Learning Notes (3): The k-Nearest Neighbor Algorithm

Source: Internet | Published by: 枪花和涅槃知乎 | Editor: 程序博客网 | Time: 2024/06/03 17:46

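Algorithm Description

The input to the k-nearest neighbor (k-NN) algorithm is the feature vector of an instance, which corresponds to a point in feature space; the output is the class of the instance. The method assumes a training data set in which the class of every instance is known. A new instance is classified by a vote among the classes of its k nearest training instances.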

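3.1 The k-Nearest Neighbor Algorithm

Algorithm 3.1

Input: a training data set T and the feature vector x of the instance to classify,
where the training data set is
$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$
Here $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ is the feature vector of an instance, $y_i \in \mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$ is its class, and $i = 1, 2, \ldots, N$. The instance to classify is $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})$, where $x^{(l)}$ is the l-th component of the feature vector and n is the number of components.
Output: the class y of instance x.
(1) Using the given distance metric, find the k points in the training set T that are nearest to x. The neighborhood of x that covers these k points is denoted $N_k(x)$.
(2) Within $N_k(x)$, determine the class y of x according to the classification decision rule (for example, majority voting):
$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, K$
where I is the indicator function: $I(y_i = c_j) = 1$ if $y_i = c_j$, and $I(y_i = c_j) = 0$ if $y_i \neq c_j$.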
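
As an illustration of Algorithm 3.1, the two steps can be sketched as a brute-force search followed by a majority vote. This is only a minimal sketch: the function name `knn_classify`, the toy data and the choice of Euclidean distance are assumptions made for the example, not part of the original notes.

```python
from collections import Counter

import numpy as np


def knn_classify(train_x, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    x = np.asarray(x, dtype=float)
    # Step (1): Euclidean distance from x to every training instance,
    # then the indices of the k nearest ones, i.e. N_k(x).
    dists = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # Step (2): majority vote, i.e. the class with the largest indicator sum.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points of class 'a', two of class 'b'.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9)]
train_y = ['a', 'a', 'a', 'b', 'b']
print(knn_classify(train_x, train_y, (1.1, 1.0), k=3))  # a
```

The brute-force scan costs one distance computation per training instance, which is what the kd tree in section 3.3 is meant to reduce.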

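3.2 The k-Nearest Neighbor Model

3.2.1 Model

3.2.2 Distance

The distance between two instances in feature space reflects how similar they are. The feature space of the k-NN model is usually the n-dimensional real vector space $\mathbb{R}^n$. The Euclidean distance is the usual choice, but the more general $L_p$ distance (Minkowski distance) can also be used.
Let the feature space $\mathcal{X}$ be the n-dimensional real vector space $\mathbb{R}^n$, and let $x_i, x_j \in \mathcal{X}$ with $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})$. For $p \geq 1$, the $L_p$ distance between $x_i$ and $x_j$ is defined as

$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^p \right)^{1/p}$

When p = 2, it is called the Euclidean distance:

$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^2 \right)^{1/2}$

When p = 1, it is called the Manhattan distance:

$L_1(x_i, x_j) = \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|$

When $p \to \infty$, it is the maximum of the absolute coordinate differences:

$L_\infty(x_i, x_j) = \max_{l} |x_i^{(l)} - x_j^{(l)}|, \quad l = 1, 2, \ldots, n$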
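
The three special cases can be checked numerically with a small helper. The function name `lp_distance` and the sample points are assumptions chosen for illustration; the body is a direct transcription of the $L_p$ definition, with $p = \infty$ handled as the maximum coordinate difference.

```python
import numpy as np


def lp_distance(xi, xj, p=2.0):
    """L_p (Minkowski) distance between two vectors, for p >= 1 or p = inf."""
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if np.isinf(p):
        # Limit p -> infinity: the largest absolute coordinate difference.
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))


a, b = (1, 1), (4, 5)              # coordinate differences are 3 and 4
print(lp_distance(a, b, 1))        # 7.0  (Manhattan)
print(lp_distance(a, b, 2))        # 5.0  (Euclidean)
print(lp_distance(a, b, np.inf))   # 4.0  (maximum difference)
```

Note that the nearest neighbor of a point can change with p, so the choice of metric is part of the model.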

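3.2.3 Choosing the Value of k

Cross-validation is usually used to select the best value of k. A small k makes the prediction sensitive to noise in the nearest points (prone to overfitting), while a large k gives a simpler model but lets distant, less similar instances influence the prediction.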
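
One possible way to carry out this selection is leave-one-out cross-validation, sketched below. The helper names `loo_accuracy` and `choose_k`, the candidate values of k and the synthetic two-cluster data are all assumptions for the example; the notes do not prescribe a particular cross-validation scheme.

```python
from collections import Counter

import numpy as np


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of a k-NN majority-vote classifier."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all instances.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)   # an instance must not vote for itself
    correct = 0
    for i in range(len(X)):
        nearest = np.argsort(d[i])[:k]
        pred = Counter(y[j] for j in nearest).most_common(1)[0][0]
        correct += int(pred == y[i])
    return correct / len(X)


def choose_k(X, y, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the highest leave-one-out accuracy."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best = max(scores, key=scores.get)   # ties go to the first candidate
    return best, scores


# Synthetic data: two Gaussian clusters with 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (30, 2))])
y = [0] * 30 + [1] * 30
best_k, scores = choose_k(X, y)
print(best_k, scores)
```

Odd candidate values are used here so that a two-class vote cannot end in a tie.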

3.2.4 Classification Decision Rule

The formulation here is very mathematical, so I…

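3.3 Implementing k-NN: the kd Tree

3.3.1 Constructing a kd Tree

Algorithm 3.2 Constructing a balanced kd tree

Input: a data set $T = \{x_1, x_2, \ldots, x_N\}$ in k-dimensional space, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(k)})$.
Output: a kd tree.
(1) Start by constructing the root node, which corresponds to the hyperrectangular region of k-dimensional space that contains T.
Choose a coordinate axis $x^{(l)}$ and take the median of the $x^{(l)}$ coordinates of all instances in T as the splitting point; this splits the hyperrectangular region into two subregions. (The textbook rule cycles through the axes with depth, using $l = j \bmod k + 1$ for a node at depth j; the code later in this post picks the axis with the largest variance instead.)
Generate the left and right child nodes of the root, at depth 1: the left child corresponds to the subregion in which every point has an $x^{(l)}$ coordinate smaller than that of the splitting point, and the right child to the subregion in which every point has an $x^{(l)}$ coordinate greater than or equal to that of the splitting point.
Store the instance points that fall on the splitting hyperplane in the root node.
(2) Repeat step (1) recursively on each subregion until neither subregion contains any instances.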
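
The construction can be sketched compactly with the textbook axis-cycling rule. The `Node` class, the function name `build_kd_tree` and the six sample points are assumptions for illustration; this sketch stores one point per node and, when several points share the median coordinate, does not guarantee which side they end up on.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                     # instance stored at this node
    axis: int                        # index of the splitting coordinate
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_kd_tree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Recursively build a balanced kd tree by splitting at the median."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                 # cycle through the axes with depth
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2              # (upper) median along the chosen axis
    return Node(point=pts[mid], axis=axis,
                left=build_kd_tree(pts[:mid], depth + 1),
                right=build_kd_tree(pts[mid + 1:], depth + 1))


# A small 2-D sample set, used here purely as an illustration.
data = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build_kd_tree(data)
print(root.point, root.left.point, root.right.point)  # (7, 2) (5, 4) (9, 6)
```

Sorting at every level keeps the sketch short at the cost of extra work; the author's code further down uses `np.argpartition` to find the median without a full sort.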

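3.3.2 Searching a kd Tree

Algorithm 3.3 Nearest-neighbor search with a kd tree

Input: a constructed kd tree; a target point x.
Output: the nearest neighbor of x.
(1) Find the leaf node (region) of the kd tree that contains the target point: starting from the root, recursively visit the child nodes. If the coordinate of the target point x in the current splitting dimension is smaller than that of the splitting point, move to the left child; otherwise move to the right child. Continue until a leaf node is reached.
(2) Take this leaf node as the "current nearest point".
(3) Recursively back up the tree, and at each node do the following:
(a) If the instance stored at the node is closer to the target point than the current nearest point, make that instance the current nearest point.
(b) The current nearest point must lie in the region of one of the node's children. Check whether the region of the other child of that child's parent contains a closer point. Specifically, check whether the region of the other child intersects the hypersphere centered at the target point whose radius is the distance between the target point and the "current nearest point".
If they intersect, the region of the other child may contain a point closer to the target, so move to the other child and recursively continue the nearest-neighbor search there;
if they do not intersect, back up.
(4) The search ends when it has backed up to the root. The final "current nearest point" is the nearest neighbor of x.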
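
The search steps above can be sketched recursively: the descent happens on the way down the call stack and the back-up happens as the calls return. The names `build` and `nearest`, the dict-based node layout and the sample data are assumptions for this example; the intersection test in step (3)(b) is simplified to comparing the distance from the target to the splitting hyperplane with the current best distance.

```python
import math


def build(points, depth=0):
    """Build a kd tree as nested dicts, cycling the split axis with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build(pts[:mid], depth + 1),
            "right": build(pts[mid + 1:], depth + 1)}


def nearest(node, target, best=None):
    """Return (point, distance) for the nearest neighbor of target."""
    if node is None:
        return best
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    # Step (1): descend first into the side that contains the target.
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Step (3)(a): on the way back up, compare this node's own point.
    d = math.dist(node["point"], target)
    if best is None or d < best[1]:
        best = (node["point"], d)
    # Step (3)(b): the other side can only hold a closer point if the
    # hypersphere around the target crosses the splitting hyperplane.
    if abs(diff) < best[1]:
        best = nearest(far, target, best)
    return best


data = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(data)
print(nearest(tree, (3, 4.5))[0])  # (2, 3)
```

`math.dist` requires Python 3.8 or later. Extending the sketch from one nearest neighbor to k neighbors would mean keeping a bounded collection of the k best candidates and pruning against the worst of them.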

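Code

The following code was tested with Python 3.

(1) Building the KD tree

Here is the plot first. It is quite interesting.
[Figure: KD tree generated with Python]
The input data is a set of 2-dimensional vectors. The tree construction is also written to handle higher dimensions, although the plotting and the scope bookkeeping assume 2-D data.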

import copy
import math

import numpy as np
import matplotlib.pyplot as plt

# X,  feature vectors
# Y,  class of X
# D,  dimension of each of the vectors.

# Construct the initial data to be classified
D = 2
NUM = 50
C = ['g', 'r', 'b']
# X = np.array([(3, 5), (2, 4), (1, 1), (5, 2), (1, 5), (4, 1)])
X = np.random.rand(NUM, D)
Y = [C[i] for i in np.random.randint(0, len(C), NUM)]


class KD_Node:
    cur_trav = None             # cursor for traversal (shared by the class)
    x_min = 0                   # plot bounds; adjust them if the data
    x_max = 1                   # does not lie in the unit square
    y_min = 0
    y_max = 1

    def __init__(self, point=None, split=None, color=None,
                 L=None, R=None, father=None, scope=None):
        """
        Initialise a kd-tree node.
        point:  datum of this node
        split:  index of the splitting dimension for this node
        color:  color used when plotting the point
        L:      left son
        R:      right son
        father: father of this node, None for the root
        scope:  region of the plane covered by this node
        """
        self.point = point
        self.split = split
        self.color = color
        self.left = L
        self.right = R
        self.father = father
        self.flag_trav = 0      # traversal flag:
                                #   bit 0 marks the node itself
                                #   bit 1 marks its left son
                                #   bit 2 marks its right son
        # paint scope: x0/x1 are min/max of x, y0/y1 are min/max of y
        self.scope = scope if scope is not None else {}

    def clear_trav(self):
        KD_Node.cur_trav = None
        self.flag_trav = 0
        if self.left:
            self.left.clear_trav()
        if self.right:
            self.right.clear_trav()

    def __iter__(self):
        self.clear_trav()       # reset the flags so the tree can be iterated again
        return self

    def __next__(self):
        # Traverse the tree without recursion, driven by the per-node flags.
        if KD_Node.cur_trav is None:        # first call: start at this node
            KD_Node.cur_trav = self
        cursor = KD_Node.cur_trav
        while True:
            if (cursor.flag_trav & 0x07) == 0x07:
                # all three bits set (value 7): this subtree is finished
                if cursor.father is None:
                    raise StopIteration
                cursor = cursor.father
            elif (cursor.flag_trav & 0x01) == 0:    # node itself not returned yet
                cursor.flag_trav |= 0x01
                break                               # return the current node
            elif (cursor.flag_trav & 0x02) == 0:    # left subtree not visited yet
                cursor.flag_trav |= 0x02
                if cursor.left is not None:
                    cursor = cursor.left
            elif (cursor.flag_trav & 0x04) == 0:    # right subtree not visited yet
                cursor.flag_trav |= 0x04
                if cursor.right is not None:
                    cursor = cursor.right
        KD_Node.cur_trav = cursor
        return cursor


def CreateKDT(node=None, data=None, color=None, father=None):
    """
    Recursively build a kd tree.
    INPUT:  node,   placeholder for the node being built (may be None)
            data,   array of points, e.g. np.array([(3,5), (2,4), (1,1)])
            color,  unused; every node gets a random color from C
            father, the parent node (None for the root)
    OUTPUT: the root KD_Node of the (sub)tree, or node unchanged if data is empty
    """
    if data is None or len(data) == 0:
        return node

    var = np.var(data, axis=0)              # variance of each dimension
    split = int(np.argmax(var))             # split on the dimension with the largest variance
    pos = get_split_pos(data, split)
    pos_list = np.argpartition(data[:, split], pos)
    point = data[pos_list[pos]]             # median point, stored in this node
    color = C[np.random.randint(0, len(C))]

    if father is None:
        # The root covers the whole plotting area.
        cur_scope = {'x0': KD_Node.x_min, 'x1': KD_Node.x_max,
                     'y0': KD_Node.y_min, 'y1': KD_Node.y_max}
    else:
        # A child covers the half of its father's scope that it lies in.
        cur_scope = copy.deepcopy(father.scope)
        if father.split == 0:
            if point[0] < father.point[0]:
                cur_scope['x1'] = father.point[0]
            else:
                cur_scope['x0'] = father.point[0]
        elif father.split == 1:
            if point[1] < father.point[1]:
                cur_scope['y1'] = father.point[1]
            else:
                cur_scope['y0'] = father.point[1]

    node = KD_Node(point=point, split=split, color=color, father=father,
                   scope=cur_scope)
    left_data = data[pos_list[:pos]]
    right_data = data[pos_list[(pos + 1):]]
    if len(left_data) != 0:
        node.left = CreateKDT(node=node.left, data=left_data,
                              color=color, father=node)
    if len(right_data) != 0:
        node.right = CreateKDT(node=node.right, data=right_data,
                               color=color, father=node)
    return node


def get_split_pos(data, split):
    """Return the index of the median position in data (split is unused)."""
    return len(data) // 2


def preorder(node, depth=-1):
    """
    Print the nodes of a KD tree in preorder.
    """
    if node:
        print(node.point, node.split)
        if node.left:
            preorder(node.left)
        if node.right:
            preorder(node.right)


def draw_KDT(kd):
    """
    Plot every data point and draw the splitting line of each node.
    """
    x_min = kd.x_min
    x_max = kd.x_max
    y_min = kd.y_min
    y_max = kd.y_max

    plt.figure(figsize=(6, 6))
    plt.xlabel("$x^{(1)}$")
    plt.ylabel("$x^{(2)}$")
    plt.title("Machine Learning: KD Tree")
    plt.xlim(int(x_min), math.ceil(x_max))
    plt.ylim(int(y_min), math.ceil(y_max))
    ax = plt.gca()
    ax.set_aspect(1)

    # outer bounding box
    plt.plot([x_min, x_max, x_max, x_min, x_min],
             [y_min, y_min, y_max, y_max, y_min])

    line_from = []              # end points of the splitting line
    line_to = []
    for node in kd:
        if node.split == 0:     # vertical line, limited by the y scope
            line_from = [node.point[0], node.scope['y0']]
            line_to = [node.point[0], node.scope['y1']]
        if node.split == 1:     # horizontal line, limited by the x scope
            line_from = [node.scope['x0'], node.point[1]]
            line_to = [node.scope['x1'], node.point[1]]
        plt.plot([line_from[0], line_to[0]],
                 [line_from[1], line_to[1]],
                 'k-', linewidth=1)
        plt.scatter(node.point[0], node.point[1], color=node.color)
    plt.show()


def find_knn(root, x):
    """Placeholder: the nearest-neighbor search is not implemented here."""
    pass


def main():
    kd = CreateKDT(None, X)
    draw_KDT(kd)


if __name__ == "__main__":
    main()

References:
[1] http://blog.csdn.net/u010551621/article/details/44813299

0 0