Machine Learning in Action Notes (1): the kNN (k-Nearest Neighbor) Algorithm

Source: Internet | Published by: 教师培训课程大数据 | Editor: 程序博客网 | Time: 2024/06/08 08:19

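Overview

The kNN algorithm (k-nearest neighbors) is one of the foundational classification algorithms in machine learning, and also one of the simplest. Its idea and principle are not complicated, but it is computationally expensive: both its time complexity and its space complexity are fairly high. The book illustrates it with a dating site and a handwritten digit recognition system. I start from the same two examples here, but I have modified parts of the code so that it runs under Python 3.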

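Algorithm description

The k in kNN means that, when a new data point is compared against the sample data, only the k most similar samples are kept.
For every point in the data set whose class is unknown, kNN performs the following steps in turn:

Compute the distance between the current point and every point in the labeled data set (Euclidean distance: $d = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$);
Sort the points by increasing distance;
Select the k points closest to the current point;
Determine how often each class occurs among those k points;
Return the most frequent class among the k points as the predicted class of the current point.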
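The five steps above can be sketched in plain Python before any numpy is involved. This is only an illustrative sketch: the function name `knn_predict` and the toy points are made up for this example and do not come from the book.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    # Step 1: Euclidean distance from the query to every labeled point.
    distances = [math.dist(query, p) for p in points]
    # Steps 2-3: sort indices by increasing distance and keep the first k.
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # Steps 4-5: count label frequencies and return the most common one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: three points in one corner, two far away.
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0), (1.2, 0.9)]
labels = ["near", "near", "far", "far", "near"]

print(knn_predict((1.1, 1.5), points, labels, k=3))
print(knn_predict((5.5, 8.5), points, labels, k=3))
```

`math.dist` requires Python 3.8 or later. With k=3 the second query still gets two "far" votes against one "near" vote, which shows how the majority vote absorbs a single distant neighbor.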

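Characteristics

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity and high space complexity.
Works with: numeric and nominal values. The data must come with target values, i.e. manually assigned labels. A label can be encoded in the file name or stored as a column inside the file.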

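General workflow

Collect the data.
Prepare the data: it must be in a structured format; any consistent format of your own will do.
Analyze the data.
Train the algorithm: this step does not apply to kNN, but it is listed anyway to keep the general workflow complete.
Test the algorithm: compute the error rate.
Use the algorithm: first feed in the sample data and the structured output, then run kNN to decide which class each data point belongs to, and finally carry out any follow-up processing on the classified results.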

Modules

from numpy import *     # numpy matrix and array processing
import operator         # provides itemgetter for sorted()'s 'key' parameter
from os import listdir  # used to list the files in a folder

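Collecting and parsing the data

Taking a text file as an example, extract the matrix data (two-dimensional in most cases) and the label information from it.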

def file2matrix(filename):
    """
    txt file data change to matrix
    @param filename: filename
    @return: the read_matrix and the labels
    """
    with open(filename, mode='r') as fr:
        array_lines = fr.readlines()
    number_of_lines = len(array_lines)
    return_mat = zeros((number_of_lines, 3))
    class_label_vector = []
    index = 0
    for line in array_lines:
        line = line.strip()
        list_from_line = line.split('\t')
        return_mat[index, :] = list_from_line[0:3]
        class_label_vector.append(int(list_from_line[-1]))
        index += 1
    return return_mat, class_label_vector
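As an alternative to the manual loop, the same parsing can probably be done with `numpy.loadtxt`. The sketch below assumes a tab-separated file whose last column is an integer label; the helper name `load_tab_file` is hypothetical, and the two rows written to the temporary file are taken from the sample data shown later in these notes.

```python
import os
import tempfile
import numpy as np

def load_tab_file(filename, n_features=3):
    """Load a tab-separated file: n_features numeric columns plus an integer label."""
    # ndmin=2 keeps the result two-dimensional even for a single-row file.
    data = np.loadtxt(filename, delimiter="\t", ndmin=2)
    features = data[:, :n_features]
    labels = data[:, -1].astype(int).tolist()
    return features, labels

# Write two sample rows to a temporary file so the sketch runs without the real data set.
rows = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(rows)
    path = tmp.name

features, labels = load_tab_file(path)
os.remove(path)
print(features.shape, labels)
```

The explicit loop in `file2matrix` is easier to adapt when rows are irregular. `loadtxt` is shorter when every row has the same numeric layout.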

Core algorithm implementation

def classify0(inX, dataset, labels, k):
    """
    knn classify
    @param inX: the input vector which is ready to be classified
    @param dataset: the training data set
    @param labels: labels vector
    @param k: the number of nearest neighbors that vote
    @return: the most frequent label among the k nearest neighbors
    """
    dataset_size = dataset.shape[0]     # calculate the number of lines
    diff_mat = tile(inX, (dataset_size, 1)) - dataset
    sq_diff_mat = diff_mat**2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances**0.5
    sorted_distance_indices = distances.argsort()
    class_count = {}
    for i in range(k):
        vote_i_label = labels[sorted_distance_indices[i]]
        class_count[vote_i_label] = class_count.get(vote_i_label, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
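The `tile()` call in `classify0` exists only to make the shapes match, and numpy broadcasting can do that implicitly. Below is a hedged sketch of the same voting scheme written that way. The names `classify_broadcast` and `create_dataset` are assumptions: the test script at the end of these notes calls a `create_dataset()` that is not listed here, so the toy version below is only a stand-in.

```python
import numpy as np
from collections import Counter

def create_dataset():
    """Hypothetical toy data set: four 2-D points with two labels."""
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ["A", "A", "B", "B"]
    return group, labels

def classify_broadcast(in_x, dataset, labels, k):
    """Same voting scheme as classify0, using broadcasting instead of tile()."""
    # (n, d) - (d,) broadcasts the query vector across every row of the data set.
    diff = dataset - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group, labels = create_dataset()
print(classify_broadcast([0, 0], group, labels, 3))
```

For the query `[0, 0]` the three nearest points are the two "B" points and one "A" point, so the vote comes out as "B".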

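Example 1: classifying dating data

Our heroine Helen wants to build a model from data she has labeled herself, so that she can judge how charming and attractive the men she meets in the future will be to her.
Let's first look at part of the formatted data she has labeled: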

40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
72993   10.141740   1.032955    1
35948   6.830792    1.213192    3
42666   13.276369   0.543880    3
67497   8.631577    0.749278    1
35483   12.273169   1.508053    3
50242   3.723498    0.831917    1
63275   8.385879    1.669485    1
5569    4.875435    0.728658    2
51052   4.680098    0.625224    1

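From left to right, the columns are the frequent flier miles earned per year, the percentage of time spent playing video games, the liters of ice cream consumed per week, and finally the label (1 to 3 mean dislike, like, and like very much). P.S. It looks like playing video games for a while is actually quite popular~
The data can be plotted as an "ice cream vs. game time" chart:
(figure: dataset distribution figure)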

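Preparing the data: normalization

Normalization means restricting the data to a well-defined range. Here we need to rescale every feature to the range [0, 1] so that the later processing is easier. The code is as follows: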

def autonorm(dataset):
    """
    dataset normalization
    @param dataset: np.array
    @return: dataset after norm, ranges, minimal value
    """
    min_val = dataset.min(0)
    max_val = dataset.max(0)
    ranges = max_val - min_val
    norm_dataset = zeros(shape(dataset))
    m = dataset.shape[0]
    norm_dataset = dataset - tile(min_val, (m, 1))
    norm_dataset = norm_dataset/tile(ranges, (m, 1))
    return norm_dataset, ranges, min_val
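The formula behind `autonorm` is `new_value = (old_value - min) / (max - min)`, applied column by column. Without it, the flight-miles column (values in the tens of thousands) would dominate the Euclidean distance over the other two columns (values around 0 to 15). The sketch below is one possible broadcasting version; the name `min_max_normalize` and the handling of constant columns are assumptions added for illustration, not part of the book's code.

```python
import numpy as np

def min_max_normalize(dataset):
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    min_val = dataset.min(axis=0)
    ranges = dataset.max(axis=0) - min_val
    # Assumption: a constant column (range 0) is mapped to 0 instead of dividing by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (dataset - min_val) / safe_ranges, ranges, min_val

# Three rows taken from the sample data shown above.
data = np.array([[40920.0, 8.326976, 0.953952],
                 [14488.0, 7.153469, 1.673904],
                 [26052.0, 1.441871, 0.805124]])
norm, ranges, min_val = min_max_normalize(data)
print(norm.min(axis=0), norm.max(axis=0))
```

A new sample has to be rescaled with the same `min_val` and `ranges` that were computed from the training data, which is what `(in_arr - min_val)/ranges` does in `classify_person`.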

Testing the algorithm

Write the test code for this example:

def dating_class_test():
    """
    dating data test and see the error ratio
    @return: the output on screen which shows the result and the error rate
    """
    ho_ratio = 0.1  # the ratio of test data
    dating_data_mat, dating_labels = file2matrix('datingTestSet2.txt')
    norm_mat, ranges, min_val = autonorm(dating_data_mat)
    m = norm_mat.shape[0]
    num_test_vec = int(m*ho_ratio)
    error_count = 0.0
    for i in range(num_test_vec):
        # rows 0:num_test_vec (the small part) are used as the test set and
        # rows num_test_vec:m (the large part) are used as the training set
        classify_result = classify0(norm_mat[i, :], norm_mat[num_test_vec:m, :], dating_labels[num_test_vec:m], 3)
        print("the classifier came back with: %d, the real answer is %d" % (classify_result, dating_labels[i]))
        if classify_result != dating_labels[i]:
            error_count += 1.0
    print("the total error rate is: %f%%" % (error_count/float(num_test_vec)*100.0))
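`dating_class_test` always holds out the first 10% of the rows, which is fine as long as the file is not sorted in any particular order. If that cannot be assumed, the rows can be shuffled first. The following is a rough sketch of such a hold-out test on made-up data; `knn`, `holdout_error_rate`, the seed, and the two synthetic clusters are all assumptions introduced for this example.

```python
import numpy as np
from collections import Counter

def knn(in_x, dataset, labels, k):
    """Majority vote among the k nearest rows of dataset."""
    distances = np.sqrt(((dataset - in_x) ** 2).sum(axis=1))
    votes = Counter(labels[i] for i in distances.argsort()[:k])
    return votes.most_common(1)[0][0]

def holdout_error_rate(data, labels, ho_ratio=0.1, k=3, seed=0):
    """Shuffle the rows, hold out ho_ratio of them for testing, return the error rate."""
    order = np.random.default_rng(seed).permutation(len(data))
    data = data[order]
    labels = [labels[i] for i in order]
    num_test = int(len(data) * ho_ratio)
    if num_test == 0:
        raise ValueError("ho_ratio is too small for this data set")
    # The first num_test shuffled rows are the test set, the rest is the training set.
    train_data, train_labels = data[num_test:], labels[num_test:]
    errors = sum(
        knn(data[i], train_data, train_labels, k) != labels[i]
        for i in range(num_test)
    )
    return errors / num_test

# Made-up data: two clusters of 20 points each, 10 units apart on the second axis.
xs = np.arange(20) * 0.1
data = np.vstack([np.column_stack([xs, np.zeros(20)]),
                  np.column_stack([xs, np.full(20, 10.0)])])
labels = [1] * 20 + [2] * 20
rate = holdout_error_rate(data, labels)
print("error rate: %f" % rate)
```

Because the two clusters are far apart compared with their width, every held-out point finds its three nearest neighbors in its own cluster, so the error rate on this toy data is 0. Real data such as the dating set will generally not be that clean.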

The output of the code looks like this:
(figure: result1_1)

Using the algorithm

Now apply the algorithm in a concrete setting: judge how attractive a person is to Helen based on his three features.

def classify_person():
    """
    Test the charm of a person to you
    @return: print the result
    """
    result_list = ['not at all', 'in small doses', 'in large doses']
    percent_games = float(input('Percentage of time spent playing video games: '))
    length_miles = float(input('Frequent flier miles earned per year: '))
    ice_cream = float(input('Liters of ice cream consumed per year: '))
    dating_data_mat, dating_labels = file2matrix('datingTestSet2.txt')
    norm_mat, ranges, min_val = autonorm(dating_data_mat)
    in_arr = array([length_miles, percent_games, ice_cream])
    class_fier_result = classify0((in_arr - min_val)/ranges, norm_mat, dating_labels, 3)
    print('You will probably like this person: ', result_list[class_fier_result-1])

The result of running the algorithm looks like this:
(figure: result1_2)

So, do you think you could win Helen's heart too? (grin...)

Example 2: a handwriting recognition system

kNN is used to classify 32*32 data like the sample shown in the figure below:
(figure: test_num)

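Preparing the data: converting images to test vectors

Here each 32*32 image has to be converted into a 1*1024 vector. There are two ways to do it: the first is the book's method, which concatenates the rows one after another in a loop; the second is to call numpy's flatten() method directly.

The book's method is used here: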

def img2vector(filename):
    """
    change the 32*32 image matrix to 1*1024 array
    @param filename: the data set filename
    @return: the 1*1024 array
    """
    return_vect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            line_str = fr.readline()
            for j in range(32):
                return_vect[0, 32*i+j] = int(line_str[j])
    return return_vect
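The flatten() alternative mentioned above might look roughly like the sketch below. It assumes the file holds 32 lines of 32 characters, each '0' or '1'. The name `img2vector_flat` and the temporary test image are made up for this example.

```python
import os
import tempfile
import numpy as np

def img2vector_flat(filename):
    """Read a 32x32 text image of '0'/'1' characters into a 1x1024 array."""
    with open(filename) as fr:
        rows = [[int(ch) for ch in line.strip()] for line in fr if line.strip()]
    grid = np.array(rows, dtype=float)      # shape (32, 32)
    # flatten() walks the grid row by row, so element (i, j) lands at index 32*i + j.
    return grid.flatten().reshape(1, -1)    # shape (1, 1024)

# Made-up 32x32 image: the first row is all ones, every other row is all zeros.
lines = ["1" * 32] + ["0" * 32] * 31
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
    path = tmp.name

vec = img2vector_flat(path)
os.remove(path)
print(vec.shape, vec.sum())
```

Both versions produce the same element order, because flatten() uses row-major order by default, matching the `32*i+j` indexing in the loop.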

Testing the algorithm: recognizing handwritten digits with kNN

def handwriting_class_test():
    """
    handwriting test
    @return: screen output
    """
    hw_labels = []
    training_file_list = listdir('trainingDigits')
    m = len(training_file_list)
    training_mat = zeros((m, 1024))
    for i in range(m):
        file_name_str = training_file_list[i]
        file_str = file_name_str.split('.')[0]
        class_num_str = int(file_str.split('_')[0])
        hw_labels.append(class_num_str)
        training_mat[i, :] = img2vector('trainingDigits/%s' % file_name_str)
    test_file_list = listdir('testDigits')
    error_count = 0.0
    m_test = len(test_file_list)
    for i in range(m_test):
        file_name_str = test_file_list[i]
        file_str = file_name_str.split('.')[0]
        class_num_str = int(file_str.split('_')[0])
        # the test vectors must be read from testDigits, not from trainingDigits
        vector_under_test = img2vector('testDigits/%s' % file_name_str)
        classifier_result = classify0(vector_under_test, training_mat, hw_labels, 3)
        print('the classifier came back with: %d, the real answer is: %d' % (classifier_result, class_num_str))
        if classifier_result != class_num_str:
            error_count += 1.0
    print('\nthe total number of errors is %d' % error_count)
    print('\nthe total error rate is %f' % (error_count/float(m_test)))
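In the test function the true label is not stored inside the file but recovered from the file name: the part before the underscore is the digit. A small helper makes that step explicit. The name `label_from_filename` and the example file name are illustrative assumptions that follow the `digit_index.txt` pattern implied by the two `split()` calls.

```python
import os

def label_from_filename(file_name):
    """Extract the digit label from a name such as '9_45.txt'."""
    # Drop any directory part and the extension, then take the text before '_'.
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return int(stem.split("_")[0])

print(label_from_filename("trainingDigits/9_45.txt"))
```

This prints 9. Keeping the parsing in one place avoids repeating the two `split()` calls in both the training loop and the test loop.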

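The output looks like this:
(figure: result2)

Changing the amount of data and the value of k changes the error rate. In my test, reducing k brought the error rate down to 0.0%.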
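A 0.0% error rate deserves a second look. If the vectors being classified are also present in the training matrix, then with k=1 every vector's nearest neighbor is itself, and the error is zero by construction (assuming no two identical vectors carry different labels). The sketch below demonstrates this on made-up one-dimensional data with deliberately interleaved labels. All names in it are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn(in_x, dataset, labels, k):
    """Majority vote among the k nearest rows of dataset."""
    distances = np.sqrt(((dataset - in_x) ** 2).sum(axis=1))
    votes = Counter(labels[i] for i in distances.argsort()[:k])
    return votes.most_common(1)[0][0]

def error_rate(test_data, test_labels, train_data, train_labels, k):
    """Fraction of test rows whose predicted label differs from the true label."""
    errors = sum(
        knn(x, train_data, train_labels, k) != y
        for x, y in zip(test_data, test_labels)
    )
    return errors / len(test_labels)

# Made-up data: six points on a line whose labels alternate.
train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
train_labels = [0, 1, 0, 1, 0, 1]

# Scoring the classifier on its own training data for two values of k.
for k in (1, 3):
    print(k, error_rate(train, train_labels, train, train_labels, k))
```

With k=1 the error is exactly 0 even though the labels alternate, while k=3 produces errors on the same data. A meaningful error rate therefore needs test vectors that are not part of the training matrix.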

Test code

# coding=utf-8
"""knn algorithm test file"""
import kNN
import numpy as np
import matplotlib.pyplot as plt

# create_dataset() is not listed in these notes; it is assumed to return
# a small toy data set and its labels
group, labels = kNN.create_dataset()
print(kNN.classify0([0, 0], group, labels, 3))

dating_data_mat, dating_labels = kNN.file2matrix('datingTestSet2.txt')
print(dating_data_mat)
print(dating_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
# ax.scatter(dating_data_mat[:, 1], dating_data_mat[:, 2], 10*np.array(dating_labels), 10*np.array(dating_labels))
type1_x = []
type1_y = []
type2_x = []
type2_y = []
type3_x = []
type3_y = []
for i in range(len(dating_labels)):
    if dating_labels[i] == 1:   # unlike
        type1_x.append(dating_data_mat[i][1])
        type1_y.append(dating_data_mat[i][2])
    if dating_labels[i] == 2:   # like
        type2_x.append(dating_data_mat[i][1])
        type2_y.append(dating_data_mat[i][2])
    if dating_labels[i] == 3:   # very like
        type3_x.append(dating_data_mat[i][1])
        type3_y.append(dating_data_mat[i][2])
type1 = ax.scatter(type1_x, type1_y, s=20)
type2 = ax.scatter(type2_x, type2_y, s=30)
type3 = ax.scatter(type3_x, type3_y, s=40)
# one legend call with explicit handles; a second bare plt.legend() would replace it
ax.legend((type1, type2, type3), ('unlike', 'like', 'very_like'))
plt.xlabel('the Percentage of Playing Games')
plt.ylabel('the Cost of Ice-Creams per Week')
plt.title('the Data Set Distribution Figure')
plt.show()

norm_mat, ranges, min_val = kNN.autonorm(dating_data_mat)
print(norm_mat)
print(ranges)
print(min_val)

# kNN.dating_class_test()
# kNN.classify_person()
# kNN.handwriting_class_test()

Updated irregularly; to be continued...
