Learning k for kNN Classification -- Paper Notes


I. Download Link

Download link for "Learning k for kNN Classification"

II. Basic Information

(1) Source

ACM Transactions on Intelligent Systems and Technology, Vol. 8, No. 3, Article 43, Publication date: January 2017.

(2) Abstract

1. Shortcomings of the original kNN

a.
The same k value is used for every test data point, which is unrealistic in practical applications.
For example, consider a classification problem with two class labels:
[Figure: two-class example illustrating why a single fixed k is unsuitable]

b. Contribution stated in the last paragraph of Section 2
Previous studies focused on classification, regression, and missing value imputation separately. The proposed CM-kNN (Correlation Matrix kNN) handles all three application scenarios simultaneously.

2. Contributions of this paper

  The method makes full use of the prior knowledge inherent in the training data: the relations between data points, the removal of noisy data, and the preservation of the local structure of the data.
a. Learn a correlation matrix that reconstructs each test data point from the training data, so that different test data points are trained with different k values
b. Use a least-squares loss function to minimize the error of each predicted data point
http://blog.csdn.net/xierhacker/article/details/53257748
c. Use a Laplacian (graph Laplacian) regularizer to preserve the local structure of the data during reconstruction
http://blog.csdn.net/wsj998689aa/article/details/40303561
d. Apply l1-norm and l2,1-norm regularization to learn a different k value for each data point and to eliminate redundant features during reconstruction
http://blog.sina.com.cn/s/blog_71dad3ef010146c3.html
http://blog.csdn.net/u012162613/article/details/44261657

(3) Contribution stated in the last paragraph of Section 2

From the above three research directions and publications, previous studies on kNN methods always separately focused on classification, regression, and missing value imputation. In this article, we study a kNN approach for taking into account the drawbacks of the conventional kNN method, such as the fixed k value in the kNN method, the removal of noisy data points, and the preservation of the local structures of data. In particular, we apply the proposed kNN approach to simultaneously conduct classification, regression, and missing value imputation.

III. Key Points to Understand

(1)

kNN classification has at least two open issues to be addressed [Zhang 2010; Zhu et al. 2007], that is, the similarity measurement between two data points and the selection of the k value.

(2)

The common conclusion of the first issue is that different applications need different distance measurements [Qin et al. 2007; Zhang et al. 2006; Zhu et al. 2011]

(3)

In the proposed reconstruction process, we advocate an l1-norm regularizer to result in element-wise sparsity [Luo et al. 2014; Liu et al. 2015; Ye and Li 2016; Li and Pang 2009] to generate different k values for different test data points.

(4)

Then we use an l2,1-norm regularizer to generate the row sparsity to remove the impact of noisy data points [Yang et al. 2012; Zhu et al. 2013a, 2013b, 2014].

(5)

we employ a Locality Preserving Projection (LPP) [Niyogi 2004] regularizer (that is, a graph Laplacian regularizer) to preserve the local structure of training data in the reconstruction process.
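
To make the two sparsity patterns concrete, here is a small numpy sketch (the matrix values are made up for illustration) computing the l1-norm, which drives element-wise sparsity, and the l2,1-norm, which is the sum of the l2-norms of the rows and drives whole rows of W to zero:

```python
import numpy as np

# Toy weight matrix: rows = training points, columns = test points.
# Values are invented purely for illustration.
W = np.array([
    [0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0],   # an all-zero row: this training point is used by no test point
    [0.0, 0.3, 0.0],
    [0.7, 0.0, 0.5],
    [0.2, 0.1, 0.0],
])

l1_norm  = np.abs(W).sum()                  # ||W||_1: penalizing it yields element-wise sparsity
l21_norm = np.linalg.norm(W, axis=1).sum()  # ||W||_{2,1}: sum of row-wise l2 norms, yields row sparsity

print("||W||_1   =", l1_norm)
print("||W||_2,1 =", l21_norm)
```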

IV. Organization of the Paper

1.

Section 2 briefly recalls the reports on the kNN method from research areas of classification, regression, and missing value imputation.

2.

Then the CM-kNN classification method is described in Section 3.

3.

The proposed approach is evaluated by conducting sets of experiments in Section 4.

4.

This research is concluded in Section 5.

V. Section 2 - Related Work

(1) kNN Classification

  Under ideal conditions, as the dataset size approaches infinity, kNN achieves excellent performance. However, the performance of a kNN classifier is easily affected by the choice of the k value and the choice of distance metric. In recent years there has been much research addressing these issues, for example:

1.

the kNN incorporating Certainty Factor (kNN-CF) classification method can incorporate the certainty factor measure into the conventional kNN method so it can be applied to the beginning of the kNN classification to meet the need of imbalanced learning [Zhang 2010]. Moreover, the kNN-CF classification method can be easily extended to the dataset with skewed class distribution.

2.

Song et al. proposed two novel kNN approaches, that is, Locally Informative-kNN and Globally Informative-kNN, respectively, via designing new measure metrics for selecting a subset of the most informative data point from neighborhoods [Song et al. 2007].

3.

Vincent and Bengio modified the conventional kNN method to be the K-local Hyperplane Distance Nearest Neighbor (HkNN) method, which applied the collection of 15–70 nearest neighbors from each class to span a linear subspace for that class, followed by conducting classification based on distance to the linear subspaces [Vincent and Bengio 2001].

4.

Wang proposed a new measure to define the similarity between two data points using the number of neighborhoods for conducting a new kNN classifier [Wang 2006].

5.

Zhang et al. [2016] proposed a novel k Nearest Neighbor algorithm, which is based on sample self-representation, sparse learning, and the technology of decision tree.

6.

Sun et al. [2015] studied a new type of query based on the k-Nearest Neighbor temporal aggregate, which organizes the locations by integrating the spatial and temporal aggregate information.

7.

Tang et al. [2011] studied a new type of query that finds the k Nearest Neighboring Trajectories (k-NNT) with the minimum aggregated distance to a set of query points.

(2) kNN Missing Value Imputation

1.

Zhang et al. proposed a Grey-Based kNN Iteration Imputation method [Zhang et al. 2007], which efficiently reduced the time complexity and got over the slow convergence rate of the classical missing value imputation method, that is, the EM (Expectation Maximization) algorithm.

2.

Based on nearest-neighbor imputation, Chen and Shao proposed some jackknife variance estimators, which are asymptotically unbiased and consistent for the example means [Chen and Shao 2001].

3.

Meesad and Hengpraprohm proposed an imputation method combining kNN-based feature selection with kNN-based imputation. Differing from the conventional kNN method, their method first conducts feature selection, and then estimates missing values [Meesad and Hengpraprohm 2008].

4.

García-Laencina et al. proposed to employ mutual information to design a feature-weighted distance metric for conducting kNN [García-Laencina et al. 2009].

5.

Most recently, Zhang proposed a Shell Neighbors imputation method to select the left and right nearest neighbors of missing data for imputing missing data [Zhang 2011].

(3) kNN Regression

  In machine learning and data mining, the conventional kNN algorithm typically has the following drawbacks: low efficiency and ignoring feature weights in the distance computation. To address these issues, related work includes the following:

1.

Hamed et al. proposed an interval regression method based on the conventional kNN method by taking advantage of the possibility distribution to choose the value of k of kNN method due to the limited example size [Hamed et al. 2012].

2.

Based on the observation that conventional kNN regression is sensitive to the selection of similarity metric, Yao proposed a general kNN framework to infer the similarity metric as a weighted combination of a set of base similarity measures [Yao and Ruzzo 2006].

3.

Navot et al. proposed a new nearest neighbor method to capture complete dependency of the target function [Navot et al. 2006].

VI. CM-kNN (Correlation Matrix kNN)

(1) Notation

[Figure: notation table omitted]

(2) Reconstruction (notes on the objective function and key points)

1. Regularization terms (a sketch of the combined objective follows this list)

1. Term R1(W): prevents the model from overfitting; the l1-norm is used instead of the l2-norm as the regularization term.
The l1-norm has been proved to lead to the optimal sparse W* [Zhu et al. 2016a; Wang et al. 2014; Zhang et al. 2011].
The corresponding objective function is also called the Least Absolute Shrinkage and Selection Operator (LASSO) [Zhu et al. 2013a, 2014; Dong et al. 2015a]. It can generate element-wise sparsity in the optimal W*, that is, irregular sparsity in the elements of the matrix W*.
2. Term R2(W): removes noisy data points from the training data; the l2,1-norm is used.
Consider the l2,1-norm regularization term: it leads the reconstruction process to generate sparseness through the whole rows of W, that is, row sparsity for short [Zhu et al. 2016b; Chen et al. 2016; Li et al. 2016].
http://blog.csdn.net/jzwong/article/details/50700361
3. Term R3(W): an LPP regularization term that preserves the local structure of the features and avoids the loss of structure when the original data are projected into a new space by a dimensionality-reduction operation.
In addition, we employ a Locality Preserving Projection (LPP) [Niyogi 2004] regularizer (that is, a graph Laplacian regularizer) to preserve the local structure of training data in the reconstruction process.
The goal of LPP is to ensure that the k nearest neighbors of the original data are correspondingly preserved in the new space after conducting dimensionality reduction.
This article considers holding the local consistency of the structures of the data during the reconstruction process, in particular preserving the local consistency of the structures of the features in the data points [Shi et al. 2013].
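
For reference, a plausible written-out form of the objective that these three terms enter (my own reconstruction from the descriptions above; the exact Eq. (6) in the paper may differ in constants and in how the Laplacian term is written):

$$
\min_{W}\ \lVert XW - Y\rVert_F^2 \;+\; \rho_1\lVert W\rVert_1 \;+\; \rho_2\lVert W\rVert_{2,1} \;+\; \rho_3\, R_{\mathrm{LPP}}(W)
$$

where the columns of $X \in \mathbb{R}^{d\times n}$ are the n training points, the columns of $Y \in \mathbb{R}^{d\times m}$ are the m test points, $W \in \mathbb{R}^{n\times m}$ is the correlation (reconstruction) matrix, and $R_{\mathrm{LPP}}(W)$ is a graph-Laplacian trace term whose exact form is given in Eq. (6) of the paper.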

2. Example

Suppose that by solving Eq. (6) we obtain the following optimal solution W*:
[Figure: example of an optimal W* omitted]
W* is a 5 x 3 matrix: the 5 rows correspond to the 5 training data points and the 3 columns correspond to the 3 test data points.
1.
For example, to predict test point 1 we look at the first column of W* and find that test point 1 is related only to training points 4 and 5, so we set k = 2 for test point 1.
To predict test point 2 we look at the second column of W* and find that test point 2 is related only to training points 1, 3, and 5, so we set k = 3 for test point 2.
2.
The second row of W* is all zeros, which means that training point 2 is unrelated to every test point, so we regard training point 2 as a noisy data point (see the sketch below).
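
A minimal numpy sketch of reading the per-test-point k values and the noisy training points off an optimal W*; the matrix below mimics the structure described above (the nonzero values themselves are made up):

```python
import numpy as np

# Hypothetical optimal W*: 5 training points (rows) x 3 test points (columns).
W_star = np.array([
    [0.0, 0.6, 0.3],
    [0.0, 0.0, 0.0],   # row 2 is all zeros -> training point 2 is treated as noise
    [0.0, 0.2, 0.0],
    [0.5, 0.0, 0.4],
    [0.4, 0.2, 0.0],
])

# k for each test point = number of training points with a nonzero weight in its column.
k_per_test_point = np.count_nonzero(W_star, axis=0)
print("k per test point:", k_per_test_point)          # [2 3 2]

# Training points whose entire row is zero contribute to no test point -> noisy points.
noisy_training_points = np.where(~W_star.any(axis=1))[0]
print("noisy training point indices:", noisy_training_points)  # [1] (0-based)
```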

3. Summary of the three regularization terms in Eq. (6)

The l1-norm ensures that zero entries appear in W*; the l2,1-norm ensures that we can remove the impact of noisy data; and the LPP term ensures that we can further improve the performance of the kNN algorithm.

(3)ALGORITHM 1

[Figure: Algorithm 1 pseudocode omitted]

1. Initial value of W

  The paper does not explicitly state the initial value of W, but it does point out that W decreases and converges during the iterations, so I think W can be initialized at its maximum, that is, with every element of W set to 1 (my guess is that the elements of W take values in [-1, 1]).

2. Values of the symbols

[Figure: values taken by each symbol in Algorithm 1 omitted]

3. The Laplacian matrix L

a. First round

Suppose there are two test points (a, b) and four training points (A, B, C, D). In the first round, assume that a and b are each connected by an edge to every one of A, B, C, and D; the first-round Laplacian matrix L can then be constructed.

[Figures: the first-round graph and its Laplacian matrix omitted]

b. From the second round to the third round

Using the W matrix obtained in the second round of the iterative algorithm, we get an updated graph between the test points and the training points, and this graph is used to construct L for the current round (see the sketch below).
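
Following the note's interpretation above (the paper's own graph construction may differ), a minimal numpy sketch of building the Laplacian used in each round: in round one every test point is connected to every training point, and afterwards an edge exists wherever the current W has a nonzero weight; with adjacency A and degree matrix D, L = D - A:

```python
import numpy as np

def laplacian_from_W(W, first_round=False):
    """Build a graph Laplacian over (n training + m test) points from W.

    first_round=True: fully connect every test point to every training point
    (the assumption used in round one above); otherwise an edge exists wherever
    the current W has a nonzero weight.
    """
    n, m = W.shape
    A = np.zeros((n + m, n + m))
    edges = np.ones_like(W) if first_round else (W != 0).astype(float)
    A[:n, n:] = edges           # training -> test edges
    A[n:, :n] = edges.T         # symmetric counterpart
    D = np.diag(A.sum(axis=1))  # degree matrix
    return D - A                # unnormalized graph Laplacian

W = np.array([[0.0, 0.6], [0.5, 0.0], [0.4, 0.2], [0.0, 0.0]])  # 4 training x 2 test (made up)
L_round1 = laplacian_from_W(W, first_round=True)
L_next   = laplacian_from_W(W)   # Laplacian for the following iteration
print(L_next.shape)              # (6, 6)
```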

(4) ALGORITHM 2: The Pseudo-code of the CM-kNN Algorithm

[Figure: Algorithm 2 pseudocode omitted]

VII. Experiments

(1) Datasets

[Figure: summary table of the datasets omitted]

Dataset sources:

Most of the datasets come from the UCI (University of California Irvine) repository and the LIBSVM (A Library for Support Vector Machines) website.

Variety of datasets:

low-dimensional and high-dimensional datasets; binary and multi-class datasets; imbalanced datasets (Climate dataset: 46 positive vs. 494 negative; German dataset: 700 positive vs. 300 negative)

(2) Evaluating CM-kNN with 10-fold cross-validation

  The dataset is first randomly split into 10 folds; one fold is used as the test set and the remaining 9 folds as the training set, and the classification result of this run is recorded. To reduce possible error, the whole procedure is repeated 10 times and the final result is the average over the 10 runs.
a. Performance measure for classification
  Classification accuracy: the higher the accuracy, the better the performance.
b. Performance measures for regression and missing value imputation
  The correlation coefficient (between predicted and observed values) and the root mean square error (RMSE). A larger correlation coefficient and a smaller RMSE mean more accurate predictions and better performance (a protocol sketch follows).
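
CM-kNN itself is not implemented here; as a rough illustration of the evaluation protocol only, a minimal sketch using scikit-learn's KFold with a plain kNN classifier as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold serves once as the test set,
# the other 9 folds as the training set.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kf.split(X):
    clf = KNeighborsClassifier(n_neighbors=5)   # stand-in for CM-kNN
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("mean accuracy over 10 folds:", np.mean(accuracies))
# The paper repeats the whole 10-fold procedure 10 times and averages the results.
# For regression / imputation, the analogous metrics would be
# RMSE = sqrt(mean_squared_error(y_true, y_pred)) and np.corrcoef(y_true, y_pred)[0, 1].
```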

(3) kNN variants compared against CM-kNN

1. Standard kNN

  In the experiments, k is set to 5.

2. CV-kNN

  An improved version of the standard kNN: cross-validation is used to determine the parameter k (k = 1, 2, ..., 10). A minimal sketch of this selection follows.
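
A minimal sketch of choosing k by cross-validation in the spirit of CV-kNN, using scikit-learn's GridSearchCV (the exact validation settings in the paper are not reproduced here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search k in {1, ..., 10} by cross-validation, as CV-kNN does.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=10,
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```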

3. L-kNN (unlike CM-kNN, it does not account for noisy data points)

  This corresponds to Eq. (6) with ρ2 = 0.

4. LL-kNN (unlike CM-kNN, it does not consider the local consistency of the data structure)

  This corresponds to Eq. (6) with ρ3 = 0.

5. AD-kNN

  AD-kNN integrates salient features of the kNN approach and adaptive kernel methods for conducting probability density estimation. Following the literature, we set the parameter k of AD-kNN with the Monte Carlo validation method by setting the maximum number of neighbors as 20 [Sahigara et al. 2014].
6. LMNN

  LMNN learns a Mahalanobis distance metric for k nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin [Weinberger and Saul 2006].
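
The semidefinite program that LMNN solves is not reproduced here; the sketch below only shows how a learned Mahalanobis matrix M would plug into kNN prediction once it is available. The function name, the toy data, and the placeholder M = I are all illustrative assumptions:

```python
import numpy as np
from collections import Counter

def mahalanobis_knn_predict(X_train, y_train, x_query, M, k=3):
    """Classify x_query by majority vote among its k nearest neighbors
    under the Mahalanobis distance d(x, y) = sqrt((x - y)^T M (x - y))."""
    diffs = X_train - x_query
    dists = np.sqrt(np.einsum("ij,jk,ik->i", diffs, M, diffs))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
y_train = (X_train[:, 0] > 0).astype(int)
M = np.eye(2)                      # placeholder: LMNN would learn this matrix
print(mahalanobis_knn_predict(X_train, y_train, np.array([0.5, 0.0]), M))
```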

(4) Analysis of the experimental results

1. Classification

[Figure: classification accuracy results omitted]
a.
  CM-kNN achieves higher accuracy than the other methods on every dataset.
b.
  CM-kNN outperforms L-kNN because CM-kNN uses the l2,1 regularization term to remove noisy points. For example, on the Gisette and DDCclients datasets, CM-kNN's accuracy is 3.6% and 2.9% higher than L-kNN's, respectively. This also indicates that both Gisette and DDCclients contain noisy data, and that Gisette contains more than DDCclients.
c.
  CM-kNN outperforms LL-kNN because CM-kNN uses the LPP regularization term to preserve the local consistency of the data structure. For example, on the Gisette and Australian datasets, CM-kNN's accuracy is 3.5% and 3.2% higher than LL-kNN's, respectively.
d.
  Compared with the standard kNN, the methods that use different k values {CM-kNN, CV-kNN, LL-kNN, AD-kNN, LMNN} all achieve better classification performance, which shows that using different k values in kNN is feasible.

2. Regression and missing value imputation

[Figure: regression and missing value imputation results omitted]

VIII. Summary

  1. By making full use of the prior knowledge in the data, CM-kNN can learn a specific k value for each data point.
  2. It is robust to noise.
  3. Compared with existing kNN algorithms, CM-kNN achieves both high accuracy and high efficiency in classification, regression, and missing value imputation.

IX. Future Work

The authors hope to design a nonlinear transformation matrix to learn the relations between the test data and the training data.
