K邻近算法

来源：互联网发布：吴昕淘宝店叫什么名字编辑：程序博客网时间：2024/04/20 19:29

K-Nearest Neighbours是机器学习中最基本的聚类算法，它属于监督学习领域，用于模式识别，数据挖掘以及干扰检测等领域。

因为其不需要参数，所以在实际应用场景中被广泛应用，对于数据的分布也不需要做任何假设（例如高斯分布就是相反的例子）。

给定一些数据（也称为训练数据），它们根据自身属性的坐标做了分类。

例如，下表中的数据点包含两个特征：

现在已知另外一组数据点（测试数据），根据对训练集的分析结果对这些数据点进行聚类。注意没有被分类的点为黄色。

直观

如果我们将这些点画出来，我们就可以确定一些集簇。现在已知一些没有聚类的点，我们可以通过观察将它们归类到距离它们最近的类中。也就是说，一个点与一个红色的点的距离近，那么它也很可能被分类为红色。

直观地讲，点 (2.5, 7) 应该分类到蓝色的类中，点(5.5, 4.5)应该分到红色得类中。

算法

设m是训练数据样本的数目，p是一个未知的点。

将训练样本保存在一个数据点数组arr[]中，这就是说这个数组中的每一个元素都可表示为一个二元组(x, y)。
for i=0 to m: Calculate Euclidean distance d(arr[i], p).
得到集合S中K个最小的距离，数组中每个元素都与这些距离都相关。
返回S中主要的标签。

// C++ program to find groups of unknown// Points using K nearest neighbour algorithm.#include <bits/stdc++.h>using namespace std;struct Point{    int val;        // Group of point    int x, y;       // Co-ordinate of point    double distance;  // Distance from test point};bool comparison(Point a, Point b){     return (a.distance < b.distance);}// This function finds classification of point p using// k nearest neighbour algorithm. It assumes only two// groups and returns 0 if p belongs to group 0, else// 1 (belongs to group 1).int classifyAPoint(Point arr[], int n, int k, Point p){    // Fill distances of all points from p    for (int i = 0; i < n; i++)        arr[i].distance =            sqrt((arr[i].x - p.x) * (arr[i].x - p.x) +                 (arr[i].y - p.y) * (arr[i].y - p.y));    sort(arr, arr+n, comparison);    int freq1 = 0;      // Frequency of group 0    int freq2 = 0;      // Frequency of group 1    for (int i = 0; i < k; i++)    {        if (arr[i].val == 0)            freq1++;        else if (arr[i].val == 1)            freq2++;    }    return (freq1 > freq2 ? 0 : 1);}

0 0