The K-means Algorithm

Source: Internet  Editor: 程序博客网  Time: 2024/06/02 01:39

hit2015spring

Feel free to follow my blog: http://blog.csdn.NET/hit2015spring

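Preliminaries

In unsupervised learning the labels of the training samples are unknown, and the goal is to reveal the intrinsic properties and regularities of the data by learning from those samples. Clustering tries to partition the samples of a data set into several, usually disjoint, subsets, each of which is called a cluster. The input is simply a pile of unlabeled samples, each of which is an n-dimensional feature vector $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$.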

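In other words, an object is described by n features, and these features reflect which category the object belongs to. A clustering algorithm then partitions the sample set D into k disjoint clusters. For example, take a group of people: some wear red and have long hair, some wear green and have long hair, some wear white and have short hair, some wear black and have long hair, and so on. We want to split them simply into boys and girls. The split is driven by a measure over the features; here clothing color and hair length are the two feature dimensions. All we start with is a group of people, and we use the relationships among their features to divide them into two classes.

(Of course, the labels "boy" and "girl" are ones we add ourselves. During k-means clustering the algorithm does not know these labels; it only groups together the samples it considers to be of the same class, based on the relationship among their features, that is, their distance.)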

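This brings in the definition of distance.
For two samples $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ and $x_j = (x_{j1}, x_{j2}, \dots, x_{jn})$, the distance between them is:

$$\mathrm{dist}_{mk}(x_i, x_j) = \left( \sum_{u=1}^{n} |x_{iu} - x_{ju}|^{p} \right)^{\frac{1}{p}} \tag{1}$$

Expression (1) is called the Minkowski distance.

When p = 2, it is the Euclidean distance:

$$\mathrm{dist}_{ed}(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{u=1}^{n} |x_{iu} - x_{ju}|^{2}}$$

When p = 1, it is the Manhattan distance:

$$\mathrm{dist}_{man}(x_i, x_j) = \|x_i - x_j\|_1 = \sum_{u=1}^{n} |x_{iu} - x_{ju}|$$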

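The measures above assume that the attributes are ordinal, that is, that their values have an order. For example, with attribute values (1, 2, 3), 1 is far from 3 and close to 2, so the distance can be computed directly from the values. There are also non-ordinal attributes, such as {red clothes, black clothes, blue clothes}. Their values cannot be used in the computation directly, and the VDM distance is used instead. For details see the 西瓜书 (the "watermelon book", Zhou Zhihua's Machine Learning), p. 200.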
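The Minkowski distance defined above can be sketched in a few lines of C++. This is only an illustration, not code from the original post: the function name `minkowskiDistance` and the use of `std::vector` are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minkowski distance of order p (p >= 1) between two n-dimensional samples.
// p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
// Hypothetical helper written for this example.
double minkowskiDistance(const std::vector<double>& xi,
                         const std::vector<double>& xj,
                         double p)
{
    assert(xi.size() == xj.size());
    assert(p >= 1.0);

    double sum = 0.0;
    for (std::size_t u = 0; u < xi.size(); ++u) {
        sum += std::pow(std::fabs(xi[u] - xj[u]), p);
    }
    return std::pow(sum, 1.0 / p);
}
```

For the samples (0, 0) and (3, 4), calling the function with p = 2 gives the Euclidean distance 5, and with p = 1 the Manhattan distance 7.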

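k-means clustering

Given a sample set $D = \{x_1, x_2, \dots, x_m\}$, the k-means algorithm looks for a cluster partition $C = \{C_1, C_2, \dots, C_k\}$ that minimizes the squared error:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|_2^2$$

Here $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is simply the mean vector of cluster $C_i$. The smaller E is, the more similar the samples within each cluster are.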
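To make the squared error E and the mean vector concrete, here is a minimal sketch that evaluates both for a partition that is already given. The names `Sample`, `clusterMean` and `squaredError` are assumptions introduced for this example, and the sketch assumes that no cluster is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Sample;

// Mean vector mu_i of one cluster C_i (the cluster must not be empty).
Sample clusterMean(const std::vector<Sample>& cluster)
{
    assert(!cluster.empty());
    Sample mu(cluster[0].size(), 0.0);
    for (std::size_t s = 0; s < cluster.size(); ++s) {
        assert(cluster[s].size() == mu.size());
        for (std::size_t d = 0; d < mu.size(); ++d) {
            mu[d] += cluster[s][d];
        }
    }
    for (std::size_t d = 0; d < mu.size(); ++d) {
        mu[d] /= static_cast<double>(cluster.size());
    }
    return mu;
}

// Squared error E: for every cluster, sum the squared Euclidean
// distances between its samples and its mean vector.
double squaredError(const std::vector<std::vector<Sample> >& clusters)
{
    double E = 0.0;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        const Sample mu = clusterMean(clusters[i]);
        for (std::size_t s = 0; s < clusters[i].size(); ++s) {
            for (std::size_t d = 0; d < mu.size(); ++d) {
                const double diff = clusters[i][s][d] - mu[d];
                E += diff * diff;
            }
        }
    }
    return E;
}
```

For the partition {(0, 0), (2, 0)} and {(10, 10)}, the first mean is (1, 0), each of its two samples contributes 1, the single-sample cluster contributes 0, and so E = 2.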

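Finding the exact minimizer, however, is far from easy (the problem is NP-hard in general), so k-means uses a greedy algorithm to find an approximate solution.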

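The pseudocode is as follows:
1. Given the value of k chosen in advance, randomly pick k of the original samples as initial values; these serve as the k centers.

2. For every point 1, 2, ..., m, compute its distance to each of the k centers.

3. Each point now has k distances; take the smallest one and assign the point to that cluster.

4. Each of the k clusters now contains some points. Compute the mean of the points in each cluster and update the k cluster centers accordingly.

5. Check whether the stopping condition is met (for example, a maximum number of iterations or a negligible change in the cost). If it is not, repeat from step 2.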
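The steps above can be sketched compactly with `std::vector`. This is an illustrative sketch, not the implementation listed later in the post: the names `Point`, `squaredDistance` and `kMeans` are assumptions, step 1 is left to the caller (the initial centers are passed in so that the run is reproducible), and the stopping condition used here is "no assignment changed, or the iteration limit was reached".

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<double> Point;

// Squared Euclidean distance; the square root is not needed
// to decide which center is nearest.
double squaredDistance(const Point& a, const Point& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Runs steps 2 to 5 starting from the given centers.
// centers is updated in place; the result holds one cluster index per point.
std::vector<int> kMeans(const std::vector<Point>& points,
                        std::vector<Point>& centers,
                        int maxIterations)
{
    assert(!points.empty() && !centers.empty());
    const std::size_t k = centers.size();
    const std::size_t dim = points[0].size();
    std::vector<int> labels(points.size(), -1);

    for (int iter = 0; iter < maxIterations; ++iter) {
        bool changed = false;

        // Steps 2 and 3: assign every point to its nearest center.
        for (std::size_t i = 0; i < points.size(); ++i) {
            int best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double dist = squaredDistance(points[i], centers[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = static_cast<int>(c);
                }
            }
            if (labels[i] != best) {
                labels[i] = best;
                changed = true;
            }
        }

        // Step 5: stop as soon as no assignment changed.
        if (!changed) {
            break;
        }

        // Step 4: move each center to the mean of the points assigned to it.
        std::vector<Point> sums(k, Point(dim, 0.0));
        std::vector<int> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            counts[labels[i]]++;
            for (std::size_t d = 0; d < dim; ++d) {
                sums[labels[i]][d] += points[i][d];
            }
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] > 0) {   // an empty cluster keeps its old center
                for (std::size_t d = 0; d < dim; ++d) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
    }
    return labels;
}
```

With the one-dimensional points 0, 1, 10, 11 and initial centers 0 and 1, the first pass assigns only the point 0 to the first center, the second pass moves the point 1 over to it, and the third pass changes nothing, so the loop stops with the centers at 0.5 and 10.5.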

A concrete example

[Three figures illustrating the example; images not available]

C++ code

// Implementation of the KMeans class

#include <stdlib.h>
#include <string.h>   // memset, memcpy
#include <math.h>
#include <time.h>
#include <assert.h>
#include <iostream>
#include "k-means.h"
using namespace std;

KMeans::KMeans(int dimNum, int clusterNum)
{
    m_dimNum = dimNum;
    m_clusterNum = clusterNum;

    // One mean vector per cluster, initialized to zero
    m_means = new double*[m_clusterNum];
    for(int i = 0; i < m_clusterNum; i++)
    {
        m_means[i] = new double[m_dimNum];
        memset(m_means[i], 0, sizeof(double) * m_dimNum);
    }

    m_initMode = InitRandom;
    m_maxIterNum = 100;
    m_endError = 0.001;
}

KMeans::~KMeans()
{
    for(int i = 0; i < m_clusterNum; i++)
    {
        delete[] m_means[i];
    }
    delete[] m_means;
}

void KMeans::Cluster(const char* sampleFileName, const char* labelFileName)
{
    // Check the sample file
    ifstream sampleFile(sampleFileName, ios_base::binary);
    assert(sampleFile);

    int size = 0;
    int dim = 0;
    sampleFile.read((char*)&size, sizeof(int));
    sampleFile.read((char*)&dim, sizeof(int));
    assert(size >= m_clusterNum);
    assert(dim == m_dimNum);

    // Initialize model
    Init(sampleFile);

    // Iteration
    double* x = new double[m_dimNum];   // Sample data
    int label = -1;     // Class index
    int iterNum = 0;
    double lastCost = 0;
    double currCost = 0;
    int unchanged = 0;
    bool loop = true;
    int* counts = new int[m_clusterNum];
    double** next_means = new double*[m_clusterNum];    // New model for reestimation
    for(int i = 0; i < m_clusterNum; i++)
    {
        next_means[i] = new double[m_dimNum];
    }

    while(loop)
    {
        // Clean buffer for classification
        memset(counts, 0, sizeof(int) * m_clusterNum);
        for(int i = 0; i < m_clusterNum; i++)
        {
            memset(next_means[i], 0, sizeof(double) * m_dimNum);
        }

        lastCost = currCost;
        currCost = 0;

        sampleFile.clear();
        sampleFile.seekg(sizeof(int) * 2, ios_base::beg);

        // Classification: assign each sample to its nearest mean
        for(int i = 0; i < size; i++)
        {
            sampleFile.read((char*)x, sizeof(double) * m_dimNum);
            currCost += GetLabel(x, &label);

            counts[label]++;
            for(int d = 0; d < m_dimNum; d++)
            {
                next_means[label][d] += x[d];
            }
        }
        currCost /= size;

        // Reestimation: move each mean to the average of its samples
        for(int i = 0; i < m_clusterNum; i++)
        {
            if(counts[i] > 0)
            {
                for(int d = 0; d < m_dimNum; d++)
                {
                    next_means[i][d] /= counts[i];
                }
                memcpy(m_means[i], next_means[i], sizeof(double) * m_dimNum);
            }
        }

        // Terminal conditions
        iterNum++;
        if(fabs(lastCost - currCost) < m_endError * lastCost)
        {
            unchanged++;
        }
        if(iterNum >= m_maxIterNum || unchanged >= 3)
        {
            loop = false;
        }

        //DEBUG
        //cout << "Iter: " << iterNum << ", Average Cost: " << currCost << endl;
    }

    // Output the label file
    ofstream labelFile(labelFileName, ios_base::binary);
    assert(labelFile);

    labelFile.write((char*)&size, sizeof(int));
    sampleFile.clear();
    sampleFile.seekg(sizeof(int) * 2, ios_base::beg);

    for(int i = 0; i < size; i++)
    {
        sampleFile.read((char*)x, sizeof(double) * m_dimNum);
        GetLabel(x, &label);
        labelFile.write((char*)&label, sizeof(int));
    }

    sampleFile.close();
    labelFile.close();

    delete[] counts;
    delete[] x;
    for(int i = 0; i < m_clusterNum; i++)
    {
        delete[] next_means[i];
    }
    delete[] next_means;
}

// N is the number of feature vectors
void KMeans::Cluster(double *data, int N, int *Label)
{
    int size = N;
    assert(size >= m_clusterNum);

    // Initialize model
    Init(data, N);

    // Iteration
    double* x = new double[m_dimNum];   // Sample data
    int label = -1;     // Class index
    int iterNum = 0;
    double lastCost = 0;
    double currCost = 0;
    int unchanged = 0;
    bool loop = true;
    int* counts = new int[m_clusterNum];
    double** next_means = new double*[m_clusterNum];    // New model for reestimation
    for(int i = 0; i < m_clusterNum; i++)
    {
        next_means[i] = new double[m_dimNum];
    }

    while(loop)
    {
        // Clean buffer for classification
        memset(counts, 0, sizeof(int) * m_clusterNum);
        for(int i = 0; i < m_clusterNum; i++)
        {
            memset(next_means[i], 0, sizeof(double) * m_dimNum);
        }

        lastCost = currCost;
        currCost = 0;

        // Classification: assign each sample to its nearest mean
        for(int i = 0; i < size; i++)
        {
            for(int j = 0; j < m_dimNum; j++)
                x[j] = data[i*m_dimNum+j];

            currCost += GetLabel(x, &label);

            counts[label]++;
            for(int d = 0; d < m_dimNum; d++)
            {
                next_means[label][d] += x[d];
            }
        }
        currCost /= size;

        // Reestimation: move each mean to the average of its samples
        for(int i = 0; i < m_clusterNum; i++)
        {
            if(counts[i] > 0)
            {
                for(int d = 0; d < m_dimNum; d++)
                {
                    next_means[i][d] /= counts[i];
                }
                memcpy(m_means[i], next_means[i], sizeof(double) * m_dimNum);
            }
        }

        // Terminal conditions
        iterNum++;
        if(fabs(lastCost - currCost) < m_endError * lastCost)
        {
            unchanged++;
        }
        if(iterNum >= m_maxIterNum || unchanged >= 3)
        {
            loop = false;
        }

        //DEBUG
        //cout << "Iter: " << iterNum << ", Average Cost: " << currCost << endl;
    }

    // Output the labels
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < m_dimNum; j++)
            x[j] = data[i*m_dimNum+j];
        GetLabel(x, &label);
        Label[i] = label;
    }

    delete[] counts;
    delete[] x;
    for(int i = 0; i < m_clusterNum; i++)
    {
        delete[] next_means[i];
    }
    delete[] next_means;
}

void KMeans::Init(double *data, int N)
{
    int size = N;

    if(m_initMode == InitRandom)
    {
        int inteval = size / m_clusterNum;
        double* sample = new double[m_dimNum];

        // Seed the random-number generator with current time
        srand((unsigned)time(NULL));

        for(int i = 0; i < m_clusterNum; i++)
        {
            // Pick one sample at random inside the i-th interval
            // (computed in floating point to avoid int overflow)
            int select = inteval * i + (int)((inteval - 1) * (rand() / (double)RAND_MAX));
            for(int j = 0; j < m_dimNum; j++)
                sample[j] = data[select*m_dimNum+j];
            memcpy(m_means[i], sample, sizeof(double) * m_dimNum);
        }

        delete[] sample;
    }
    else if(m_initMode == InitUniform)
    {
        double* sample = new double[m_dimNum];

        for(int i = 0; i < m_clusterNum; i++)
        {
            // Pick evenly spaced samples
            int select = i * size / m_clusterNum;
            for(int j = 0; j < m_dimNum; j++)
                sample[j] = data[select*m_dimNum+j];
            memcpy(m_means[i], sample, sizeof(double) * m_dimNum);
        }

        delete[] sample;
    }
    else if(m_initMode == InitManual)
    {
        // Do nothing: the means were set with SetMean()
    }
}

void KMeans::Init(ifstream& sampleFile)
{
    int size = 0;
    sampleFile.seekg(0, ios_base::beg);
    sampleFile.read((char*)&size, sizeof(int));

    if(m_initMode == InitRandom)
    {
        int inteval = size / m_clusterNum;
        double* sample = new double[m_dimNum];

        // Seed the random-number generator with current time
        srand((unsigned)time(NULL));

        for(int i = 0; i < m_clusterNum; i++)
        {
            // Pick one sample at random inside the i-th interval
            // (computed in floating point to avoid int overflow)
            int select = inteval * i + (int)((inteval - 1) * (rand() / (double)RAND_MAX));
            int offset = sizeof(int) * 2 + select * sizeof(double) * m_dimNum;

            sampleFile.seekg(offset, ios_base::beg);
            sampleFile.read((char*)sample, sizeof(double) * m_dimNum);
            memcpy(m_means[i], sample, sizeof(double) * m_dimNum);
        }

        delete[] sample;
    }
    else if(m_initMode == InitUniform)
    {
        double* sample = new double[m_dimNum];

        for(int i = 0; i < m_clusterNum; i++)
        {
            // Pick evenly spaced samples
            int select = i * size / m_clusterNum;
            int offset = sizeof(int) * 2 + select * sizeof(double) * m_dimNum;

            sampleFile.seekg(offset, ios_base::beg);
            sampleFile.read((char*)sample, sizeof(double) * m_dimNum);
            memcpy(m_means[i], sample, sizeof(double) * m_dimNum);
        }

        delete[] sample;
    }
    else if(m_initMode == InitManual)
    {
        // Do nothing: the means were set with SetMean()
    }
}

// Finds the nearest mean; stores its index in *label and returns the distance
double KMeans::GetLabel(const double* sample, int* label)
{
    double dist = -1;
    for(int i = 0; i < m_clusterNum; i++)
    {
        double temp = CalcDistance(sample, m_means[i], m_dimNum);
        if(temp < dist || dist == -1)
        {
            dist = temp;
            *label = i;
        }
    }
    return dist;
}

// Euclidean distance between x and u
double KMeans::CalcDistance(const double* x, const double* u, int dimNum)
{
    double temp = 0;
    for(int d = 0; d < dimNum; d++)
    {
        temp += (x[d] - u[d]) * (x[d] - u[d]);
    }
    return sqrt(temp);
}

ostream& operator<<(ostream& out, KMeans& kmeans)
{
    out << "<KMeans>" << endl;
    out << "<DimNum> " << kmeans.m_dimNum << " </DimNum>" << endl;
    out << "<ClusterNum> " << kmeans.m_clusterNum << " </ClusterNum>" << endl;

    out << "<Mean>" << endl;
    for(int i = 0; i < kmeans.m_clusterNum; i++)
    {
        for(int d = 0; d < kmeans.m_dimNum; d++)
        {
            out << kmeans.m_means[i][d] << " ";
        }
        out << endl;
    }
    out << "</Mean>" << endl;

    out << "</KMeans>" << endl;
    return out;
}

// k-means.h

#pragma once
#include <string.h>   // memcpy
#include <fstream>

class KMeans
{
public:
    enum InitMode
    {
        InitRandom,
        InitManual,
        InitUniform,
    };

    KMeans(int dimNum = 1, int clusterNum = 1);
    ~KMeans();

    void SetMean(int i, const double* u){ memcpy(m_means[i], u, sizeof(double) * m_dimNum); }
    void SetInitMode(int i)             { m_initMode = i; }
    void SetMaxIterNum(int i)           { m_maxIterNum = i; }
    void SetEndError(double f)          { m_endError = f; }

    double* GetMean(int i)  { return m_means[i]; }
    int GetInitMode()       { return m_initMode; }
    int GetMaxIterNum()     { return m_maxIterNum; }
    double GetEndError()    { return m_endError; }

    /*  SampleFile: <size><dim><data>...
        LabelFile:  <size><label>...
    */
    void Cluster(const char* sampleFileName, const char* labelFileName);
    void Init(std::ifstream& sampleFile);
    void Init(double *data, int N);
    void Cluster(double *data, int N, int *Label);
    friend std::ostream& operator<<(std::ostream& out, KMeans& kmeans);

private:
    int m_dimNum;
    int m_clusterNum;
    double** m_means;

    int m_initMode;
    int m_maxIterNum;       // The stopping criterion regarding the number of iterations
    double m_endError;      // The stopping criterion regarding the error

    double GetLabel(const double* x, int* label);
    double CalcDistance(const double* x, const double* u, int dimNum);
};

// Test program

#include <stdio.h>
#include <iostream>
#include "k-means.h"
using namespace std;

int main()
{
    double data[] = {
        0.0, 0.2, 0.4,
        0.3, 0.2, 0.4,
        0.4, 0.2, 0.4,
        0.5, 0.2, 0.4,
        5.0, 5.2, 8.4,
        6.0, 5.2, 7.4,
        4.0, 5.2, 4.4,
        10.3, 10.4, 10.5,
        10.1, 10.6, 10.7,
        11.3, 10.2, 10.9
    };

    const int size = 10; //Number of samples
    const int dim = 3;   //Dimension of feature
    const int cluster_num = 4; //Cluster number

    KMeans* kmeans = new KMeans(dim, cluster_num);
    int* labels = new int[size];
    kmeans->SetInitMode(KMeans::InitUniform);
    kmeans->Cluster(data, size, labels);

    for(int i = 0; i < size; ++i)
    {
        printf("%f, %f, %f belongs to %d cluster\n", data[i*dim+0], data[i*dim+1], data[i*dim+2], labels[i]);
    }

    delete []labels;
    delete kmeans;

    return 0;
}
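The file-based `Cluster` overload above reads a binary sample file laid out as `<size><dim><data>...`: two `int` values followed by `size * dim` values of type `double`. The post does not show how such a file is produced, so the helper below is only a sketch under that reading of the format. The name `writeSampleFile` is hypothetical, and the file is assumed to be written and read on the same machine, because raw `int` and `double` bytes are not portable across platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>

// Writes samples in the <size><dim><data> layout: the sample count,
// the feature dimension, then size * dim doubles stored row by row.
// Hypothetical helper, not part of the original KMeans class.
bool writeSampleFile(const char* fileName,
                     const std::vector<double>& data,
                     int size,
                     int dim)
{
    assert(size > 0 && dim > 0);
    assert(data.size() == static_cast<std::size_t>(size) * dim);

    std::ofstream out(fileName, std::ios_base::binary);
    if (!out) {
        return false;
    }
    out.write(reinterpret_cast<const char*>(&size), sizeof(int));
    out.write(reinterpret_cast<const char*>(&dim), sizeof(int));
    out.write(reinterpret_cast<const char*>(&data[0]),
              sizeof(double) * data.size());
    return out.good();
}
```

A file written this way can then be passed as `sampleFileName` to `Cluster(sampleFileName, labelFileName)`, provided `dim` matches the `dimNum` given to the `KMeans` constructor.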