DBSCAN, a Density-Based Algorithm

Source: Internet. Published by: reflector dll编程. Editor: 程序博客网. Time: 2024/04/29 02:52

The Density-Based Clustering Algorithm DBSCAN

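Distance-based clustering algorithms tend to produce spherical clusters, whereas density-based clustering algorithms can produce clusters of arbitrary shape. This also makes them well suited to data that contains noise points.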

DBSCAN concepts

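Neighborhood radius eps of a point P: the radius of the region centered at P.
eps-neighborhood of a point P: the set of all points whose distance to P is <= eps (P itself included).
Density threshold minPts: a user-specified minimum number of points. It defines the lowest density that still counts as dense and filters out points in sparse regions.
Core point: if the eps-neighborhood of a point P contains >= minPts points, P is a core point.
Border point: if the eps-neighborhood of a point Q contains < minPts points, but Q lies inside the eps-neighborhood of some core point P, Q is a border point.
Noise point: a point R that is neither a core point nor a border point is a noise point.
Directly density-reachable: if q lies in the eps-neighborhood of p and p is a core point, q is directly density-reachable from p.
Density-reachable: given a chain of objects P1, P2, ..., Pn, if each Pi+1 is directly density-reachable from Pi (for i = 1, ..., n-1), then Pn is density-reachable from P1 (reachability is transitive, so this covers indirect reachability).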
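
To make the definitions above concrete, the following minimal sketch labels each point of a tiny 2-D data set as core, border, or noise. The helper names (`neighbors`, `classify`), the sample points, and the use of Euclidean distance via `math.dist` are assumptions made for illustration and are not part of the original code. As in the definitions, a point counts as a member of its own eps-neighborhood.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    hoods = [neighbors(points, i, eps) for i in range(len(points))]
    # A core point has at least min_pts points in its eps-neighborhood.
    core = {i for i, hood in enumerate(hoods) if len(hood) >= min_pts}
    labels = []
    for i, hood in enumerate(hoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data: a tight group, one point on its edge, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify(points, eps=1.0, min_pts=3)
print(labels)
```

With these assumed values, the four corner points of the unit square come out as core points, (2.0, 1.0) is a border point because its only neighbor (1.0, 1.0) is a core point, and (8.0, 8.0) is noise.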

The idea behind the DBSCAN algorithm

Check whether two points lie within eps of each other (the distance test behind direct density-reachability)

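import numpy as np

def eps_neighborhood(a, b, eps):
    # Euclidean distance is assumed here; <= matches the eps-neighborhood definition
    return np.linalg.norm(a - b) <= eps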

Find the eps-neighborhood of a point

def region_query(dataSet, point_id, eps):
    # dataSet stores one point per column, so shape[1] (a NumPy array attribute)
    # is the number of points
    n_points = dataSet.shape[1]
    seeds = []
    for i in range(n_points):
        if eps_neighborhood(dataSet[:, point_id], dataSet[:, i], eps):
            seeds.append(i)
    return seeds

Grow a cluster around a core object and merge

Merge two sets that contain density-connected elements

# Label values assumed here; the original listing does not define them
UNCLASSIFIED = 0
NOISE = -1

def expand_cluster(dataSet, clusterResults, point_id, cluster_id, eps, minPts):
    seeds = region_query(dataSet, point_id, eps)
    if len(seeds) < minPts:
        clusterResults[point_id] = NOISE  # mark as noise (may later become a border point)
        return False
    else:
        clusterResults[point_id] = cluster_id  # assign the point to this cluster
        for seed_id in seeds:
            clusterResults[seed_id] = cluster_id  # its eps-neighborhood joins the cluster too
        while len(seeds) > 0:  # keep expanding: every seed is density-reachable from the cluster
            current_point = seeds[0]
            expand_seeds = region_query(dataSet, current_point, eps)
            if len(expand_seeds) >= minPts:  # current_point is a core point
                for result_point in expand_seeds:
                    if clusterResults[result_point] == UNCLASSIFIED:
                        # unclassified point: add it to the cluster and to the seeds
                        seeds.append(result_point)
                        clusterResults[result_point] = cluster_id
                    elif clusterResults[result_point] == NOISE:
                        # point previously marked as noise: absorb it as a border point
                        clusterResults[result_point] = cluster_id
            seeds = seeds[1:]
        return True

DBSCAN: output the clustering result

def dbscan(dataSet, eps, minPts):
    cluster_id = 1
    n_points = dataSet.shape[1]  # one point per column
    clusterResults = [UNCLASSIFIED] * n_points
    for point_id in range(n_points):
        if clusterResults[point_id] == UNCLASSIFIED:
            if expand_cluster(dataSet, clusterResults, point_id, cluster_id, eps, minPts):
                cluster_id = cluster_id + 1
    # return the per-point labels and the number of clusters found
    return clusterResults, cluster_id - 1
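
As a way to try the procedure end to end, here is a compact, self-contained sketch of the same region-growing idea in pure Python. It is not the author's code: it stores one point per tuple rather than one per column, uses `math.dist` for the Euclidean distance, keeps the seeds in a `deque`, and never reassigns a point that already belongs to another cluster. The label values `UNCLASSIFIED = 0` and `NOISE = -1` and the sample data are assumptions.

```python
import math
from collections import deque

UNCLASSIFIED = 0  # assumed sentinel values, chosen for this sketch
NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return (labels, number_of_clusters); cluster ids start at 1."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] != UNCLASSIFIED:
                continue  # already assigned, nothing to expand
            labels[j] = cluster_id
            hood = region_query(points, j, eps)
            if len(hood) >= min_pts:  # j is a core point: keep expanding
                queue.extend(hood)
    return labels, cluster_id

# Hypothetical data: the first point sits on the edge of the first group,
# so it is marked as noise at first and absorbed as a border point later.
points = [(2.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (8.0, 8.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, n_clusters = dbscan(points, eps=1.0, min_pts=3)
print(labels, n_clusters)
```

On this data the sketch finds two clusters and leaves (8.0, 8.0) as noise. The first point shows why the NOISE branch matters: it is visited before any core point, gets marked as noise, and is only pulled into cluster 1 once the core point (1.0, 1.0) is expanded.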
