Learning clustering algorithms: sklearn.cluster.DBSCAN
class DBSCAN(BaseEstimator, ClusterMixin):
    """Perform DBSCAN clustering from vector array or distance matrix.

    DBSCAN - Density-Based Spatial Clustering of Applications with Noise.
    Finds core samples of high density and expands clusters from them.
    Good for data which contains clusters of similar density.

    Read more in the :ref:`User Guide <dbscan>`.

    Parameters
    ----------
    eps : float, optional (default=0.5)
        The maximum distance between two samples for them to belong to
        the same neighborhood.
    min_samples : int, optional (default=5)
        The minimum number of samples a neighborhood must contain for a
        point to be a core point.
    metric : string, or callable (default='euclidean')
        The distance metric; the default Euclidean distance can be used,
        or a custom distance function can be supplied.
    metric_params : dict, optional (default=None)
        Additional keyword arguments for the metric function.
    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional (default='auto')
        Nearest-neighbor search algorithm: 'brute' is a brute-force
        implementation, 'kd_tree' uses a KD-tree, 'ball_tree' uses a
        ball tree, and 'auto' weighs the three and picks the one
        expected to fit best.
    leaf_size : int, optional (default=30)
        When using a KD-tree or ball tree, the threshold number of
        points in a leaf node at which subtree construction stops
        (a parameter of the nearest-neighbor search).
    p : float, optional (default=None)
        The power of the Minkowski metric (and its weighted variant):
        p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    n_jobs : int, optional (default=1)
        Number of parallel jobs; -1 uses all CPUs.

    Attributes
    ----------
    core_sample_indices_ : array, shape = [n_core_samples]
        Indices of the core points. ``labels_`` cannot distinguish core
        points from border points, so this index is needed to identify
        the core points.
    components_ : array, shape = [n_core_samples, n_features]
        The core points themselves.
    labels_ : array, shape = [n_samples]
        Cluster label of each point; -1 denotes noise.
    """

    def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
                 metric_params=None, algorithm='auto', leaf_size=30, p=None,
                 n_jobs=1):
        self.eps = eps
        self.min_samples = min_samples
        self.metric = metric
        self.metric_params = metric_params
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.p = p
        self.n_jobs = n_jobs

    def fit(self, X, y=None, sample_weight=None):
        """Perform DBSCAN clustering from features or distance matrix.

        Parameters
        ----------
        X : the data to cluster
        sample_weight : weight of each sample
        y : Ignored
        """
        X = check_array(X, accept_sparse='csr')
        clust = dbscan(X, sample_weight=sample_weight, **self.get_params())
        self.core_sample_indices_, self.labels_ = clust
        if len(self.core_sample_indices_):
            # fix for scipy sparse indexing issue
            self.components_ = X[self.core_sample_indices_].copy()
        else:
            # no core samples
            self.components_ = np.empty((0, X.shape[1]))
        return self

    def fit_predict(self, X, y=None, sample_weight=None):
        """Performs clustering on X and returns cluster labels.

        Parameters
        ----------
        X : array or sparse (CSR) matrix of shape (n_samples, n_features), or \
                array of shape (n_samples, n_samples)
            A feature array, or array of distances between samples if
            ``metric='precomputed'``.
        sample_weight : array, shape (n_samples,), optional
            Weight of each sample, such that a sample with a weight of at
            least ``min_samples`` is by itself a core sample; a sample with
            negative weight may inhibit its eps-neighbor from being core.
            Note that weights are absolute, and default to 1.
        y : Ignored

        Returns
        -------
        y : ndarray, shape (n_samples,)
            cluster labels
        """
        self.fit(X, sample_weight=sample_weight)
        return self.labels_
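Because ``metric`` is configurable, geographic data like the latitude/longitude points in the test below can also be clustered on true great-circle distance with the haversine metric (supported by the ball-tree backend), where ``eps`` is expressed in radians, i.e. a distance divided by the Earth's radius. A minimal sketch; the 50 m threshold and the three sample points are illustrative assumptions, not part of the original test:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# [lat, lng] in radians, as the haversine metric expects.
coords = np.radians([[28.571906, 112.337788],   # ~25 m from the next point
                     [28.571915, 112.337533],
                     [28.560938, 112.378183]])  # several km away

EARTH_RADIUS_M = 6371000.0
db = DBSCAN(eps=50.0 / EARTH_RADIUS_M,   # 50 m neighborhood, in radians
            min_samples=2,
            metric='haversine', algorithm='ball_tree')
labels = db.fit(coords).labels_
print(labels)  # [ 0  0 -1]: the two nearby points cluster, the far one is noise
```

With the default Euclidean metric on raw degrees, as in the test below, ``eps`` mixes latitude and longitude degrees, which only approximates real distance over a small area.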
Test:
import pandas as pd
from sklearn.cluster import DBSCAN

def main():
    stopList = [{'id': '105792', 'lat': 28.571906, 'lng': 112.337788},
                {'id': '55792', 'lat': 28.573678, 'lng': 112.381103},
                {'id': '500792', 'lat': 28.571915, 'lng': 112.337533},
                {'id': '5000105792', 'lat': 28.573978, 'lng': 112.35765},
                {'id': '0105792', 'lat': 28.572656, 'lng': 112.3366},
                {'id': '50005792', 'lat': 28.578011, 'lng': 112.330688},
                {'id': '5000105792', 'lat': 28.572228, 'lng': 112.335841},
                {'id': '500105792', 'lat': 28.57849, 'lng': 112.3338},
                {'id': '5005792', 'lat': 28.57239, 'lng': 112.336491},
                {'id': '105792', 'lat': 28.577943, 'lng': 112.330995},
                {'id': '792', 'lat': 28.571921, 'lng': 112.337783},
                {'id': '505792', 'lat': 28.572401, 'lng': 112.3359},
                {'id': '500092', 'lat': 28.569629, 'lng': 112.34005},
                {'id': '50092', 'lat': 28.588048, 'lng': 112.337783},
                {'id': '505792', 'lat': 28.572035, 'lng': 112.335683},
                {'id': '05792', 'lat': 28.560938, 'lng': 112.378183},
                {'id': '55792', 'lat': 28.544781, 'lng': 112.494936},
                {'id': '505792', 'lat': 28.572296, 'lng': 112.336288},
                {'id': '505792', 'lat': 28.571951, 'lng': 112.337806},
                {'id': '55792', 'lat': 28.571551, 'lng': 112.32685}]
    print('%d points in total' % len(stopList))
    initdata = pd.DataFrame(stopList)
    scatterData = initdata[['lat', 'lng']]  # select the fields to cluster on: latitude and longitude
    model = DBSCAN(eps=0.0003, min_samples=2)  # dbscan
    results = model.fit(scatterData)  # cluster
    labels = results.labels_  # cluster label of each point
    print('labels\n', labels)
    print('core_sample_indices_\n', results.core_sample_indices_)  # indices of the core points
    print('components_\n', results.components_)  # the core points

if __name__ == '__main__':
    main()
Output:
20 points in total
labels
[ 0 -1 0 -1 1 -1 2 -1 1 -1 0 2 -1 -1 2 -1 -1 1 0 -1]
core_sample_indices_
[ 0 2 4 6 8 10 11 14 17 18]
components_
[[ 28.571906 112.337788]
[ 28.571915 112.337533]
[ 28.572656 112.3366 ]
[ 28.572228 112.335841]
[ 28.57239 112.336491]
[ 28.571921 112.337783]
[ 28.572401 112.3359 ]
[ 28.572035 112.335683]
[ 28.572296 112.336288]
[ 28.571951 112.337806]]
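As the docstring notes, ``labels_`` alone cannot separate core points from border points; combining it with ``core_sample_indices_`` splits a result into core, border, and noise points. A small 1-D sketch with made-up illustrative data and parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0], [0.2], [0.4], [0.6], [5.0]])
db = DBSCAN(eps=0.25, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
border_mask = (db.labels_ != -1) & ~core_mask  # in a cluster, but not core
noise_mask = db.labels_ == -1

print(db.labels_)                   # [ 0  0  0  0 -1]
print(np.flatnonzero(core_mask))    # [1 2]: have >= 3 neighbors within eps
print(np.flatnonzero(border_mask))  # [0 3]: within eps of a core point only
print(np.flatnonzero(noise_mask))   # [4]
```

Points 0 and 3 get the cluster label 0 just like the core points 1 and 2, which is exactly why ``core_sample_indices_`` is needed to tell them apart.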