Hands-On Mahout Clustering: Canopy + K-means
Source: Internet  Editor: 程序博客网  Time: 2024/06/01 12:25
Reposted.
Original article: http://my.oschina.net/BreathL/blog/58104
Cluster Analysis
Data Model
Clustering Algorithms: K-means and Canopy
Hands-On Practice

    public static void run(Configuration conf, Path input, Path output,
                           DistanceMeasure measure, double t1, double t2, double convergenceDelta,
                           int maxIterations)
        throws Exception {
      Path directoryContainingConvertedInput = new Path(output,
          DIRECTORY_CONTAINING_CONVERTED_INPUT);
      log.info("Preparing Input");
      InputDriver.runJob(input, directoryContainingConvertedInput,
          "org.apache.mahout.math.RandomAccessSparseVector");
      log.info("Running Canopy to get initial clusters");
      CanopyDriver.run(conf, directoryContainingConvertedInput, output, measure,
          t1, t2, false, false);
      log.info("Running KMeans");
      KMeansDriver.run(conf, directoryContainingConvertedInput, new Path(output,
          Cluster.INITIAL_CLUSTERS_DIR), output, measure, convergenceDelta,
          maxIterations, true, false);
      // run ClusterDumper
      ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf,
          output, maxIterations), new Path(output, "clusteredPoints"));
      clusterDumper.printClusters(null);
    }

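
The run method above first runs Canopy to obtain initial clusters and then hands them to K-means. As a rough illustration of that two-stage idea, here is a minimal plain-Java sketch that does not use Mahout at all. The class name CanopyKMeansSketch, its helper methods, the Euclidean distance, and the sample points are all assumptions made for this example; it is a simplified in-memory version, not Mahout's implementation. The parameters t1, t2, convergenceDelta and maxIterations are only meant to mirror the roles of the same-named arguments in the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified in-memory sketch of Canopy seeding followed by K-means (not Mahout code). */
class CanopyKMeansSketch {

  /** Euclidean distance between two points of equal dimension. */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  /** Component-wise mean of a non-empty list of points. */
  static double[] mean(List<double[]> pts, int dim) {
    double[] m = new double[dim];
    for (double[] p : pts) {
      for (int i = 0; i < dim; i++) {
        m[i] += p[i];
      }
    }
    for (int i = 0; i < dim; i++) {
      m[i] /= pts.size();
    }
    return m;
  }

  /**
   * Canopy pass. Each seed collects all points closer than the loose
   * threshold t1; points closer than the tight threshold t2 (t2 < t1)
   * are removed so they cannot seed another canopy.
   */
  static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
    List<double[]> remaining = new ArrayList<>(points);
    List<double[]> centers = new ArrayList<>();
    while (!remaining.isEmpty()) {
      double[] seed = remaining.remove(0);
      List<double[]> members = new ArrayList<>();
      for (double[] p : points) {
        if (distance(seed, p) < t1) {
          members.add(p);
        }
      }
      centers.add(mean(members, seed.length));
      remaining.removeIf(p -> distance(seed, p) < t2);
    }
    return centers;
  }

  /** K-means: assign points to the nearest center, recompute centers, repeat. */
  static List<double[]> kMeans(List<double[]> points, List<double[]> initial,
                               double convergenceDelta, int maxIterations) {
    List<double[]> centers = new ArrayList<>(initial);
    for (int iter = 0; iter < maxIterations; iter++) {
      List<List<double[]>> groups = new ArrayList<>();
      for (int c = 0; c < centers.size(); c++) {
        groups.add(new ArrayList<>());
      }
      for (double[] p : points) {
        int best = 0;
        for (int c = 1; c < centers.size(); c++) {
          if (distance(p, centers.get(c)) < distance(p, centers.get(best))) {
            best = c;
          }
        }
        groups.get(best).add(p);
      }
      boolean converged = true;
      for (int c = 0; c < centers.size(); c++) {
        if (groups.get(c).isEmpty()) {
          continue; // an empty cluster keeps its previous center
        }
        double[] updated = mean(groups.get(c), centers.get(c).length);
        if (distance(updated, centers.get(c)) > convergenceDelta) {
          converged = false;
        }
        centers.set(c, updated);
      }
      if (converged) {
        break;
      }
    }
    return centers;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {1.0, 1.0}, new double[] {1.0, 2.0},
        new double[] {9.0, 9.0}, new double[] {9.0, 8.0});
    List<double[]> seeds = canopyCenters(points, 3.0, 2.0);
    for (double[] c : kMeans(points, seeds, 0.001, 10)) {
      System.out.println(Arrays.toString(c)); // prints [1.0, 1.5] then [9.0, 8.5]
    }
  }
}
```

With these sample points the Canopy pass yields two seeds, so K-means never needs K to be chosen by hand; that is the main reason for combining the two algorithms.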

    @Override
    protected void map(LongWritable key, Text values, Context context)
        throws IOException, InterruptedException {

      String[] numbers = SPACE.split(values.toString());
      // sometimes there are multiple separator spaces
      Collection<Double> doubles = Lists.newArrayList();
      for (String value : numbers) {
        if (!value.isEmpty()) {
          doubles.add(Double.valueOf(value));
        }
      }
      // ignore empty lines in data file
      if (!doubles.isEmpty()) {
        try {
          Vector result = (Vector) constructor.newInstance(doubles.size());
          int index = 0;
          for (Double d : doubles) {
            result.set(index++, d);
          }
          VectorWritable vectorWritable = new VectorWritable(result);
          context.write(new Text(String.valueOf(index)), vectorWritable);

        } catch (InstantiationException e) {
          throw new IllegalStateException(e);
        } catch (IllegalAccessException e) {
          throw new IllegalStateException(e);
        } catch (InvocationTargetException e) {
          throw new IllegalStateException(e);
        }
      }
    }

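
The parsing step of the mapper above can be tried without Hadoop. The following is a minimal sketch that mirrors only the tokenizing logic: split a line on spaces, skip the empty tokens that consecutive spaces produce, and collect the rest as doubles. The class name LineToVectorSketch, the method parseLine, and the sample line are hypothetical, and the SPACE pattern here is assumed to be a single-space separator, since the listing does not show how the mapper defines it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-alone sketch of the mapper's line parsing (no Hadoop or Mahout types). */
class LineToVectorSketch {

  // Assumption: the separator is a single space, as the mapper's comment suggests.
  private static final Pattern SPACE = Pattern.compile(" ");

  /** Converts one line of space-separated numbers into a dense array. */
  static double[] parseLine(String line) {
    List<Double> doubles = new ArrayList<>();
    for (String value : SPACE.split(line)) {
      // Consecutive spaces yield empty tokens; skip them.
      if (!value.isEmpty()) {
        doubles.add(Double.valueOf(value));
      }
    }
    double[] result = new double[doubles.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = doubles.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // Two spaces between the first and second value on purpose.
    System.out.println(Arrays.toString(parseLine("28.7812  34.4632 31.3381")));
    // An empty line produces an empty array, which the mapper would skip.
    System.out.println(parseLine("").length);
  }
}
```

Unlike the mapper, this sketch returns a plain array instead of instantiating a Mahout Vector through reflection, so it only illustrates the input format, not the output key/value pair.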

    hadoop jar ../mahout-examples-0.7-SNAPSHOT-job.jar \
      org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
      -i <input data path> \
      -o <output data path> \
      -k <value of K> \
      -cl <whether to initialize with the Canopy algorithm first: true or false> \
      -t1 <T1 value for Canopy> \
      -t2 <T2 value for Canopy> \
      -xm <execution method: sequential (single machine) or mapreduce (distributed)>

ClusterDump
clusters-2-final

    hadoop jar /soft/mahout/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar \
      org.apache.mahout.utils.clustering.ClusterDumper \
      -i <output path of the previous step>/clusters-2-final/ \
      --pointsDir <output path of the previous step>/clusteredPoints \
      -o <name of the file to save the result to>
