Implementing k-means clustering with RHadoop

Source: Internet. Editor: 程序博客网. Time: 2024/05/23 02:29

Reference: Big Data Analytics With R And Hadoop

Code source: https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/tests/kmeans.R

    # Copyright 2011 Revolution Analytics
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    library(rmr2)

    ## @knitr kmeans-signature
    kmeans.mr =
      function(
        P,
        num.clusters,
        num.iter,
        combine,
        in.memory.combine) {
    ## @knitr kmeans-dist.fun
        dist.fun =
          function(C, P) {
            apply(
              C,
              1,
              function(x)
                colSums((t(P) - x)^2))}
    ## @knitr kmeans.map
        kmeans.map =
          function(., P) {
            nearest = {
              if(is.null(C))
                sample(
                  1:num.clusters,
                  nrow(P),
                  replace = TRUE)
              else {
                D = dist.fun(C, P)
                nearest = max.col(-D)}}
            if(!(combine || in.memory.combine))
              keyval(nearest, P)
            else
              keyval(nearest, cbind(1, P))}
    ## @knitr kmeans.reduce
        kmeans.reduce = {
          if (!(combine || in.memory.combine) )
            function(., P)
              t(as.matrix(apply(P, 2, mean)))
          else
            function(k, P)
              keyval(
                k,
                t(as.matrix(apply(P, 2, sum))))}
    ## @knitr kmeans-main-1
        C = NULL
        for(i in 1:num.iter ) {
          C =
            values(
              from.dfs(
                mapreduce(
                  P,
                  map = kmeans.map,
                  reduce = kmeans.reduce)))
          if(combine || in.memory.combine)
            C = C[, -1]/C[, 1]
    ## @knitr end
    #      points(C, col = i + 1, pch = 19)
    ## @knitr kmeans-main-2
          if(nrow(C) < num.clusters) {
            C =
              rbind(
                C,
                matrix(
                  rnorm(
                    (num.clusters -
                       nrow(C)) * nrow(C)),
                  ncol = nrow(C)) %*% C) }}
        C}
    ## @knitr end

    ## sample runs
    ##

    out = list()

    for(be in c("local", "hadoop")) {
      rmr.options(backend = be)
      set.seed(0)
    ## @knitr kmeans-data
      P =
        do.call(
          rbind,
          rep(
            list(
              matrix(
                rnorm(10, sd = 10),
                ncol=2)),
            20)) +
        matrix(rnorm(200), ncol =2)
    ## @knitr end
    #  x11()
    #  plot(P)
    #  points(P)
      out[[be]] =
    ## @knitr kmeans-run
        kmeans.mr(
          to.dfs(P),
          num.clusters = 12,
          num.iter = 5,
          combine = FALSE,
          in.memory.combine = FALSE)
    ## @knitr end
    }

    # would love to take this step but kmeans in randomized in a way that makes it hard to be completely reprodubile
    # stopifnot(rmr2:::cmp(out[['hadoop']], out[['local']]))
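To see what the map and reduce steps compute without a Hadoop setup, here is a minimal sketch of a single k-means iteration in plain R. It reuses the dist.fun expression and the max.col(-D) assignment from the listing above. The data, the choice of the first three points as starting centers, and the names new.C and idx are made up for illustration and are not part of the rmr2 code.

```r
# Squared Euclidean distance from every row of P to every row of C
# (same expression as dist.fun in the listing above).
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

set.seed(0)
P = matrix(rnorm(40), ncol = 2)   # 20 made-up points in 2 dimensions
C = P[1:3, , drop = FALSE]        # assumption: first 3 points as starting centers

D = dist.fun(C, P)                # 20 x 3 matrix: rows are points, columns are centers
nearest = max.col(-D)             # per point, index of the closest center

# Local stand-in for the reduce step: average the points assigned to each center.
new.C = t(sapply(split(seq_len(nrow(P)), nearest),
                 function(idx) colMeans(P[idx, , drop = FALSE])))
```

In the MapReduce version, nearest plays the role of the key, so every point sharing a key reaches the same reduce call, which then averages them.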

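The k-means code in the book Big Data Analytics With R And Hadoop has some problems. I ran into two when using it. The first was that combine could not be found. Comparing the code against the map function, I found that a function keyword was missing; adding it fixed the problem. The second was a problem with the return value of apply, and I still have not found the cause. If anyone passing by has solved it, I would appreciate your advice. It was precisely this apply return-value error that led me to the code above. I tested it once and it worked.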
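The exact apply error is not recorded here, so the following is only a possibly related illustration of how the shape of apply's return value depends on what the applied function returns. This is why the reduce function in the rmr2 code wraps its result in t(as.matrix(...)). The matrix P and the names m, row and s are made up for this sketch.

```r
P = matrix(1:6, nrow = 3)            # 3 rows, 2 columns of made-up data

# When FUN returns a scalar, apply returns a plain vector with no dim attribute.
m = apply(P, 2, mean)
is.null(dim(m))

# as.matrix turns the vector into a 2 x 1 column; t() makes it a 1 x 2 row,
# i.e. one center per row, as the reduce step in the listing produces.
row = t(as.matrix(m))

# When FUN returns a vector, each result becomes a COLUMN of the output,
# so the result is transposed relative to the input rows.
s = apply(P, 1, function(x) x * 2)   # 2 x 3, not 3 x 2
```

If code expects a one-row matrix but receives the bare vector m, later calls such as nrow(C) return NULL, which is one way an apply result can break surrounding code.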

0 0