Implementing logistic regression with RHadoop

Source: Internet | Published by: 开放教育网络教育 | Edited by: 程序博客网 | Time: 2024/06/06 17:47

Reference: Big Data Analytics with R and Hadoop

Code source: https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/tests/logistic-regression.R

# Copyright 2011 Revolution Analytics
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## see spark implementation http://www.spark-project.org/examples.html
## see nice derivation here http://people.csail.mit.edu/jrennie/writing/lr.pdf

library(rmr2)

## @knitr logistic.regression-signature
logistic.regression =
  function(input, iterations, dims, alpha){

## @knitr logistic.regression-map
    lr.map =
      function(., M) {
        Y = M[,1]
        X = M[,-1]
        keyval(
          1,
          Y * X *
            g(-Y * as.numeric(X %*% t(plane))))}
## @knitr logistic.regression-reduce
    lr.reduce =
      function(k, Z)
        keyval(k, t(as.matrix(apply(Z,2,sum))))
## @knitr logistic.regression-main
    plane = t(rep(0, dims))
    g = function(z) 1/(1 + exp(-z))
    for (i in 1:iterations) {
      gradient =
        values(
          from.dfs(
            mapreduce(
              input,
              map = lr.map,
              reduce = lr.reduce,
              combine = TRUE)))
      plane = plane + alpha * gradient }
    plane }
## @knitr end

out = list()
test.size = 10^5
for (be in c("local", "hadoop")) {
  rmr.options(backend = be)
  ## create test set
  set.seed(0)
## @knitr logistic.regression-data
  eps = rnorm(test.size)
  testdata =
    to.dfs(
      as.matrix(
        data.frame(
          y = 2 * (eps > 0) - 1,
          x1 = 1:test.size,
          x2 = 1:test.size + eps)))
## @knitr end
  out[[be]] =
## @knitr logistic.regression-run
    logistic.regression(
      testdata, 3, 2, 0.05)
## @knitr end
  ## max likelihood solution diverges for separable dataset, (-inf, inf) such as the above
}
stopifnot(
  isTRUE(all.equal(out[['local']], out[['hadoop']], tolerance = 1E-7)))
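For anyone who wants to check the arithmetic without a Hadoop installation, here is a minimal plain-R sketch of what I understand the listing to compute. The map step emits, for each row, y * x * g(-y * (x . plane)), which is that row's contribution to the gradient of the log-likelihood for labels in {-1, +1}; the reduce step adds the contributions column by column; the driver loop then takes a gradient ascent step of size alpha. The names chunk.gradient, logistic.regression.local and the chunks argument are my own and hypothetical: they are not part of rmr2, and the chunking only stands in for the input splits that mapreduce would produce.

```r
# Plain-R sketch of the same update. No rmr2 or Hadoop is needed.
g = function(z) 1 / (1 + exp(-z))

# Per-row gradient contributions for one chunk M, mirroring lr.map.
# Column 1 of M holds the labels (-1 or +1), the rest are features.
chunk.gradient = function(M, plane) {
  Y = M[, 1]
  X = M[, -1, drop = FALSE]
  Y * X * g(-Y * as.numeric(X %*% t(plane)))
}

# Hypothetical local driver: `chunks` imitates the input splits.
logistic.regression.local = function(M, iterations, dims, alpha, chunks = 4) {
  plane = t(rep(0, dims))
  idx = split(seq_len(nrow(M)), rep_len(seq_len(chunks), nrow(M)))
  for (i in 1:iterations) {
    # "map" over chunks, then sum each chunk's rows (the combiner's job)
    partial = lapply(idx, function(rows)
      colSums(chunk.gradient(M[rows, , drop = FALSE], plane)))
    # "reduce": add the partial sums into one 1 x dims gradient
    gradient = t(Reduce(`+`, partial))
    plane = plane + alpha * gradient
  }
  plane
}

# Same kind of test data as the listing, only smaller.
set.seed(0)
n = 1000
eps = rnorm(n)
M = as.matrix(data.frame(y = 2 * (eps > 0) - 1, x1 = 1:n, x2 = 1:n + eps))
plane = logistic.regression.local(M, 3, 2, 0.05)
print(plane)
```

Because the reduce step is a plain sum, the result should not depend on how the rows are split into chunks, apart from floating-point rounding. That is also why the original can pass combine = TRUE and why it compares the local and hadoop backends with a tolerance rather than exactly.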

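This code is much better than the code provided in Big Data Analytics with R and Hadoop. People in the industry always praise that book, but you only discover its problems once you actually read it and work through the examples. When I took those problems to the experts who had given it good reviews, it turned out that they had never implemented the examples in it at all. :-) It is a little ironic, but that is how things are. The code you find online is copy-pasted as well, and I seriously doubt that the people posting it ever ran the programs.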

0 0