Mahout Source Code Analysis: MinHash Clustering

Source: Internet · Site: 程序博客网 · Date: 2024/05/29 11:01

MinHash is a clustering technique; for details of the underlying principle, see the Taobao technical blog.

In Mahout it is implemented in the package org.apache.mahout.clustering.minhash: HashFunction is an interface, and HashFactory is a factory class that provides four hash function implementations.

The entry point is the class MinHashDriver. It provides no single-machine implementation; the algorithm runs only as a MapReduce job. Among the input parameters, note that keyGroups controls the length of the emitted key.

In MinHashMapper's setup phase, the input parameters are used to initialize the mapper's fields:

protected void setup(Context context) throws IOException, InterruptedException {
  super.setup(context);
  Configuration conf = context.getConfiguration();
  this.numHashFunctions = conf.getInt(MinhashOptionCreator.NUM_HASH_FUNCTIONS, 10);
  this.minHashValues = new int[numHashFunctions];
  this.bytesToHash = new byte[4];
  this.keyGroups = conf.getInt(MinhashOptionCreator.KEY_GROUPS, 1);
  this.minVectorSize = conf.getInt(MinhashOptionCreator.MIN_VECTOR_SIZE, 5);
  String htype = conf.get(MinhashOptionCreator.HASH_TYPE, "linear");
  this.debugOutput = conf.getBoolean(MinhashOptionCreator.DEBUG_OUTPUT, false);
  HashType hashType;
  try {
    hashType = HashType.valueOf(htype);
  } catch (IllegalArgumentException iae) {
    log.warn("No valid hash type found in configuration for {}, assuming type: {}",
        htype, HashType.LINEAR);
    hashType = HashType.LINEAR;
  }
  hashFunction = HashFactory.createHashFunctions(hashType, numHashFunctions);
}
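To give an idea of what HashFactory hands back for the default "linear" type, here is a hypothetical sketch of a linear hash family: each function is of the form h_i(x) = a_i * x + b_i with random per-function coefficients. The class name, coefficient generation, and masking below are illustrative assumptions, not Mahout's exact code.

```java
import java.util.Random;

// Hypothetical sketch of a "linear" hash family of the kind HashFactory builds.
// Not Mahout's implementation: coefficients and masking are illustrative only.
class LinearHashFamily {
  private final int[] a;
  private final int[] b;

  LinearHashFamily(int numFunctions, long seed) {
    Random rnd = new Random(seed);
    a = new int[numFunctions];
    b = new int[numFunctions];
    for (int i = 0; i < numFunctions; i++) {
      a[i] = rnd.nextInt(Integer.MAX_VALUE - 1) + 1; // non-zero slope
      b[i] = rnd.nextInt();
    }
  }

  // Hash a value with the i-th function; mask keeps the result non-negative.
  int hash(int i, int value) {
    return (a[i] * value + b[i]) & Integer.MAX_VALUE;
  }

  int size() {
    return a.length;
  }
}
```

Using several independent functions from one family is what lets the mapper build a signature of numHashFunctions minima per vector.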

In the map phase, numHashFunctions hash values are computed per input vector. The array is first filled with Integer.MAX_VALUE; whenever a hash function produces a smaller value for some element, it replaces the stored one, so each slot ends up holding the minimum hash over all elements of the vector:

for (int i = 0; i < numHashFunctions; i++) {
  minHashValues[i] = Integer.MAX_VALUE;
}
for (int i = 0; i < numHashFunctions; i++) {
  for (Vector.Element ele : featureVector) {
    int value = (int) ele.get();
    bytesToHash[0] = (byte) (value >> 24);
    bytesToHash[1] = (byte) (value >> 16);
    bytesToHash[2] = (byte) (value >> 8);
    bytesToHash[3] = (byte) value;
    int hashIndex = hashFunction[i].hash(bytesToHash);
    // if our new hash value is less than the old one, replace the old one
    if (minHashValues[i] > hashIndex) {
      minHashValues[i] = hashIndex;
    }
  }
}
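The signature computation can be tried outside Hadoop with a minimal sketch. The structure mirrors the loop above (fill with Integer.MAX_VALUE, keep the minimum per function), but it swaps Mahout's HashFunction objects for a simple stand-in hash with hard-coded mixing constants, which are an assumption of this sketch:

```java
import java.util.Arrays;

// Minimal stand-alone sketch of the map-phase signature computation.
// Each "hash function" here is just multiplication by an odd constant,
// masked to stay non-negative; Mahout uses HashFactory functions instead.
class MinHashSignature {
  private static final int[] MIX = {0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D};

  static int[] signature(int[] featureIds) {
    int numHashFunctions = MIX.length;
    int[] minHashValues = new int[numHashFunctions];
    Arrays.fill(minHashValues, Integer.MAX_VALUE);
    for (int i = 0; i < numHashFunctions; i++) {
      for (int value : featureIds) {
        int hash = (value * MIX[i]) & Integer.MAX_VALUE; // keep non-negative
        if (minHashValues[i] > hash) {
          minHashValues[i] = hash; // keep the smallest hash seen so far
        }
      }
    }
    return minHashValues;
  }
}
```

Because each slot is a minimum over the whole element set, the signature does not depend on element order, only on set membership.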

When emitting cluster information, each sample vector produces numHashFunctions records: the key is the concatenation of keyGroups min-hash values, and the value is either the item id of the original sample or, in debug mode, the sample vector itself:

for (int i = 0; i < numHashFunctions; i++) {
  StringBuilder clusterIdBuilder = new StringBuilder();
  for (int j = 0; j < keyGroups; j++) {
    clusterIdBuilder.append(minHashValues[(i + j) % numHashFunctions]).append('-');
  }
  // remove the last dash
  clusterIdBuilder.deleteCharAt(clusterIdBuilder.length() - 1);
  Text cluster = new Text(clusterIdBuilder.toString());
  Writable point;
  if (debugOutput) {
    point = new VectorWritable(featureVector.clone());
  } else {
    point = new Text(item.toString());
  }
  context.write(cluster, point);
}
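The key construction is the part controlled by keyGroups and is easy to get wrong at the wrap-around, so here is the same logic isolated as a plain method (a sketch, with the helper name ClusterKeys assumed for illustration):

```java
// Sketch of the cluster-key construction above: concatenate keyGroups
// consecutive signature entries (wrapping around) with '-' separators.
class ClusterKeys {
  static String key(int[] minHashValues, int start, int keyGroups) {
    StringBuilder clusterIdBuilder = new StringBuilder();
    for (int j = 0; j < keyGroups; j++) {
      clusterIdBuilder.append(minHashValues[(start + j) % minHashValues.length]).append('-');
    }
    // remove the trailing dash left by the loop
    clusterIdBuilder.deleteCharAt(clusterIdBuilder.length() - 1);
    return clusterIdBuilder.toString();
  }
}
```

With keyGroups > 1, two items land in the same cluster only if several of their min-hash values agree at once, which makes the key more selective.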

In MinHashReducer, the reduce phase gathers all map outputs that share a key; the cluster is emitted only if its size reaches the preset threshold minClusterSize:

protected void reduce(Text cluster, Iterable<Writable> points, Context context)
    throws IOException, InterruptedException {
  Collection<Writable> pointList = Lists.newArrayList();
  for (Writable point : points) {
    if (debugOutput) {
      Vector pointVector = ((VectorWritable) point).get().clone();
      Writable writablePointVector = new VectorWritable(pointVector);
      pointList.add(writablePointVector);
    } else {
      Writable pointText = new Text(point.toString());
      pointList.add(pointText);
    }
  }
  if (pointList.size() >= minClusterSize) {
    context.getCounter(Clusters.ACCEPTED).increment(1);
    for (Writable point : pointList) {
      context.write(cluster, point);
    }
  } else {
    context.getCounter(Clusters.DISCARDED).increment(1);
  }
}
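Stripped of Hadoop types and counters, the reduce step is just a group-and-threshold filter. A stand-alone sketch of that logic (class and method names are assumptions of this example):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of the reduce step: keep only clusters whose member
// count reaches minClusterSize, mirroring the ACCEPTED/DISCARDED split above.
class ClusterFilter {
  static Map<String, List<String>> filter(Map<String, List<String>> clusters,
                                          int minClusterSize) {
    Map<String, List<String>> accepted = new HashMap<>();
    for (Map.Entry<String, List<String>> e : clusters.entrySet()) {
      if (e.getValue().size() >= minClusterSize) {
        accepted.put(e.getKey(), new ArrayList<>(e.getValue()));
      }
      // otherwise the cluster is discarded, as in MinHashReducer
    }
    return accepted;
  }
}
```

The threshold prunes singleton or near-singleton buckets, which are usually hash accidents rather than meaningful clusters.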
MinHash clustering can also be viewed as a dimensionality-reduction process: instead of computing similarity between the original sample vectors directly, it checks whether their hash signatures collide. This works because the probability that two sets share the same minimum hash value equals their Jaccard similarity.
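This dimensionality-reduction view can be made concrete: the fraction of signature positions on which two items agree is an unbiased estimate of their Jaccard similarity. This is a standard MinHash property, not Mahout-specific code; the class name below is assumed for illustration:

```java
// Estimate Jaccard similarity from two equal-length MinHash signatures:
// the fraction of positions where the signatures agree.
class JaccardEstimate {
  static double estimate(int[] sigA, int[] sigB) {
    int agree = 0;
    for (int i = 0; i < sigA.length; i++) {
      if (sigA[i] == sigB[i]) {
        agree++;
      }
    }
    return (double) agree / sigA.length;
  }
}
```

The longer the signature (larger numHashFunctions), the lower the variance of this estimate.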
