Chinese Text Clustering with Spark

Source: Internet | Editor: 程序博客网 | Time: 2024/05/01 07:52

Spark Text Clustering
- Introduction to Spark MLlib
- Chinese word segmentation
- TF-IDF features
- Introduction to word2vec
- Text representation
- K-means and LDA clustering

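Clustering is a common family of unsupervised learning algorithms. It groups similar items and is widely used when the data carry no labels. This post considers the case where we have a large volume of text and need to find similar documents (a coarse categorization), and runs the experiment with Spark.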

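Introduction to Spark MLlib

MLlib is the machine learning library that ships with Spark. It provides APIs for feature engineering, classification, regression, clustering, collaborative filtering, and more.
(1) Feature engineering covers feature extraction, feature transformation, and feature selection.
Common feature extractors include TF-IDF, Word2Vec, and CountVectorizer.
Common feature transformers include tokenization, stop-word removal, n-grams, binarization, PCA, the discrete cosine transform (DCT), and one-hot encoding.
Common feature selectors include VectorSlicer, RFormula, and the chi-squared selector (ChiSqSelector).
(2) For classification and regression, it provides logistic regression, decision trees, random forests, GBDT, multilayer perceptrons (MLP), naive Bayes, and other methods.
(3) For clustering, it provides K-means, the LDA topic model, Gaussian mixture models, and others.
(4) It also supports collaborative filtering, which makes it convenient to build recommender systems.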

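Chinese word segmentation

In Python, jieba is a commonly used word segmentation tool. For finer-grained and more accurate Chinese segmentation, this post uses PyLTP (which supports custom dictionaries).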
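To illustrate why a custom dictionary matters for segmentation, here is a minimal sketch of forward maximum matching in plain Python. This is a toy technique chosen for illustration and is not how jieba or PyLTP work internally (real tools rely on far more sophisticated statistical models). The function name `forward_max_match` and both dictionaries are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionaries: the custom one adds a domain-specific term.
base_dict = {"中文", "文本", "聚类"}
custom_dict = base_dict | {"文本聚类"}

sentence = "中文文本聚类"
print(forward_max_match(sentence, base_dict))    # ['中文', '文本', '聚类']
print(forward_max_match(sentence, custom_dict))  # ['中文', '文本聚类']
```

With the extra dictionary entry, the domain term is kept as one token rather than being split in two, which is the kind of effect a custom dictionary is meant to have.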

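TF-IDF features

Term frequencies are used as text features: the corpus is quantified and turned into a matrix in which each element holds the frequency information of a term. The matrix is M×N, where M is the number of documents and N is the number of terms in the vocabulary. TF (term frequency) is how often a term occurs within a given document. IDF (inverse document frequency) measures how rare a term is across the corpus: the more documents contain the term, the lower its IDF.
IDF, in the smoothed form given in the Spark MLlib documentation:
$$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}$$
where |D| is the number of documents in the corpus and DF(t, D) is the number of documents that contain the term t.
$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$$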
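The TF-IDF definition can be sketched in plain Python on a tiny corpus. The sketch assumes TF is the raw count of a term in a document and that IDF uses the smoothed form log((|D| + 1) / (DF(t, D) + 1)); the function name `tfidf_matrix` and the sample documents are hypothetical, and this is not Spark's implementation.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows): rows is an M x N list of lists,
    one row per document and one column per vocabulary term."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # DF(t, D): number of documents that contain term t.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed form): log((|D| + 1) / (DF + 1)).
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        rows.append([tf[t] * idf[t] for t in vocab])
    return vocab, rows


# Hypothetical, already-segmented documents.
docs = [["spark", "text", "spark"], ["text", "cluster"]]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

In this toy corpus "text" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing to either row, while "spark" is weighted by its count of 2 in the first document.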

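Introduction to word2vec

word2vec is a tool developed by Google for turning words into vectors. It is trained with either the CBOW or the Skip-Gram algorithm. In Spark it is used as follows: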

    from pyspark.mllib.feature import Word2Vec

    # input_data: an RDD in which each element is the list of tokens of one segmented document
    word2vec = Word2Vec()
    model = word2vec.fit(input_data)
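The difference between CBOW and Skip-Gram lies in how training examples are built from a token sequence. The following plain-Python sketch only shows that data preparation step, not the neural network training itself; the function names and the `window` parameter are hypothetical and are not part of Spark's API.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: predict each context word from the center word.
    Returns a list of (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context.
    Returns a list of (context_words, center) examples."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples


tokens = ["spark", "text", "clustering"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```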

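Text representation

To quantify the text, TF-IDF term-frequency features are combined with word embeddings to represent each document. Let the TF-IDF matrix be S1 (an M×N matrix) and the word2vec embedding matrix be S2 (an N×K matrix). The documents are then represented as S = S1 × S2, an M×K matrix, so each document is the TF-IDF-weighted sum of its word vectors.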
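The product S = S1 × S2 can be sketched in plain Python to make the shapes concrete. The function name `document_vectors` and the toy matrices are hypothetical; in practice the rows of S1 would come from TF-IDF and the rows of S2 from a trained word2vec model, with both using the same vocabulary order.

```python
def document_vectors(tfidf, embeddings):
    """Multiply an M x N TF-IDF matrix by an N x K embedding matrix.
    Each output row is the TF-IDF-weighted sum of word vectors."""
    n_terms = len(embeddings)
    dim = len(embeddings[0])
    result = []
    for row in tfidf:
        if len(row) != n_terms:
            raise ValueError("TF-IDF row length must equal vocabulary size")
        vec = [sum(row[j] * embeddings[j][k] for j in range(n_terms))
               for k in range(dim)]
        result.append(vec)
    return result


# Toy data: M = 2 documents, N = 3 terms, K = 2 embedding dimensions.
s1 = [[1.0, 0.0, 2.0],
      [0.0, 3.0, 0.0]]
s2 = [[1.0, 0.0],
      [0.0, 1.0],
      [0.5, 0.5]]
print(document_vectors(s1, s2))  # [[2.0, 1.0], [0.0, 3.0]]
```

The resulting M×K rows are dense vectors of fixed length, which is the form a clustering algorithm such as K-means expects as input.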

K-means and LDA clustering

K-means clustering

    from numpy import array

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel

    # Reuse the running SparkContext, or create one if none exists
    sc = SparkContext.getOrCreate()

    # Load and parse the data
    data = sc.textFile("data/mllib/kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Build the model (cluster the data)
    clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

    # Evaluate clustering by computing Within Set Sum of Squared Errors
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        # Squared Euclidean distance to the assigned center
        # (no square root, so the total really is a sum of squared errors)
        return sum([x**2 for x in (point - center)])

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))

    # Save and load model
    clusters.save(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
    sameModel = KMeansModel.load(sc, "target/org/apache/spark/PythonKMeansExample/KMeansModel")
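The evaluation metric named in the K-means example, the within-set sum of squared errors (WSSSE), can be checked by hand on a few points without Spark. The sketch below assumes the usual definition: each point is assigned to its nearest center, and the squared Euclidean distances are summed. The function names and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda c: squared_distance(point, centers[c]))


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(squared_distance(p, centers[nearest_center(p, centers)])
               for p in points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
centers = [(0.0, 1.0), (10.0, 10.0)]
print([nearest_center(p, centers) for p in points])  # [0, 0, 1]
print(wssse(points, centers))                        # 2.0
```

A lower WSSSE for the same number of clusters means tighter clusters, so the value is often compared across several choices of k when picking the cluster count.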

LDA

    from pyspark import SparkContext
    from pyspark.mllib.clustering import LDA, LDAModel
    from pyspark.mllib.linalg import Vectors

    # Reuse the running SparkContext, or create one if none exists
    sc = SparkContext.getOrCreate()

    # Load and parse the data
    data = sc.textFile("data/mllib/sample_lda_data.txt")
    parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))

    # Index documents with unique IDs
    corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

    # Cluster the documents into three topics using LDA
    ldaModel = LDA.train(corpus, k=3)

    # Output topics. Each is a distribution over words (matching word count vectors)
    print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
          + " words):")
    topics = ldaModel.topicsMatrix()
    for topic in range(3):
        print("Topic " + str(topic) + ":")
        for word in range(0, ldaModel.vocabSize()):
            print(" " + str(topics[word][topic]))

    # Save and load model
    ldaModel.save(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
    sameModel = LDAModel.load(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
