Text clustering with gensim doc2vec + sklearn kmeans


The previous post used doc2vec for text similarity: given an input sentence, the model can find the sentences most similar to it. When analyzing a large corpus, though, feeding in sentences one at a time is impractical, and there is no way to know in advance roughly how the data breaks into categories. So I decided to try text clustering.

I picked k-means as the clustering method. As the previous post showed, doc2vec can compute a vector for each piece of text, and once every document has a vector, k-means is straightforward to apply.

I use the KMeans class from the sklearn library.
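
For reference, the KMeans interface is small; here is a minimal sketch on toy 2-D data (the array values are made up purely for illustration; the real input is one doc2vec vector per document):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: six points in 2-D.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

km = KMeans(n_clusters=2)   # the number of clusters must be chosen up front
km.fit(X)                   # learn the cluster centers
print(km.labels_)           # cluster index assigned to each training point
print(km.cluster_centers_)  # coordinates of the two learned centers
print(km.predict(X[:3]))    # assign (here: the first three) points to clusters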

The full program is as follows:

# coding:utf-8
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans


def get_dataset():
    """Read the pre-segmented corpus: one document per line, words separated by spaces."""
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
    print(len(docs))
    x_train = []
    for i, text in enumerate(docs):
        word_list = text.strip().split(' ')
        # Tag each document with its line number so it can be looked up later.
        x_train.append(TaggedDocument(word_list, tags=[i]))
    return x_train


def train(x_train, size=200, epochs=100):
    # Passing the corpus to the constructor builds the vocabulary and does an
    # initial training pass; train() then continues for `epochs` more passes.
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epochs)
    model_dm.save('model/model_dm')
    return model_dm


def cluster(x_train):
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")
    print("infer train vectors...")
    inferred_vectors = [model_dm.infer_vector(doc.words) for doc in x_train]
    print("train kmeans model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(inferred_vectors)
    # Only the first 100 documents are labeled and written out for inspection.
    labels = kmean_model.predict(inferred_vectors[0:100])
    with open("out/own_classify.txt", 'w', encoding='utf-8') as wf:
        for i in range(100):
            wf.write("%s\t%d\n" % ("".join(x_train[i].words), labels[i]))
    return kmean_model.cluster_centers_


if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
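
The program fixes n_clusters=15 with no particular justification. When the number of categories is unknown, one common approach (not part of the original post) is to scan a few values of k and compare silhouette scores. A sketch, where `vectors` stands for the inferred doc2vec vectors collected in cluster() above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def scan_k(vectors, k_values=(5, 10, 15, 20, 30)):
    # Fit k-means for each candidate k and report the silhouette score;
    # higher is better (range -1 to 1).
    for k in k_values:
        labels = KMeans(n_clusters=k).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        print("k=%d  silhouette=%.4f" % (k, score))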


The model was trained on entertainment- and video-related corpora. Video titles are proper nouns, yet the word segmenter split them apart, which hurt the clustering quality. It seems that entity recognition, to pick out proper nouns such as video titles before segmentation, really is necessary.
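
One lightweight workaround, assuming the segmenter is jieba (the post does not name it, and the titles below are hypothetical examples), is to register known video titles as user-dictionary words before segmenting, so each title survives as a single token:

# coding:utf-8
import jieba

# Hypothetical video titles; in practice these would come from a catalogue file.
for title in ["琅琊榜", "人民的名义"]:
    jieba.add_word(title)  # keep each title as one token during segmentation

print(" ".join(jieba.cut("我昨晚看了琅琊榜的大结局")))
# With the words registered, "琅琊榜" stays whole instead of being split apart.

This only helps for titles you already know; unseen titles would still require proper named-entity recognition.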

