Data Mining


Set-of-words model: the set formed by all words, in which each word appears only once.

Bag-of-words model: every word is tallied, recording how many times each word occurs.
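The difference is easiest to see on a single toy sentence (the sentence and vocabulary below are made up purely for illustration):

```python
from collections import Counter

sentence = ["dog", "bites", "dog"]
vocab = ["bites", "cat", "dog"]  # hypothetical vocabulary

# Set-of-words: 1 if the word appears at all, regardless of count
set_vec = [1 if w in sentence else 0 for w in vocab]

# Bag-of-words: how many times each word appears
counts = Counter(sentence)
bag_vec = [counts[w] for w in vocab]

print(set_vec)  # [1, 0, 1]
print(bag_vec)  # [1, 0, 2]
```

The repeated word "dog" is the only place the two vectors differ.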


train_x below contains 6 documents in total; each row is one sample, i.e. one document. Our goal is to turn train_x into a trainable matrix, that is, to generate a word vector for each sample. This can be done by building either a set-of-words model or a bag-of-words model over train_x.

train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]


1. Set-of-words model

Algorithm steps:

1) Merge all words into one set; here the resulting set has length wordSetLen = 31.

2) With sampleCnt = 6 documents/samples, build a sampleCnt * wordSetLen = 6 * 31 matrix; once filled with valid values, this is the final trainable matrix m.

3) Traverse the matrix m and fill in the 0/1 values: 0 means the word of the current column does not appear in the sample/document of the current row, and 1 means it does.

4) The result is a 6 * 31 trainable matrix.
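The steps above can be sketched in a few lines (using the train_x listed earlier; sorted() is added only to make the column order deterministic):

```python
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
           ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
           ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
           ["stop", "posting", "stupid", "worthless", "garbage"],
           ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
           ["quit", "buying", "worthless", "dog", "food", "stupid"]]

# Step 1: merge all words into one deduplicated vocabulary
wordSet = sorted(set(w for doc in train_x for w in doc))
wordSetLen = len(wordSet)

# Steps 2-4: build the sampleCnt * wordSetLen matrix of 0/1 values
m = [[1 if w in doc else 0 for w in wordSet] for doc in train_x]

print(len(m), wordSetLen)  # 6 31
```

Each row of m sums to the number of distinct words in that document, so the first row sums to 7.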


2. Bag-of-words model

In the bag-of-words model the training matrix contains not only 0s and 1s but other numbers as well; these record how many times each word occurs in the current sample.
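The fifth document in train_x is the only sample with a repeated word, so it is where the two models diverge. A quick check with the standard library's Counter:

```python
from collections import Counter

doc = ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"]
counts = Counter(doc)

# "him" occurs twice: its bag-of-words entry is 2,
# while its set-of-words entry would still be 1
print(counts["him"])  # 2
```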

# -*- coding: utf-8 -*-

def load_data():
    """ 1. Load train_x and the labels """
    train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]
    label = [0, 1, 0, 1, 0, 1]
    return train_x, label

def setOfWord(train_x):
    """ 2. Collect all words, deduplicated, into one list

    train_x: the document collection; each sample is one document
    wordSet: list built from the set of all words
    """
    wordList = []
    for sample in train_x:
        wordList.extend(sample)
    wordSet = list(set(wordList))
    return wordSet

def create_wordVec(sample, wordSet, mode="wordSet"):
    """ 3. Turn one sample into a word vector """
    length = len(wordSet)
    wordVec = [0] * length
    if mode == "wordSet":
        # set-of-words: mark presence with a 1
        for i in range(length):
            if wordSet[i] in sample:
                wordVec[i] = 1
    elif mode == "wordBag":
        # bag-of-words: count every occurrence
        for i in range(length):
            for word in sample:
                if word == wordSet[i]:
                    wordVec[i] += 1
    else:
        raise Exception("The mode must be wordSet or wordBag.")
    return wordVec

def main(mode="wordSet"):
    train_x, label = load_data()
    wordSet = setOfWord(train_x)
    # generate one word vector per sample in the requested mode
    train_matrix = [create_wordVec(sample, wordSet, mode) for sample in train_x]
    return train_matrix

if __name__ == "__main__":
    train_matrix = main("wordSet")