Data Mining
Set-of-words model: the set of distinct words; each word is recorded at most once.
Bag-of-words model: every word is counted, recording how many times it occurs.
train_x contains 6 documents in total; each row is one sample, i.e. one document. Our goal is to turn train_x into a trainable matrix, i.e. to generate a word vector for each sample. This can be done by building either a set-of-words model or a bag-of-words model over train_x.
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
["stop", "posting", "stupid", "worthless", "garbage"],
["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
["quit", "buying", "worthless", "dog", "food", "stupid"]]
1. Set-of-words model
Algorithm steps:
1) Merge all words into one set; the resulting set has length wordSetLen = 31.
2) With sampleCnt = 6 samples/documents, build a sampleCnt * wordSetLen = 6 * 31 matrix; once filled with valid values, this is the final trainable matrix m.
3) Traverse matrix m and fill in the 0/1 values: 0 means the word for the current column does not appear in the sample/document for the current row; 1 means it does.
4) The result is a 6 * 31 trainable matrix.
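The four steps above can be sketched in a few lines (a minimal illustration; the sorted vocabulary order here is an assumption for reproducible column positions — the full program below uses an unordered set):

```python
# Step 1: merge all words from the six documents into one vocabulary
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
           ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
           ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
           ["stop", "posting", "stupid", "worthless", "garbage"],
           ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
           ["quit", "buying", "worthless", "dog", "food", "stupid"]]

vocab = sorted(set(word for doc in train_x for word in doc))
print(len(vocab))  # 31 distinct words, i.e. wordSetLen = 31

# Steps 2-3 for the first document: 1 if the column's word occurs in it, else 0
vec = [1 if w in train_x[0] else 0 for w in vocab]
print(sum(vec))  # 7 — the first document contains 7 distinct words
```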
2. Bag-of-words model
In the bag-of-words model, the training matrix contains not only 0 and 1 but also larger numbers; each entry is the count of how many times the column's word occurs in the current sample.
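The difference shows up in the fifth document, where "him" appears twice: its bag-of-words entry is 2, while the set-of-words entry is capped at 1 (a minimal sketch using collections.Counter):

```python
from collections import Counter

doc = ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"]
counts = Counter(doc)          # word -> occurrence count

print(counts["him"])           # 2 in the bag-of-words model
print(min(counts["him"], 1))   # 1 in the set-of-words model
```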
# -*- coding: utf-8 -*-


def load_data():
    """1. Load train_x and the labels."""
    train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]
    label = [0, 1, 0, 1, 0, 1]
    return train_x, label


def setOfWord(train_x):
    """2. Collect every distinct word into one list.

    train_x: the document collection; each sample is one document.
    Returns wordSet, the list of all distinct words.
    """
    wordList = []
    for sample in train_x:
        wordList.extend(sample)
    wordSet = list(set(wordList))
    return wordSet


def create_wordVec(sample, wordSet, mode="wordSet"):
    """3. Turn one sample into a word vector."""
    length = len(wordSet)
    wordVec = [0] * length
    if mode == "wordSet":
        # Set-of-words: mark presence only (0 or 1)
        for i in range(length):
            if wordSet[i] in sample:
                wordVec[i] = 1
    elif mode == "wordBag":
        # Bag-of-words: count every occurrence
        for i in range(length):
            for word in sample:
                if word == wordSet[i]:
                    wordVec[i] += 1
    else:
        raise Exception("The mode must be wordSet or wordBag.")
    return wordVec


def main(mode="wordSet"):
    train_x, label = load_data()
    wordSet = setOfWord(train_x)
    train_matrix = []
    for sample in train_x:
        # Pass the requested mode through (the original hard-coded "wordBag" here)
        train_matrix.append(create_wordVec(sample, wordSet, mode))
    return train_matrix


if __name__ == "__main__":
    train_matrix = main("wordSet")
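As a compact cross-check of the full program, the same two matrices can be built with list comprehensions (a sketch; a sorted vocabulary is assumed here only to fix the column order):

```python
# Build both trainable matrices over the same corpus
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
           ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
           ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
           ["stop", "posting", "stupid", "worthless", "garbage"],
           ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
           ["quit", "buying", "worthless", "dog", "food", "stupid"]]
vocab = sorted(set(w for d in train_x for w in d))

set_matrix = [[1 if w in doc else 0 for w in vocab] for doc in train_x]
bag_matrix = [[doc.count(w) for w in vocab] for doc in train_x]

print(len(set_matrix), len(set_matrix[0]))  # 6 31 — the 6 * 31 matrix from the steps above
# Row 4 ("him licks ate my steak ...") is where the models differ: "him" occurs twice
print(max(bag_matrix[4]))  # 2
print(max(set_matrix[4]))  # 1
```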