CNN Sentiment Classification

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 16:14

For the CNN text-classification data and code, follow the link.

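The experiment uses book reviews from Dangdang (当当网). Examples:

Negative: 不喜欢其人，更不喜欢其文，古韵有些过头，矫情！ (I don't like the author, and I like the writing even less; the archaic flavor is overdone and affected!)
Positive: 这本书真的很好啊，应该打五星，只可惜评分好像无法修改，很有作者的个人风格和简介，能学到不少健康知识，作者您还能再出书吗？要是还出这样的好书，我一定买。谢谢当当热心的顾客推荐，我才能买到这些好书，也谢谢当当，祝好心人合家欢乐，身体健康！~ (This book is really good and deserves five stars; it's a pity the rating apparently can't be changed. It carries the author's personal style and teaches a lot about health. Will the author publish more books? If more books this good come out, I will certainly buy them. Thanks to the helpful Dangdang customers whose recommendations led me to these books, and thanks to Dangdang; I wish these kind people happy families and good health!)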

（1）Load data and labels

import codecs

import numpy as np


def load_data_and_labels():
    """
    Loads MR polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Load data from files
    positive_examples = list(codecs.open("./data/chinese/pos.txt", "r", "utf-8").readlines())
    positive_examples = [s.strip() for s in positive_examples]
    negative_examples = list(codecs.open("./data/chinese/neg.txt", "r", "utf-8").readlines())
    negative_examples = [s.strip() for s in negative_examples]
    # Split by words (list(s) splits each review into single characters)
    x_text = positive_examples + negative_examples
    # x_text = [clean_str(sent) for sent in x_text]
    x_text = [list(s) for s in x_text]
    # Generate labels: [0, 1] for positive, [1, 0] for negative
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]

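This function loads the positive and negative reviews from files, concatenates them, and splits every sentence into tokens, so x_text is a two-dimensional list holding the tokens of each review. In the code shown the split is list(s), so each token is a single character rather than a segmented word. The labels are concatenated in the same order; because they correspond to the two neurons of the binary-classification output layer, they are one-hot encoded as [0, 1] (positive) and [1, 0] (negative) and returned as y.
readlines() already returns a list in which each element is one line of text (a str ending in "\n"), so wrapping it in list() is unnecessary.
s.strip() removes leading and trailing whitespace from each sentence, including the trailing newline.
clean_str() (not shown here; its call is commented out in the code above) is meant for English text: it separates punctuation, contractions and the like by inserting spaces around them. The simplest way to tokenize an English str is to split on spaces, so once spaces have been added at every split point, calling split(" ") on the whole str completes the tokenization.
The labels are generated in a similar way.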
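The loading step can be tried without the data files. The following is a minimal sketch, assuming three short in-memory strings (fragments of the sample reviews above) stand in for the lines of pos.txt and neg.txt; it is an illustration, not the author's loader.

```python
import numpy as np

# In-memory stand-ins for the lines read from pos.txt and neg.txt (assumption).
positive_examples = ["这本书真的很好\n"]
negative_examples = ["矫情！\n", "不喜欢\n"]

# Drop the trailing newline and surrounding whitespace.
positive_examples = [s.strip() for s in positive_examples]
negative_examples = [s.strip() for s in negative_examples]

# list(s) splits a string into its individual characters.
x_text = [list(s) for s in positive_examples + negative_examples]

# One-hot labels: [0, 1] for positive, [1, 0] for negative.
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
y = np.concatenate([positive_labels, negative_labels], 0)

print(x_text[1])  # the characters of the first negative review
print(y.shape)    # one row per review, two columns
```

The rows of y follow the same order as x_text (positives first, then negatives), which is what keeps each review aligned with its label.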

（2）Pad sentence

def pad_sentences(sentences, padding_word="<PAD/>"):
    """
    Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences

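Why do the sentences need padding?
In the TextCNN model, input_x is a tf.placeholder, i.e. a tensor whose shape is fixed in advance, for example [batch, sequence_len]. The rows of a tensor cannot have different lengths, so we find the length of the longest sentence in the whole dataset and append padding words to every shorter sentence until all input sentences have the same length.
Because load_data_and_labels returns the sentences as a two-dimensional list, pad_sentences returns the data in the same form; the only difference is that some sentence lists now end with padding words.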
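To make the effect of padding concrete, here is a small sketch on two toy sentences. The helper name pad_to_longest is hypothetical (a compact restatement of the padding idea, not the function from the post), and the two sentences are made up for illustration.

```python
def pad_to_longest(sentences, padding_word="<PAD/>"):
    """Right-pad every token list to the length of the longest one."""
    sequence_length = max(len(s) for s in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]


# Two character-level sentences of different lengths (2 and 5 characters).
sentences = [list("很好"), list("不喜欢其人")]
padded = pad_to_longest(sentences)

print(padded[0])                # the short sentence followed by three padding words
print({len(s) for s in padded})  # a single length remains
```

After padding, every row has the same length, so the list can be turned into a rectangular array later on.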

（3）Build vocabulary

import itertools
from collections import Counter


def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]

Counter from the collections module counts word frequencies, for example:

from collections import Counter

sentence = ["i", "love", "mom", "mom", "mom", "me", "loves", "me"]
word_counts = Counter(sentence)
print(word_counts.most_common())
vocabulary_inv = [x[0] for x in word_counts.most_common()]
print(vocabulary_inv)
vocabulary_inv = list(sorted(vocabulary_inv))
print(vocabulary_inv)
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
print(vocabulary)

Output (Python 3.7+):
[('mom', 3), ('me', 2), ('i', 1), ('love', 1), ('loves', 1)]
['mom', 'me', 'i', 'love', 'loves']
['i', 'love', 'loves', 'me', 'mom']
{'i': 0, 'love': 1, 'loves': 2, 'me': 3, 'mom': 4}

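Counter takes an iterable as its argument, but here there are many sentence lists. How can all the words from the per-sentence word lists be produced by one efficient iterator?
This is where itertools.chain(*iterables) comes in.

It takes several iterables as arguments but returns a single iterator that yields the contents of each argument in turn, as if they all came from one sequence.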

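This gives the word-frequency counts over the whole dataset, word_counts.
To build the vocabulary, take the first element of each pair in word_counts, i.e. the word itself (Counter effectively deduplicates the words here). build_vocab as shown keeps the most_common() order, so indices follow descending frequency. The standalone example above additionally applies sorted(), which gives lexicographic order instead, with the words that sort first receiving the smallest indices.
Then build a dict that stores the index of each word; this is the vocabulary variable.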
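The combination of itertools.chain and Counter described above can be sketched on a few toy sentences. The sentences are made up for illustration; the mapping is built in frequency order, as in build_vocab.

```python
import itertools
from collections import Counter

# Three tokenized toy sentences.
sentences = [["i", "love", "mom"], ["mom", "loves", "me"], ["me", "mom"]]

# chain(*sentences) yields every token of every sentence, in order,
# without first building one large concatenated list.
flat = list(itertools.chain(*sentences))

word_counts = Counter(itertools.chain(*sentences))
vocabulary_inv = [word for word, _ in word_counts.most_common()]
vocabulary = {word: index for index, word in enumerate(vocabulary_inv)}

print(flat)
print(vocabulary)  # the most frequent word gets index 0
```

itertools.chain.from_iterable(sentences) produces the same stream without unpacking the outer list into arguments, which can be preferable when there are very many sentences.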

（4）Build input data

import numpy as np


def build_input_data(sentences, labels, vocabulary):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    # x is the index matrix: vocabulary[word] looks up the index of each word
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
    y = np.array(labels)
    return [x, y]

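From the functions above we now have the two-dimensional list of tokenized sentences, the labels that correspond to the sentences, and the vocabulary dict that maps each word to its index.
However, sentences still stores every word as a string, which takes a lot of memory on a large dataset. It is better to store each word's index instead: an index is an int and takes less space.
So we use the vocabulary just built to look up every word in the two-dimensional sentences list, producing a two-dimensional list of word indices, and finally convert that list into a two-dimensional numpy array.
The labels are already a two-dimensional list of 0s and 1s, so they can be converted to an array directly.
Once converted to arrays, both can be used directly as the CNN's input and labels.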
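The lookup step can be sketched on two already padded toy sentences. The vocabulary below is written by hand for illustration (an assumption), rather than produced by build_vocab.

```python
import numpy as np

# Two padded, character-level sentences and their one-hot labels.
padded = [["很", "好", "<PAD/>"], ["不", "喜", "欢"]]
labels = [[0, 1], [1, 0]]

# Hand-written vocabulary for this example only.
vocabulary = {"<PAD/>": 0, "很": 1, "好": 2, "不": 3, "喜": 4, "欢": 5}

# Replace every token by its integer index, then build rectangular arrays.
x = np.array([[vocabulary[word] for word in sentence] for sentence in padded])
y = np.array(labels)

print(x)
print(x.shape, y.shape)
```

Because every sentence was padded to the same length, x is a regular two-dimensional integer array with one row per sentence.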

（5）Load data

def load_data():
    """
    Loads and preprocesses data for the MR dataset.
    Returns input vectors, labels, vocabulary, and inverse vocabulary.
    """
    # Load and preprocess data
    sentences, labels = load_data_and_labels()
    sentences_padded = pad_sentences(sentences)
    vocabulary, vocabulary_inv = build_vocab(sentences_padded)
    x, y = build_input_data(sentences_padded, labels, vocabulary)
    return [x, y, vocabulary, vocabulary_inv]

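Finally, load_data ties the processing functions above together:

1. Load the raw data from the text files, held at first as a list of sentences, then split every sentence into tokens (the clean_str call is commented out in the code shown) to get sentences, a two-dimensional list of tokens; the labels are [0, 1] and [1, 0].
2. Find the maximum sentence length and pad the shorter sentences.
3. Build the vocabulary from the data and obtain the index of each word.
4. Convert sentences from a two-dimensional list of str into int indices, and return two-dimensional numpy arrays to serve as the model's input and labels.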

（6）Generate batch

import numpy as np


def batch_iter(data, batch_size, num_epochs):
    """
    Generates a batch iterator for a dataset.
    """
    data = np.array(data)
    data_size = len(data)
    # Ceiling division: avoids yielding an empty batch when
    # data_size is an exact multiple of batch_size
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    for epoch in range(num_epochs):
        # Shuffle the data at each epoch
        shuffle_indices = np.random.permutation(np.arange(data_size))
        shuffled_data = data[shuffle_indices]
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]

During training, define batches = batch_iter(…) once; the training loop then only needs a for loop over batches to process each batch of data.

batches = batch_iter(...)
for batch in batches:
    # process the batch
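The loop pattern above can be exercised on toy data. This is a minimal sketch: iter_batches and its seed parameter are hypothetical names for a compact batch iterator in the same spirit as batch_iter, and five integers stand in for the training samples.

```python
import numpy as np


def iter_batches(data, batch_size, num_epochs, seed=0):
    """Yield shuffled mini-batches; the last batch of an epoch may be smaller."""
    data = np.array(data)
    data_size = len(data)
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    rng = np.random.RandomState(seed)  # fixed seed only to make the demo repeatable
    for _ in range(num_epochs):
        # Reshuffle at the start of every epoch.
        shuffled = data[rng.permutation(data_size)]
        for batch_num in range(num_batches_per_epoch):
            start = batch_num * batch_size
            end = min(start + batch_size, data_size)
            yield shuffled[start:end]


# Five toy samples, batches of two, two epochs.
batches = list(iter_batches(list(range(5)), batch_size=2, num_epochs=2))
print([len(b) for b in batches])  # → [2, 2, 1, 2, 2, 1]
```

Each epoch visits every sample exactly once in a fresh random order; with five samples and a batch size of two, the third batch of each epoch holds the single remaining sample.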

Yield
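yield is used somewhat like return, except that the function returns a generator.
To master yield, you must understand its key point: when you call the function, the code you wrote in the function body does not actually run. The function only returns a generator object. It is a bit of a clever trick :-)

Your code then runs each time the for loop uses the generator.

Now for the hardest part to explain:
The first time the for loop calls the generator object created from your function, it runs the code in the function from the beginning until it reaches yield, and returns the first value. Each later call runs the loop in your function once more and returns the next value, until there are no values left to return. The generator is considered empty once the function runs without reaching yield again. This happens either because the loop has come to an end or because an "if/else" condition is no longer satisfied.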
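The lazy behavior described above can be observed directly. In this small sketch, count_up_to and the events list are illustrative names; the list records when the function body actually starts running.

```python
events = []


def count_up_to(limit):
    """Yield 1, 2, ..., limit, recording when the body first runs."""
    events.append("body started")
    n = 1
    while n <= limit:
        yield n
        n += 1


gen = count_up_to(3)             # creates the generator; the body has not run yet
events_after_call = list(events)

first = next(gen)                # runs the body up to the first yield
events_after_next = list(events)

rest = list(gen)                 # resumes after each yield until the loop ends
leftover = list(gen)             # an exhausted generator yields nothing more

print(events_after_call, first, events_after_next, rest, leftover)
```

Calling the function only builds the generator; the body starts on the first next() (which is what a for loop calls behind the scenes), and once the while loop finishes the generator is exhausted.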
