Word Embedding and the Skip-Gram Model in Practice

What Is Word Embedding?

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

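Word embedding is essentially an advanced way of turning words into vectors. It is a mapping that sends each word to a point in the n-dimensional space R^n. After this vectorization, words of the same kind tend to lie closer together. For example, suppose that in a corpus "I like apple" and "I like watermelon" occur often, while "apple" and "China" rarely occur together. Then the distance between the vectors for "apple" and "watermelon" should be smaller than the distance between the vectors for "apple" and "China".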
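To make the distance claim concrete, here is a minimal sketch. The three 3-dimensional vectors are made-up values chosen only for illustration (they are not learned embeddings), and `euclidean` is a hypothetical helper name.

```python
import math

# Hypothetical toy vectors, invented for illustration only (not trained embeddings).
toy_vectors = {
    "apple":      [0.9, 0.8, 0.1],
    "watermelon": [0.8, 0.9, 0.2],
    "China":      [0.1, 0.2, 0.9],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_fruit = euclidean(toy_vectors["apple"], toy_vectors["watermelon"])
d_other = euclidean(toy_vectors["apple"], toy_vectors["China"])
print(d_fruit < d_other)  # the two fruit vectors are the closer pair
```

A good embedding is expected to produce this kind of ordering on its own; here it holds only because the toy values were picked that way.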


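In addition, with a bag-of-words model a word's vector is usually very high-dimensional and most of its elements are 0, especially under the one-hot model. With word embedding, by contrast, a word typically needs only a real-valued vector of 150-300 dimensions.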

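Skip-Gram is one method for producing such word vectors. Starting from one-hot encodings of the words, it uses a simple classification neural network to learn a vector for each word. Skip-Gram also takes a word's context into account: training samples are built from the words adjacent to each word in the corpus. The "skip" in Skip-Gram marks the boundary of that context. Take the sentence: I live in china and I like Chinese food. With "china" as the center word and skip=3, every word within three positions of "china" counts as one of its context words.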
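The context window for the example sentence can be sketched as follows. This assumes plain whitespace tokenization, and `context_words` is a hypothetical helper introduced here, not a function from the article.

```python
def context_words(tokens, center_index, skip):
    """Return the tokens within `skip` positions of tokens[center_index]."""
    left = tokens[max(0, center_index - skip):center_index]
    right = tokens[center_index + 1:center_index + 1 + skip]
    return left + right

tokens = "I live in china and I like Chinese food".split()
center = tokens.index("china")
print(context_words(tokens, center, skip=3))
# ['I', 'live', 'in', 'and', 'I', 'like']
```

Each returned word, paired with the center word, would form one training sample.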


Next, let's see how to vectorize words with Skip-Gram.


Pumas are large, cat-like animals which are found in America. When reports came into London Zoo that a wild puma had been spotted forty-five miles south of London, they were not taken seriously. However, as the evidence began to accumulate, experts from the Zoo felt obliged to investigate, for the descriptions given by people who claimed to have seen the puma were extraordinarily similar. The hunt for the puma began in a small village where a woman picking blackberries saw ‘a large cat’ only five yards away from her. It immediately ran away when she saw it, and experts confirmed that a puma will not attack a human being unless it is cornered. The search proved difficult, for the puma was often observed at one place in the morning and at another place twenty miles away in the evening. Wherever it went, it left behind it a trail of dead deer and small animals like rabbits. Paw prints were seen in a number of places and puma fur was found clinging to bushes. Several people complained of ‘cat-like noises’ at night and a businessman on a fishing trip saw the puma up a tree. The experts were now fully convinced that the animal was a puma, but where had it come from? As no pumas had been reported missing from any zoo in the country, this one must have been in the possession of a private collector and somehow managed to escape. The hunt went on for several weeks, but the puma was not caught. It is disturbing to think that a dangerous wild animal is still at large in the quiet countryside.

1. Clean the data: strip the punctuation and extract the words

    __author__ = 'jmh081701'
    import re

    def getWords(data):
        rule = r"([A-Za-z-]+)"
        pattern = re.compile(rule)
        words = pattern.findall(data)
        return words


    >>words = getWords(data)
    >>print(words[0:5])
    ['Pumas', 'are', 'large', 'cat-like', 'animals']
2. Find how many distinct words there are and count how often each occurs
    def enumWords(words):
        rst = {}
        for each in words:
            if each not in rst:
                rst.setdefault(each, 1)
            else:
                rst[each] += 1
        return rst


    >>words = getWords(data)
    >>words = enumWords(words)
    >>print(len(words))
    154
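The same counting step can also be written with the standard library's `collections.Counter`. The sketch below is an alternative formulation, not the article's own code: `count_words` and the short `sample` string are assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(text):
    """Extract words (letters and hyphens) and count how often each occurs."""
    words = re.findall(r"[A-Za-z-]+", text)
    return Counter(words)

# Short stand-in text; the article applies the same idea to the full passage.
sample = "Pumas are large, cat-like animals. Pumas are found in America."
counts = count_words(sample)
print(len(counts))            # number of distinct words
print(counts.most_common(2))  # the two most frequent words with their counts
```

`Counter` is a `dict` subclass, so `list(counts)` still yields the distinct words.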



    >>vocaburary=list(words)
    >>print(vocaburary[0:5])
    ['accumulate', 'which', 'reported', 'spotted', 'think']

4. Use one-hot encoding for a preliminary vectorization of each word



The one-hot result for 'accumulate' is then: [1,0,0,...,0,0,0]

    import numpy

    def onehot(word, vocaburary):
        l = len(vocaburary)
        vec = numpy.zeros(shape=[1, l], dtype=numpy.float32)
        index = vocaburary.index(word)
        vec[0][index] = 1.0
        return vec


    >>print(onehot(vocaburary[0],vocaburary))
    [[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  ······ 0.]]


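    import random

    def getContext(words):
        rst = []
        for index, word in enumerate(words):
            # Each center word gets a random window size between 1 and 5.
            skip_size = random.randint(1, 5)
            # Words to the left of the center word.
            for i in range(max(0, index - skip_size), index):
                rst.append([word, words[i]])
            # Words to the right; the +1 makes this side cover skip_size words as well.
            for i in range(index + 1, min(len(words), index + skip_size + 1)):
                rst.append([word, words[i]])
        return rst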


    >>Context=getContext(getWords(data))
    >>print(Context[0:5])
    [['Pumas', 'are'], ['are', 'Pumas'], ['are', 'large'], ['are', 'cat-like'], ['large', 'are']]



We take the one-hot result of 'Pumas' as X and the one-hot result of 'are' as Y.

    >>Context=getContext(getWords(data))
    >>X=onehot(Context[0][0],vocaburary)
    >>Y=onehot(Context[0][1],vocaburary)
    >>print(X)
    [[ 0.  ···  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  ···]]
    >>print(Y)
    [[ 0.  ···  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
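The example above encodes a single pair. To train on every pair at once, the one-hot rows can be stacked into two matrices. The sketch below is one possible way to do that; `build_training_set` is a hypothetical helper, and the three pairs and the three-word vocabulary are small stand-ins for the article's full data.

```python
import numpy

def build_training_set(pairs, vocabulary):
    """Stack one-hot rows: X[i] encodes the center word, Y[i] its context word."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    X = numpy.zeros((len(pairs), len(vocabulary)), dtype=numpy.float32)
    Y = numpy.zeros((len(pairs), len(vocabulary)), dtype=numpy.float32)
    for row, (center, context) in enumerate(pairs):
        X[row, index_of[center]] = 1.0
        Y[row, index_of[context]] = 1.0
    return X, Y

# Stand-in data for illustration only.
pairs = [["Pumas", "are"], ["are", "Pumas"], ["are", "large"]]
vocabulary = ["Pumas", "are", "large"]
X, Y = build_training_set(pairs, vocabulary)
print(X.shape, Y.shape)
```

Looking words up through a dictionary avoids a repeated `list.index` scan, which matters once the vocabulary grows.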

OK, so the input and the output are both 154-dimensional one-hot vectors, each stored as a 1 x 154 row.

Input layer: 1 x 154
Hidden layer: weight matrix W, shape: 154 x 30
Bias: b, shape: 1 x 30
Activation function: none
Output layer: weight matrix W', shape: 30 x 154
Bias: b', shape: 1 x 154
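The layer shapes listed above can be sketched in NumPy as follows. Several things here are assumptions rather than details taken from the article: the softmax output with a cross-entropy loss, plain gradient descent, the learning rate and random seed, and the names `softmax`, `init_params`, `forward` and `train_step`. The three training pairs are placeholder word indices.

```python
import numpy

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = numpy.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_params(vocab_size, embed_size, seed=0):
    rng = numpy.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (vocab_size, embed_size))   # hidden-layer weights
    b = numpy.zeros((1, embed_size))                      # hidden-layer bias
    W2 = rng.normal(0.0, 0.1, (embed_size, vocab_size))  # output-layer weights (W')
    b2 = numpy.zeros((1, vocab_size))                     # output-layer bias (b')
    return W, b, W2, b2

def forward(X, W, b, W2, b2):
    hidden = X @ W + b                  # no activation function
    probs = softmax(hidden @ W2 + b2)   # assumed softmax over the vocabulary
    return hidden, probs

def train_step(X, Y, params, lr=0.5):
    """One gradient-descent step on the mean cross-entropy; returns the loss before the update."""
    W, b, W2, b2 = params
    hidden, probs = forward(X, W, b, W2, b2)
    loss = -numpy.mean(numpy.sum(Y * numpy.log(probs + 1e-12), axis=1))
    d_logits = (probs - Y) / X.shape[0]
    d_hidden = d_logits @ W2.T          # computed before W2 is updated
    W2 -= lr * hidden.T @ d_logits
    b2 -= lr * d_logits.sum(axis=0, keepdims=True)
    W -= lr * X.T @ d_hidden
    b -= lr * d_hidden.sum(axis=0, keepdims=True)
    return loss

vocab_size, embed_size = 154, 30
params = init_params(vocab_size, embed_size)
X = numpy.eye(vocab_size)[[0, 1, 1]]   # three one-hot center words (placeholder indices)
Y = numpy.eye(vocab_size)[[1, 0, 2]]   # their one-hot context words
losses = [train_step(X, Y, params) for _ in range(50)]
print(losses[0] > losses[-1])
embedding = params[0]                  # row i of W is the 30-dimensional vector of word i
print(embedding.shape)
```

Because each input is one-hot, `X @ W` simply selects one row of W, which is why the rows of W can be read off as the word vectors after training.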
