Deeplearning4j's Introduction to Word2vec

Source: Internet · Editor: 程序博客网 · Date: 2024/06/05 23:58

Word2Vec

Contents

  • Introduction
  • Neural Word Embeddings
  • Amusing Word2vec Results
  • Just Give Me the Code
  • Anatomy of Word2Vec
  • Setup, Load and Train
  • A Code Example
  • Troubleshooting & Tuning Word2Vec
  • Word2vec Use Cases
  • Foreign Languages
  • GloVe (Global Vectors) & Doc2Vec
Introduction to Word2vec

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Deeplearning4j implements a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs.

Word2vec's applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.
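The notion of transition probabilities between discrete, co-occurring states can be sketched in a few lines of plain Python. This is a toy illustration, not part of Word2vec itself; the sequence and the function name are invented for the example:

```python
from collections import Counter

# A toy "corpus" of discrete states -- these could just as easily be
# words, genes, song IDs, or follower IDs.
sequence = ["a", "b", "a", "b", "c", "a", "b"]

def transition_probs(states):
    """Estimate P(next | current) from adjacent pairs in a sequence."""
    pairs = Counter(zip(states, states[1:]))   # count each (current, next) pair
    totals = Counter(states[:-1])              # count how often each state precedes another
    return {(cur, nxt): n / totals[cur] for (cur, nxt), n in pairs.items()}

probs = transition_probs(sequence)
print(probs[("a", "b")])  # "a" is always followed by "b" here -> 1.0
print(probs[("b", "c")])  # "b" is followed by "c" half the time -> 0.5
```

Word2vec generalizes this idea: rather than tabulating raw transition counts, it learns dense vectors whose geometry encodes which states tend to co-occur.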

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word's meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g. "man" is to "boy" what "woman" is to "girl"), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.
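Associations like "man is to boy as woman is to girl" are typically recovered with simple vector arithmetic: vec("boy") - vec("man") + vec("woman") lands near vec("girl"). A minimal sketch with hand-picked 2-d toy vectors (the numbers are invented purely for illustration; real embeddings have hundreds of dimensions and are learned from data):

```python
# Hypothetical toy vectors: one axis loosely encoding "age",
# the other loosely encoding "gender".
vectors = {
    "man":   [1.0, 4.0],
    "woman": [1.0, 1.0],
    "boy":   [3.0, 4.0],
    "girl":  [3.0, 1.0],
    "king":  [5.0, 4.0],
    "queen": [5.0, 1.0],
}

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    def sq_dist(w):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[w]))
    # Exclude the query words themselves, as word2vec tools usually do.
    return min((w for w in vectors if w not in (a, b, c)), key=sq_dist)

print(analogy("man", "boy", "woman"))  # -> "girl"
```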


The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.


When measuring cosine similarity, no similarity is expressed as a 90-degree angle, while total similarity of 1 is a 0-degree angle, i.e. complete overlap: Sweden equals Sweden, while Norway has a cosine distance of 0.760124 from Sweden, the highest of any country.
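Cosine similarity is just the cosine of the angle between two vectors: orthogonal vectors (90 degrees) score 0, and identical directions (0 degrees) score 1. A minimal stdlib-only sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal, 90 degrees -> 0.0
print(cosine_similarity([3, 4], [3, 4]))  # identical,  0 degrees  -> 1.0
```

Querying a trained model for a word's nearest neighbors amounts to ranking every other word's vector by this score.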


Vocabulary:

discrete (adj.): separate, individually distinct

co-occur (v.): to occur together

detect (vt.): to discover, to find

mathematically (adv.): in mathematical terms

numerical (adj.): expressed as numbers

representation (n.): a depiction in another form


References:

http://deeplearning4j.org/word2vec
