Conditional Random Fields (CRF)
When labeling a sequence of samples, we do not treat their classes as independent of one another; instead, the labels depend on their order. In part-of-speech tagging, for example, even if each of two adjacent words is almost always tagged as a verb in other corpora, the two are still unlikely to both be verbs at the same time. This is how we exploit the ordering information in the samples.
Based on this idea, we can build a set of feature functions to evaluate a label sequence.
Introduction to Conditional Random Fields
Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?
One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.
By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.
Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.
Part-of-Speech Tagging
Let’s go into some more detail, using the more common example of part-of-speech tagging.
In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.
For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.
So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions $f_j$.
Feature Functions in a CRF
In a CRF, each feature function is a function that takes in as input:

- a sentence $s$
- the position $i$ of a word in the sentence
- the label $l_i$ of the current word
- the label $l_{i-1}$ of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1).
(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)
For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.
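To make this concrete, here’s a minimal Python sketch of what such a feature function might look like (the function name and the ADVERB/ADJECTIVE-style tag strings are illustrative choices of mine, not part of any particular library):

```python
def feature_prev_word_is_very(sentence, i, curr_label, prev_label):
    """Fires when the current word follows "very" and is labeled ADJECTIVE.

    sentence:   list of word strings
    i:          position of the current word
    curr_label: label proposed for word i      (l_i)
    prev_label: label proposed for word i - 1  (l_{i-1})
    """
    if i > 0 and sentence[i - 1].lower() == "very" and curr_label == "ADJECTIVE":
        return 1
    return 0
```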
Features to Probabilities
Next, assign each feature function $f_j$ a weight $\lambda_j$ (I’ll talk below about how to learn these weights from the data). Given a sentence $s$, we can now score a labeling $l$ of $s$ by adding up the weighted features over all words in the sentence:

$$score(l \mid s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_j f_j(s, i, l_i, l_{i-1})$$

(The first sum runs over each feature function $j$, and the inner sum runs over each position $i$ of the sentence.)

Finally, we can transform these scores into probabilities $p(l \mid s)$ between 0 and 1 by exponentiating and normalizing:

$$p(l \mid s) = \frac{\exp[score(l \mid s)]}{\sum_{l'} \exp[score(l' \mid s)]}$$
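To show how the score and the normalization fit together, here’s a toy Python sketch (the helper names and the brute-force enumeration over labelings are mine; real implementations compute the normalizer with dynamic programming, since the number of labelings grows exponentially):

```python
import itertools
import math

def score(labeling, sentence, feature_functions, weights):
    """score(l | s) = sum over j and i of lambda_j * f_j(s, i, l_i, l_{i-1})."""
    total = 0.0
    for weight, f in zip(weights, feature_functions):
        for i in range(len(sentence)):
            prev_label = labeling[i - 1] if i > 0 else None
            total += weight * f(sentence, i, labeling[i], prev_label)
    return total

def probability(labeling, sentence, feature_functions, weights, tagset):
    """p(l | s): exponentiate the score and normalize over all labelings."""
    numerator = math.exp(score(labeling, sentence, feature_functions, weights))
    denominator = sum(
        math.exp(score(list(l), sentence, feature_functions, weights))
        for l in itertools.product(tagset, repeat=len(sentence))
    )
    return numerator / denominator
```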
Example Feature Functions
So what do these feature functions look like? Examples of POS tagging features could include:
- $f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the $i$-th word ends in "-ly"; 0 otherwise. If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.
- $f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise. Again, if the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., "Is this a sentence beginning with a verb?") are preferred.
- $f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise. Again, a positive weight for this feature means that adjectives tend to be followed by nouns.
- $f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION; 0 otherwise. A negative weight $\lambda_4$ for this feature would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.
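For concreteness, here’s how a couple of these might look as Python functions in the same style as the earlier sketch (the tag strings are illustrative):

```python
def f3_adjective_then_noun(sentence, i, curr_label, prev_label):
    # Fires when an adjective is immediately followed by a noun.
    return 1 if prev_label == "ADJECTIVE" and curr_label == "NOUN" else 0

def f4_preposition_then_preposition(sentence, i, curr_label, prev_label):
    # Fires when two prepositions appear back to back; a negative
    # weight on this feature penalizes such labelings.
    return 1 if prev_label == "PREPOSITION" and curr_label == "PREPOSITION" else 0

# A list like this (with a matching list of weights) is exactly what the
# score and probability helpers sketched above consume.
feature_functions = [f3_adjective_then_noun, f4_preposition_then_preposition]
```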
And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.
Now let’s step back and compare CRFs to some other common machine learning techniques.
Smells like Logistic Regression…
The form of the CRF probabilities $p(l \mid s) = \frac{\exp[\sum_j \sum_i \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l'} \exp[\sum_j \sum_i \lambda_j f_j(s, i, l'_i, l'_{i-1})]}$ might look familiar.
That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.
Looks like HMMs…
Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

$$p(l, s) = p(l_1) \prod_i p(l_i \mid l_{i-1}) p(w_i \mid l_i)$$

where $p(l_i \mid l_{i-1})$ are transition probabilities (e.g., the probability that a preposition is followed by a noun), and $p(w_i \mid l_i)$ are emission probabilities (e.g., the probability that a noun emits the word "dad").
So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.
Note that the log of the HMM probability is

$$\log p(l, s) = \log p(l_1) + \sum_i \log p(l_i \mid l_{i-1}) + \sum_i \log p(w_i \mid l_i).$$

This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.
That is, we can build a CRF equivalent to any HMM by…
- For each HMM transition probability $p(l_i = y \mid l_{i-1} = x)$, define a set of CRF transition features of the form $f_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $l_i = y$ and $l_{i-1} = x$, and 0 otherwise. Give each of these features a weight of $w_{x,y} = \log p(l_i = y \mid l_{i-1} = x)$.
- Similarly, for each HMM emission probability $p(w_i = z \mid l_i = x)$, define a set of CRF emission features of the form $g_{x,z}(s, i, l_i, l_{i-1}) = 1$ if $w_i = z$ and $l_i = x$, and 0 otherwise. Give each of these features a weight of $w_{x,z} = \log p(w_i = z \mid l_i = x)$.

Thus, the score $p(l \mid s)$ computed by a CRF using these feature functions is precisely proportional to the probability computed by the associated HMM, and so every HMM is equivalent to some CRF.
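As a sketch of this construction in Python (assuming the HMM’s probabilities come as plain dictionaries; the names here are illustrative, not from any library):

```python
import math

def hmm_to_crf(transition_probs, emission_probs):
    """Build CRF (feature, weight) pairs that reproduce a given HMM.

    transition_probs[(x, y)] = p(l_i = y | l_{i-1} = x)
    emission_probs[(x, z)]   = p(w_i = z | l_i = x)
    Assumes all probabilities are nonzero (so the logs are defined).
    """
    features = []
    for (x, y), p in transition_probs.items():
        # Binary indicator that fires on the transition x -> y,
        # weighted by the log transition probability.
        def f(s, i, li, lprev, x=x, y=y):
            return 1 if lprev == x and li == y else 0
        features.append((f, math.log(p)))
    for (x, z), p in emission_probs.items():
        # Binary indicator that fires when label x emits word z,
        # weighted by the log emission probability.
        def g(s, i, li, lprev, x=x, z=z):
            return 1 if li == x and s[i] == z else 0
        features.append((g, math.log(p)))
    return features
```

(The x=x, y=y default arguments freeze the loop variables so each closure keeps its own transition, rather than all sharing the last one.)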
However, CRFs can model a much richer set of label distributions as well, for two main reasons:
- CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.
- CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g., $0 \leq p(w_i \mid l_i) \leq 1$ and $\sum_w p(w_i = w \mid l_i) = 1$), the weights of a CRF are unrestricted (e.g., $\log p(w_i \mid l_i)$ can be anything it wants).
Learning Weights
Let’s go back to the question of how to learn the feature weights in a CRF. One way is (surprise) to use gradient ascent.
Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones, for each training example…
- Go through each feature function $f_j$, and calculate the gradient of the log probability of the training example with respect to $\lambda_j$: $\frac{\partial}{\partial \lambda_j} \log p(l \mid s) = \sum_{i=1}^{n} f_j(s, i, l_i, l_{i-1}) - \sum_{l'} p(l' \mid s) \sum_{i=1}^{n} f_j(s, i, l'_i, l'_{i-1})$
- Note that the first term in the gradient is the contribution of feature $f_j$ under the true label, and the second term is the expected contribution of feature $f_j$ under the current model.
- Move $\lambda_j$ in the direction of the gradient: $\lambda_j = \lambda_j + \alpha \left[ \sum_{i=1}^{n} f_j(s, i, l_i, l_{i-1}) - \sum_{l'} p(l' \mid s) \sum_{i=1}^{n} f_j(s, i, l'_i, l'_{i-1}) \right]$, where $\alpha$ is some learning rate.
- Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).
In other words, every step takes the difference between what we want the model to learn and the model’s current state, and moves $\lambda_j$ in the direction of this difference.
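Here’s what one such update might look like in Python, reusing the toy score and probability helpers sketched earlier (again, the brute-force sum over all labelings l' is only feasible for tiny examples; real trainers compute this expectation with dynamic programming):

```python
import itertools

def gradient_step(j, sentence, true_labeling, feature_functions, weights,
                  tagset, learning_rate=0.1):
    """One gradient-ascent update of weight lambda_j on one training example."""
    f = feature_functions[j]

    def total_f(labeling):
        # sum over i of f_j(s, i, l_i, l_{i-1})
        return sum(
            f(sentence, i, labeling[i], labeling[i - 1] if i > 0 else None)
            for i in range(len(sentence))
        )

    # Contribution of f_j under the true labeling ...
    empirical = total_f(true_labeling)
    # ... minus its expected contribution under the current model.
    expected = sum(
        probability(l, sentence, feature_functions, weights, tagset) * total_f(l)
        for l in itertools.product(tagset, repeat=len(sentence))
    )
    weights[j] += learning_rate * (empirical - expected)
```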
Finding the Optimal Labeling
Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we label it?
The naive way is to calculate $p(l \mid s)$ for every possible labeling $l$, and then choose the label that maximizes this probability. However, since there are $k^m$ possible labelings for a tag set of size $k$ and a sentence of length $m$, this approach would have to check an exponential number of candidates.
A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal label, similar to the Viterbi algorithm for HMMs.
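A sketch of that dynamic program in Python (a Viterbi-style recursion; score_at is a helper I’ve introduced that sums the weighted features firing at a single position):

```python
def best_labeling(sentence, feature_functions, weights, tagset):
    """argmax over labelings l of score(l | s), via optimal substructure."""

    def score_at(i, label, prev_label):
        # Weighted feature score contributed at position i alone.
        return sum(
            w * f(sentence, i, label, prev_label)
            for w, f in zip(weights, feature_functions)
        )

    n = len(sentence)
    # best[i][y] = (score of the best labeling of words 0..i ending in tag y,
    #               backpointer to the previous tag on that path)
    best = [{y: (score_at(0, y, None), None) for y in tagset}]
    for i in range(1, n):
        row = {}
        for y in tagset:
            # The best path ending in y extends the best path ending in some x.
            x = max(tagset, key=lambda xp: best[i - 1][xp][0] + score_at(i, y, xp))
            row[y] = (best[i - 1][x][0] + score_at(i, y, x), x)
        best.append(row)

    # Recover the highest-scoring labeling by following backpointers.
    tag = max(tagset, key=lambda y: best[n - 1][y][0])
    labeling = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        labeling.append(tag)
    return list(reversed(labeling))
```

This runs in time polynomial in the sentence length and tag set size (roughly O(n·k²) transition evaluations), versus the exponential brute force.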
A More Interesting Application
Okay, so part-of-speech tagging is kind of boring, and there are plenty of existing POS taggers out there. When might you use a CRF in real life?
Suppose you want to mine Twitter for the types of presents people received for Christmas:
What people on Twitter wanted for Christmas, and what they got: twitter.com/edchedch/statu…
— Edwin Chen (@edchedch) January 2, 2012
(Yes, I just embedded a tweet. BOOM.)
How can you figure out which words refer to gifts?
To gather data for the graphs above, I simply looked for phrases of the form “I want XXX for Christmas” and “I got XXX for Christmas”. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like “this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was ‘gave’” or “this word is a GIFT if the next two words are ‘for Christmas’”.
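As a rough illustration of that simple pattern-matching baseline (the actual patterns the author used aren’t shown, so this regex is my guess at the idea, not his code):

```python
import re

# Naive baseline: capture the XXX in "I want XXX for Christmas" or
# "I got XXX for Christmas" -- the CRF variant described above would
# replace this with learned GIFT/GIFT-GIVER/GIFT-RECEIVER tags.
GIFT_PATTERN = re.compile(r"\bI (?:want|got) (.+?) for Christmas\b", re.IGNORECASE)

def extract_gifts(tweet):
    return GIFT_PATTERN.findall(tweet)

print(extract_gifts("I got an iPad for Christmas!"))  # ['an iPad']
```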
Fin
I’ll end with some more random thoughts:
I explicitly skipped over the graphical models framework that conditional random fields sit in, because I don’t think they add much to an initial understanding of CRFs. But if you’re interested in learning more, Daphne Koller is teaching a free, online course on graphical models starting in January.
Or, if you’re more interested in the many NLP applications of CRFs (like part-of-speech tagging or named entity extraction), Manning and Jurafsky are teaching an NLP class in the same spirit.
I also glossed a bit over the analogy between CRFs:HMMs and Logistic Regression:Naive Bayes. This image (from Sutton and McCallum’s introduction to conditional random fields) sums it up, and shows the graphical model nature of CRFs as well:
[Figure: diagram from Sutton and McCallum relating Naive Bayes, Logistic Regression, HMMs, and linear-chain CRFs]