Sequence Labeling and BIO Encoding


1. Sequence Labeling

Sequence labeling is a meta-problem: it underlies many NLP tasks, and understanding it is essential to solving them. In sequence labeling, we want to assign a single label to each element in a sequence. Typically the sequence is a sentence and each element is a word. The labels we assign are usually things like POS (part-of-speech) tags, syntactic chunk labels (is this part of a noun phrase, a verb phrase, etc.), named entity labels (is this a person?) and so on. Information extraction tasks (e.g., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.

There are roughly two types of sequence labeling:

Raw labeling: each element gets its own tag, as in POS tagging.
Joint segmentation and labeling: a whole segment, possibly spanning several elements, gets a single label.

NER (named entity recognition) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Here we take NER as an example. A sentence like "Yesterday , George Bush gave a speech ." contains exactly one named entity ("George Bush"). Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to the individual words. This is what joint segmentation and labeling means.
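To make the difference concrete, here is a minimal sketch of how the two kinds of labeling might be stored for the example sentence. The POS tags are Penn Treebank-style tags assigned by hand for illustration, and the `(start, end, label)` span format and the `segment_text` helper are assumptions of this sketch, not something the sources prescribe.

```python
tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]

# Raw labeling: exactly one tag per token.
# (Hand-assigned Penn Treebank-style POS tags, for illustration only.)
pos_tags = ["NN", ",", "NNP", "NNP", "VBD", "DT", "NN", "."]

# Joint segmentation and labeling: one label per segment.
# Each segment is (start, end, label) with an exclusive end index.
segments = [(2, 4, "PERSON")]


def segment_text(tokens, segment):
    """Return the surface string and the label of one segment."""
    start, end, label = segment
    return " ".join(tokens[start:end]), label


for token, tag in zip(tokens, pos_tags):
    print(f"{token}\t{tag}")
print(segment_text(tokens, segments[0]))
```

In the raw case the label list always has the same length as the token list; in the segment case the number of labels depends on how the sentence is segmented.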

2. BIO Encoding

In fact, the easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word with "B-X", "I-X" or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" (that is, inside the phrase) and "O" means "not in any phrase" (that is, outside).
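The transformation from labeled segments to per-word tags can be sketched in a few lines. This is only an illustration under assumed conventions: the function name `spans_to_bio` and the `(start, end, label)` span format with an exclusive end index are hypothetical choices made for this example.

```python
def spans_to_bio(num_tokens, spans):
    """Convert labeled segments into one BIO tag per token.

    spans: iterable of (start, end, label) with an exclusive end index.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first word of the segment
        for i in range(start + 1, end):     # remaining words of the segment
            tags[i] = f"I-{label}"
    return tags


tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]
bio_tags = spans_to_bio(len(tokens), [(2, 4, "PERSON")])
print(list(zip(tokens, bio_tags)))
```

After this step every word has exactly one tag, so any tagger built for raw labeling can be applied to the segmentation problem unchanged.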

For instance, we can use BIO encoding for noun phrase (NP) chunking. Taking X to be NP, we define three tags:

B-NP: beginning of a noun phrase chunk
I-NP: inside of a noun phrase chunk
O: outside of a noun phrase chunk

The resulting tags are shown in the figure below.
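As a rough illustration of NP chunk tags, the earlier example sentence can be tagged token by token. The chunk boundaries below are a hand-made annotation assumed for this sketch, and `count_chunks` is a hypothetical helper. Because every chunk starts with exactly one B- tag, counting B- tags counts chunks.

```python
tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]

# Hand-made annotation (an assumption of this sketch):
# [Yesterday] , [George Bush] gave [a speech] .
np_tags = ["B-NP", "O", "B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]


def count_chunks(tags):
    """Each chunk contributes exactly one B- tag."""
    return sum(tag.startswith("B-") for tag in tags)


for token, tag in zip(tokens, np_tags):
    print(f"{token}\t{tag}")
print("chunks:", count_chunks(np_tags))
```

The B- tag is what keeps two adjacent chunks apart: without it, two noun phrases that touch would be indistinguishable from one long phrase.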

If we want to use BIO encoding in NER to identify all mentions of named entities (people, organizations, locations, dates), we define a pair of tags for each entity type:

B-PERS, B-DATE, ...: beginning of a mention of a person/date/...
I-PERS, I-DATE, ...: inside of a mention of a person/date/...
O: outside of any mention of a named entity

The resulting tags are shown in the figure below.
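Going the other way, a tagger's BIO output has to be turned back into entity mentions. The following is a minimal decoding sketch. The name `bio_to_spans`, the `(start, end, label)` output format, and the lenient handling of a stray I- tag (treated as the start of a new mention) are all assumptions of this example. The tag sequence is a hand-made annotation of the earlier sentence.

```python
def bio_to_spans(tags):
    """Recover (start, end, label) mentions from BIO tags (end is exclusive)."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any mention that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and label is not None and tag[2:] == label:
            continue  # still inside the current mention
        if label is not None:
            spans.append((start, i, label))  # close the current mention
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag with no matching B- opens a new mention
            # (a lenient choice made for this sketch).
            start, label = i, tag[2:]
    return spans


tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]
ner_tags = ["B-DATE", "O", "B-PERS", "I-PERS", "O", "O", "O", "O"]
for start, end, label in bio_to_spans(ner_tags):
    print(label, " ".join(tokens[start:end]))
```

Stricter decoders may instead reject or discard an I- tag that does not follow a matching B- tag; which policy is appropriate depends on the evaluation setup.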

References

[1] http://nlpers.blogspot.com.au/2006/11/getting-started-in-sequence-labeling.html

[2] https://courses.engr.illinois.edu/cs498jh/Slides/Lecture08HO.pdf