Sequence Labeling and BIO Encoding
1. Sequence Labeling
A meta problem is one that underlies several other problems, and understanding it is essential to solving them. Sequence labeling is a meta problem that comes up constantly in NLP tasks. In sequence labeling, we want to assign a single label to each element of a sequence. Typically the sequence is a sentence and each element is a word. The labels we assign are usually things like POS (part-of-speech) tags, syntactic chunk labels (is this part of a noun phrase, a verb phrase, etc.), named entity labels (is this a person?), and so on. Information extraction tasks (e.g., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.
There are roughly two types of sequence labeling:
- Raw labeling: each element gets its own tag, as in POS tagging.
- Joint segmentation and labeling: the sequence is divided into segments, and each whole segment gets a single label.
NER (named entity recognition) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, and locations, time expressions, quantities, monetary values, percentages, etc.
Take NER as an example. A sentence like "Yesterday , George Bush gave a speech ." contains exactly one named entity ("George Bush"). Here we want to assign the label "PERSON" to the entire phrase "George Bush", not to the individual words. This is what is meant by joint segmentation and labeling.
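The difference between the two types can be made concrete with a small sketch. The POS tags below are illustrative Penn Treebank style tags, and the `(start, end, label)` span format with an exclusive `end` is a convention assumed for this example, not something the text prescribes.

```python
# Tokens of the example sentence from the text.
tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]

# Raw labeling: exactly one tag per token (illustrative POS tags).
pos_tags = ["NN", ",", "NNP", "NNP", "VBD", "DT", "NN", "."]

# Joint segmentation and labeling: a label attaches to a whole segment.
# Each span is (start, end, label), with `end` exclusive (assumed convention).
entity_spans = [(2, 4, "PERSON")]


def span_text(tokens, span):
    """Return the text covered by a (start, end, label) span."""
    start, end, _label = span
    return " ".join(tokens[start:end])


for span in entity_spans:
    print(span[2], "->", span_text(tokens, span))
```

In the raw case the label list always has the same length as the token list, while in the segment case the number of labels depends on how many segments are found.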
2. BIO Encoding
The easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word with "B-X", "I-X", or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue (or be inside) a phrase of type X", and "O" means "not in a phrase" (outside any phrase).
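The transformation from labeled segments to per-word tags can be sketched as follows. The function name `spans_to_bio` and the `(start, end, label)` span format (with `end` exclusive) are assumptions made for this example; non-overlapping spans are also assumed.

```python
def spans_to_bio(n_tokens, spans):
    """Convert non-overlapping (start, end, label) spans into BIO tags.

    `end` is exclusive. Tokens not covered by any span are tagged "O".
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word begins the phrase
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words continue it
    return tags


tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]
tags = spans_to_bio(len(tokens), [(2, 4, "PERSON")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

After this step every word has exactly one tag, so any raw labeling method can be applied to what was originally a segmentation problem.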
For instance, we can use BIO encoding for noun phrase (NP) chunking, which is closely related to POS tagging. Taking X to be NP, we define three new tags:
- B-NP: beginning of a noun phrase chunk
- I-NP: inside of a noun phrase chunk
- O: outside of a noun phrase chunk
Tagging a sentence with these labels gives the result shown in the figure below.
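As a concrete illustration of such a tagging, the sketch below uses a sentence chosen for this example (it is not taken from the figure) together with a small consistency check. The check assumes the strict reading of the scheme described above, in which every chunk must start with a B- tag; the helper name `is_valid_bio` is hypothetical.

```python
# An illustrative sentence with NP-chunk tags (assumed example).
tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "O", "O", "B-NP", "I-NP", "I-NP", "O"]


def is_valid_bio(tags):
    """Check that every I-X tag directly follows a B-X or I-X tag of the same type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            chunk_type = tag[2:]
            if prev not in (f"B-{chunk_type}", f"I-{chunk_type}"):
                return False
        prev = tag
    return True


for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
print("valid:", is_valid_bio(tags))
```

The B- tag is what allows two chunks to sit directly next to each other: the sequence "B-NP B-NP" marks two separate noun phrases, whereas "B-NP I-NP" marks one.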
If we want to use BIO encoding in NER to identify all mentions of named entities (people, organizations, locations, dates), we define many new tags:
- B-PERS, B-DATE, ...: beginning of a mention of a person, date, ...
- I-PERS, I-DATE, ...: inside of a mention of a person, date, ...
- O: outside of any mention of a named entity
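Going the other way, a tagged sentence can be decoded back into entity mentions. The following is a minimal sketch; the function name `bio_to_spans`, the `(start, end, label)` output format with `end` exclusive, and the choice to silently ignore an I- tag that does not continue an open mention are all assumptions of this example. Tagging "Yesterday" as a DATE is likewise only illustrative.

```python
def bio_to_spans(tags):
    """Decode BIO tags into a list of (start, end, label) spans, `end` exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any mention that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))   # close the open mention
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]         # open a new mention
        # An I- tag with no matching open mention is ignored here.
    return spans


tokens = ["Yesterday", ",", "George", "Bush", "gave", "a", "speech", "."]
tags = ["B-DATE", "O", "B-PERS", "I-PERS", "O", "O", "O", "O"]
for start, end, label in bio_to_spans(tags):
    print(label, "->", " ".join(tokens[start:end]))
```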