Week6-3,4Language Modelling1
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 08:12
Probabilistic language model
- Assign a probability to a sentence
$P(S) = P(w_1, w_2, \ldots, w_n)$
- Different from deterministic methods using CFG
- The probabilities of all possible sentences must sum to 1
Predicting the next word
Uses of LM
- Speech recognition
- P(recognize speech) > P(wreck a nice beach)
- Text generation
- P(three houses) > P(three house)
- Spelling correction
- P(my cat eats fish) > P(my xat eats fish)
- Machine translation
- P(the blue house) > P(the house blue)
- OCR
- …
Probability of a sentence
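One standard way to write this probability is the chain-rule decomposition. This is a sketch that assumes the usual formulation and reuses the notation of $P(S)$ above:

```latex
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```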
N-gram model
- Markov assumption: only look at limited history
- Unigram
- Bigram
- Trigram
- It is possible to go to higher orders (4-grams, 5-grams)
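As a sketch of what "limited history" means for each model, the conditional probability of word $w_i$ given everything before it can be approximated as follows. The index notation is an assumption carried over from $P(S)$ above:

```latex
\text{Unigram:} \quad P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i)
\text{Bigram:}  \quad P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})
\text{Trigram:} \quad P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})
```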
N-grams
- Shakespeare unigrams
- 29524 types, approx 900k tokens
- Bigrams
- 346097 types
- Sparse data: only 346097 of the $29524^2 \approx 8.7 \times 10^8$ possible bigrams (about 0.04%) are observed
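Type and token counts like the ones above can be computed for any tokenized text. The following is a minimal sketch on a toy sentence invented for illustration (not the Shakespeare corpus). The helper name `ngrams` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, invented for illustration only.
text = "to be or not to be that is the question"
tokens = text.split()  # naive whitespace tokenization (an assumption)

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

num_tokens = len(tokens)                      # running words
num_unigram_types = len(unigram_counts)       # distinct words
num_bigram_types = len(bigram_counts)         # distinct adjacent word pairs
possible_bigrams = num_unigram_types ** 2     # upper bound on bigram types

print("tokens:", num_tokens)
print("unigram types:", num_unigram_types)
print("bigram types:", num_bigram_types, "of", possible_bigrams, "possible")
```

Even in this tiny example only a fraction of the possible bigrams occur, and the gap widens quickly as the vocabulary grows.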
Estimation
We cannot estimate the conditional probability of a word given its full history directly, because the data are too sparse, so we have to use the Markov assumption.
MLE
Using training data
Unigram Example
- The word pizza appears 700 times in a corpus of $1 \times 10^7$ words
$P_{ML}(\text{pizza}) = \frac{700}{1 \times 10^7} = 7 \times 10^{-5}$
Bigram Example
- The word with appears 1000 times in the corpus
- the phrase with spinach appears 6 times
$P_{ML}(\text{spinach} \mid \text{with}) = \frac{\text{count}(\text{with spinach})}{\text{count}(\text{with})} = \frac{6}{1000} = 0.006$
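The two estimates above can be reproduced with a few lines of code. This is a minimal sketch: the function names `mle_unigram` and `mle_bigram` are hypothetical, and it assumes both examples refer to the same corpus of $1 \times 10^7$ words.

```python
from collections import Counter

def mle_unigram(word, unigram_counts, total_tokens):
    """P_ML(word) = count(word) / N."""
    return unigram_counts[word] / total_tokens

def mle_bigram(prev, word, bigram_counts, unigram_counts):
    """P_ML(word | prev) = count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        # Unseen history: the MLE is undefined; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Counts taken from the unigram and bigram examples in the notes.
unigram_counts = Counter({"pizza": 700, "with": 1000})
bigram_counts = Counter({("with", "spinach"): 6})
total_tokens = 10_000_000

p_pizza = mle_unigram("pizza", unigram_counts, total_tokens)
p_spinach_given_with = mle_bigram("with", "spinach", bigram_counts, unigram_counts)

print(p_pizza)               # unigram estimate for "pizza"
print(p_spinach_given_with)  # bigram estimate for "spinach" after "with"
```

Any bigram that never occurs in the training data gets probability 0 under this estimate, which is one reason the estimates depend so heavily on the training corpus.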
The estimates are domain-dependent and may not work well for other genres
N-grams and regular languages
- N-grams are just one way to represent weighted regular languages
Generative models
Engineering trick
- The MLE values are often on the order of $10^{-6}$ or less
- Multiplying 20 such values gives a number on the order of $10^{-120}$
- This leads to underflow
- Use (base 10) logarithms instead
- $10^{-6}$ becomes $-6$
- Use sums instead of products
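The log trick can be sketched as follows. The function names are hypothetical. Python floats are double precision, where $10^{-120}$ is still representable (it would already underflow in single precision), so the sketch also includes a longer sequence of 60 values, chosen here so that the direct product underflows.

```python
import math

def product_prob(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log10_prob(probs):
    """Sum base-10 log probabilities instead of multiplying."""
    return sum(math.log10(p) for p in probs)

short_seq = [1e-6] * 20  # the 20-value case from the notes
long_seq = [1e-6] * 60   # longer sequence, chosen to exceed double-precision range

print(product_prob(short_seq), log10_prob(short_seq))
print(product_prob(long_seq), log10_prob(long_seq))
```

The direct product of the longer sequence collapses to zero, while the sum of logs stays an ordinary number that can still be compared across candidate sentences.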