Week 6-5: Language Modelling 2
Smoothing
- If the vocabulary size is ∣V∣ = 1M, there are too many parameters to estimate even for a unigram model, let alone bigram and trigram models
- MLE assigns a probability of 0 to unseen data
- Smoothing (regularization)
- Reassign some probability mass to unseen events
How to model novel words?
- Distribute some of the probability mass to allow for novel events
Add-one (Laplace) smoothing
- Bigrams: P(wi ∣ wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + ∣V∣)
- Reassigns too much probability mass to unseen data
- Possible to add k instead of 1 (add-k smoothing)
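The add-k formula above can be sketched directly; this is a minimal illustration (the toy corpus and function name are my own, not from the notes), with k = 1 giving Laplace smoothing:

```python
from collections import Counter

def add_k_bigram_prob(bigrams, unigrams, vocab_size, w_prev, w, k=1.0):
    """Add-k smoothed bigram probability:
    P(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + k * |V|)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)                 # c(w)
bigrams = Counter(zip(corpus, corpus[1:])) # c(w_prev, w)
V = len(unigrams)                          # vocabulary size

p_seen = add_k_bigram_prob(bigrams, unigrams, V, "the", "cat")    # (2+1)/(3+6)
p_unseen = add_k_bigram_prob(bigrams, unigrams, V, "the", "ran")  # (0+1)/(3+6)
```

Note that the unseen bigram "the ran" now gets 1/9 of the mass conditioned on "the", which illustrates the criticism above: with a realistic ∣V∣ of 1M, the ∣V∣ term in the denominator dominates and far too much mass shifts to unseen events.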
Advanced smoothing
- Good-Turing
- Predict the probabilities of unseen events based on the probabilities of seen events
- Kneser-Ney
- Class-based n-grams
Good-Turing
- Actual count c
- Nc: number of n-grams that occur exactly c times in the corpus
- N0: number of n-grams that never occur in the corpus (c = 0)
- Revised count: c∗ = (c + 1) · Nc+1 / Nc
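The revised-count formula can be computed from a table of "frequencies of frequencies". A minimal sketch (the fallback for Nc+1 = 0 is a simplification; practical Good-Turing estimators smooth the Nc curve instead):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Revised counts c* = (c + 1) * N_{c+1} / N_c, where N_c is the
    number of distinct n-grams seen exactly c times."""
    freq_of_freq = Counter(ngram_counts.values())  # c -> N_c
    revised = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        # When N_{c+1} = 0 the estimate is undefined; fall back to the raw count
        revised[ngram] = (c + 1) * n_c1 / n_c if n_c1 > 0 else c
    return revised

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 3}
# N_1 = 3, N_2 = 1, N_3 = 1
revised = good_turing_counts(counts)
# bigrams seen once are discounted: c* = 2 * N_2 / N_1 = 2/3
```

The mass removed from low-count events (here, each once-seen bigram drops from 1 to 2/3) is what gets redistributed to the N0 unseen events.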
How do we deal with sparse data?
Backoff
- Back off to a lower-order n-gram model when the higher-order model is sparse
- The backoff weights are parameters that must be learned
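Learning proper backoff weights (as in Katz backoff) is involved; a deliberately simplified, unnormalized variant known as "stupid backoff" conveys the idea, discounting by a fixed factor α at each backoff step instead of learning the weights. A sketch (corpus and names are illustrative):

```python
from collections import Counter

def stupid_backoff_score(trigrams, bigrams, unigrams, total, w1, w2, w3, alpha=0.4):
    """Score s(w3 | w1 w2): use the trigram relative frequency if it was
    observed; otherwise back off to the bigram, then the unigram,
    multiplying by the fixed discount alpha at each step."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / total

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

s_seen = stupid_backoff_score(trigrams, bigrams, unigrams, total, "the", "cat", "sat")
s_backoff = stupid_backoff_score(trigrams, bigrams, unigrams, total, "cat", "the", "mat")
```

Because the scores are not renormalized, they are not true probabilities; that is the price paid for skipping the learned parameters.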
Interpolation
- If P′(wi ∣ wi−1, wi−2) is sparse:
- Use λ1 P′(wi ∣ wi−1, wi−2) + λ2 P′(wi ∣ wi−1) + λ3 P′(wi), with λ1 + λ2 + λ3 = 1
- See [Chen and Goodman, 1998]
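The interpolation formula can be sketched as below. The λ values here are illustrative placeholders; in practice they are tuned on held-out data (the approach studied in Chen and Goodman, 1998):

```python
from collections import Counter

def interpolated_prob(trigrams, bigrams, unigrams, total, w1, w2, w3,
                      lambdas=(0.6, 0.3, 0.1)):
    """lambda1 * P'(w3|w1,w2) + lambda2 * P'(w3|w2) + lambda3 * P'(w3),
    with the lambdas summing to 1. The values are hypothetical, not tuned."""
    l1, l2, l3 = lambdas
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    return l1 * p3 + l2 * p2 + l3 * p1

corpus = "a b c a b d".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
p = interpolated_prob(trigrams, bigrams, unigrams, len(corpus), "a", "b", "c")
```

Unlike backoff, which consults the lower-order model only when the higher-order count is missing, interpolation always mixes all three orders, so every prediction is cushioned by the more reliable lower-order statistics.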