Week 6-5 Language Modelling 2


Smoothing

  • If the vocabulary size is V=1M
    • Too many parameters to estimate, even for a unigram model
    • MLE assigns a probability of 0 to unseen events; the problem is even worse for bigrams and trigrams
  • Smoothing (regularization)
    • Reassign some probability mass to some unseen data

How to model novel words?
  • Distribute some of the probability mass to allow novel events

Add-one (Laplace) smoothing

  • Bigrams: $P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
  • Reassigns too much probability mass to unseen data
  • Possible to add $k$ instead of 1 (add-$k$ smoothing); a minimal sketch follows below
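As a rough illustration of the formula above, here is a minimal add-$k$ bigram estimator in Python. The function name `add_k_bigram_prob`, the toy corpus, and estimating $V$ from the corpus itself are all assumptions made for this sketch, not part of the original notes.

```python
from collections import Counter

def add_k_bigram_prob(corpus, k=1.0):
    """Return a function P(w_i | w_{i-1}) with add-k smoothing (k=1 is Laplace)."""
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    V = len(unigram_counts)  # assumption: vocabulary size estimated from this corpus

    def prob(prev, word):
        # (c(w_{i-1}, w_i) + k) / (c(w_{i-1}) + k*V)
        return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

    return prob

# Usage: an unseen bigram still gets a non-zero probability.
corpus = "the cat sat on the mat the cat ate".split()
P = add_k_bigram_prob(corpus, k=1.0)
print(P("the", "cat"))   # seen bigram
print(P("the", "dog"))   # unseen bigram, but probability > 0
```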

Advanced smoothing

  • Good-Turing
    • Try to predict the probabilities of unseen events based on the probabilities of seen events
  • Kneser-Ney
  • Class-based n-grams

Good-Turing


  • Actual count: $c$
  • $N_c$: total number of n-grams that occur exactly $c$ times in the corpus
  • $N$: total number of n-grams in the corpus
  • Revised count: $c^* = (c+1)\dfrac{N_{c+1}}{N_c}$ (see the sketch after this list)
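The revised-count formula can be computed directly from the frequency-of-frequencies table. The following sketch assumes a plain dictionary of n-gram counts as input and simply falls back to the raw count when $N_{c+1}=0$; real implementations smooth the $N_c$ curve instead.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Compute revised Good-Turing counts c* = (c+1) * N_{c+1} / N_c."""
    # N_c: how many distinct n-grams occur exactly c times
    freq_of_freqs = Counter(ngram_counts.values())

    revised = {}
    for ngram, c in ngram_counts.items():
        if freq_of_freqs[c + 1] > 0:
            revised[ngram] = (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]
        else:
            revised[ngram] = c  # assumption: fall back to the raw count
    return revised

# Example with toy bigram counts
corpus = "the cat sat on the mat the cat ate the fish".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
print(good_turing_counts(bigram_counts))
```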


How do we deal with sparse data?

Backoff

  • Back off to a lower-order n-gram model when the higher-order model is sparse
  • Learn the backoff parameters (weights); a simplified sketch follows below
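The sketch below illustrates the backoff idea only: it uses a single fixed scaling factor `alpha` (an assumption for illustration, in the spirit of "stupid backoff") rather than the properly discounted and normalized weights that Katz backoff learns, so the result is a score rather than a true probability distribution.

```python
from collections import Counter

def backoff_prob(corpus, alpha=0.4):
    """Simplified backoff: trigram if seen, else scaled bigram, else scaled unigram."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    N = len(corpus)

    def score(w2, w1, w):
        if trigrams[(w2, w1, w)] > 0:
            return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]
        if bigrams[(w1, w)] > 0:
            return alpha * bigrams[(w1, w)] / unigrams[w1]
        return alpha * alpha * unigrams[w] / N  # may be 0 for unknown words

    return score
```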

Interpolation

  • If $P(w_i \mid w_{i-1}, w_{i-2})$ is sparse:
    • Use $\lambda_1 P(w_i \mid w_{i-1}, w_{i-2}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (see the sketch after this list)
  • See [Chen and Goodman, 1998]
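A minimal sketch of linear interpolation under the assumptions that the three maximum-likelihood estimates are computed from a single corpus and that the lambda values are fixed by hand; in practice they are tuned on held-out data (e.g. by EM), as discussed in Chen and Goodman (1998).

```python
from collections import Counter

def interpolated_prob(corpus, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram MLEs (lambdas sum to 1)."""
    l1, l2, l3 = lambdas
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    N = len(corpus)

    def prob(w2, w1, w):
        # Guard against unseen contexts so each MLE term is well defined.
        p3 = trigrams[(w2, w1, w)] / bigrams[(w2, w1)] if bigrams[(w2, w1)] else 0.0
        p2 = bigrams[(w1, w)] / unigrams[w1] if unigrams[w1] else 0.0
        p1 = unigrams[w] / N
        return l1 * p3 + l2 * p2 + l3 * p1

    return prob
```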