Learning the principles of SentencePiece tokenization


GitHub code: https://github.com/google/sentencepiece

Training:

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --model_type=<type>

model_name is the output prefix: the trained model is saved as model_name.model and the vocabulary as model_name.vocab. The vocabulary size is set manually via vocab_size, and model_type selects one of four training algorithms: unigram (default), bpe, char, or word.
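The same training can also be driven from Python with the sentencepiece package. The snippet below is a minimal sketch assuming a recent version of the Python wrapper; corpus.txt is only a placeholder for your training text (one sentence per line):

import sentencepiece as spm

# train a BPE model; corpus.txt is a placeholder for the training corpus
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='model_2014_bpe',
    vocab_size=8000,
    model_type='bpe',
)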

Tokenization:

To tokenize a single sentence, the input takes the form:

echo "我在北京天安门广场" | spm_encode --model=model_2014_bpe.model 

▁我 在 北京 天安门 广场

To use a file as input:

spm_encode --model=model_2014_bpe.model msr_test.utf8 

This tokenizes the file msr_test.utf8.
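Equivalently, in Python (again a minimal sketch assuming a recent sentencepiece version; the exact pieces depend on the trained model):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='model_2014_bpe.model')
print(sp.encode('我在北京天安门广场', out_type=str))
# e.g. ['▁我', '在', '北京', '天安门', '广场'] for the model used above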

How BPE tokenization works; for the Python reference code see https://github.com/rsennrich/subword-nmt

Training:

Given training sentences such as 我 在 北京 天安门 广场 and 我 是 北京 人, extract all words (我, 在, 北京, 天安门, 广场, 是, 人) and count each word's frequency. The code is as follows:

from collections import Counter

def get_vocabulary(fobj, is_dict=False):
    """Read text and return dictionary that encodes vocabulary
    """
    vocab = Counter()
    for line in fobj:
        if is_dict:
            word, count = line.strip().split()
            vocab[word] = int(count)
        else:
            for word in line.split():
                vocab[word] += 1
    return vocab

Sort the vocabulary by frequency, sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True), which gives the words and frequencies: 我 2, 北京 2, 在 1, 天安门 1, 广场 1, 是 1, 人 1.
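A quick check on the two example sentences (an illustrative sketch; note that in the repo's learn_bpe.py each word is then further converted into a tuple of characters, with '</w>' appended to the last character, before pair statistics are computed):

import io

corpus = io.StringIO("我 在 北京 天安门 广场\n我 是 北京 人\n")
vocab = get_vocabulary(corpus)
# vocab['我'] == 2, vocab['北京'] == 2, all other words have count 1
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)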

For each word, generate pairs of adjacent characters; the end-of-word symbol /w is appended to every word, so even single-character words form a pair such as 我/w. This gives pairs like 我/w, 在/w, 北京, 京/w, 天安, 安门, 门/w. Each pair's weight is the frequency of the word it comes from, so we obtain the following pairs and weights: 我/w 2, 在/w 1, 北京 2, 京/w 2, 天安 1, 安门 1, 门/w 1, 广场 1, 场/w 1, 是/w 1, 人/w 1. The code that extracts the pairs from the vocabulary is:

from collections import defaultdict

def get_pair_statistics(vocab):
    """Count frequency of all symbol pairs, and create index"""

    # data structure of pair frequencies
    stats = defaultdict(int)

    # index from pairs to words
    indices = defaultdict(lambda: defaultdict(int))

    for i, (word, freq) in enumerate(vocab):
        prev_char = word[0]
        for char in word[1:]:
            stats[prev_char, char] += freq
            indices[prev_char, char][i] += 1
            prev_char = char

    return stats, indices

Compute the pairs with stats, indices = get_pair_statistics(sorted_vocab). stats maps each pair to its frequency, and indices maps each pair to the indices of the words containing it, together with the number of times the pair occurs in each of those words.
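A small sketch on the toy vocabulary, assuming each word has already been converted into a character tuple whose last character carries '</w>' (the repo's internal representation; /w above is shorthand for this symbol):

vocab = {('我</w>',): 2, ('在</w>',): 1, ('北', '京</w>'): 2,
         ('天', '安', '门</w>'): 1, ('广', '场</w>'): 1,
         ('是</w>',): 1, ('人</w>',): 1}
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
stats, indices = get_pair_statistics(sorted_vocab)
# stats: ('北', '京</w>') -> 2, ('天', '安') -> 1, ('安', '门</w>') -> 1, ('广', '场</w>') -> 1
# indices[('北', '京</w>')]: {index of ('北', '京</w>') in sorted_vocab: 1}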

After obtaining the pairs, take the pair in stats with the highest frequency (the more frequent a pair, the more likely it forms a word) and write that pair to the model file. The code is:

# excerpt from the main training loop in learn_bpe.py (the full loop is shown further below)
if stats:
    most_frequent = max(stats, key=lambda x: (stats[x], x))

if stats[most_frequent] < args.min_frequency:
    sys.stderr.write('no pair has frequency >= {0}. Stopping\n'.format(args.min_frequency))
    break
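Continuing the toy example (illustration only):

most_frequent = max(stats, key=lambda x: (stats[x], x))
# ('北', '京</w>'), frequency 2; ties are broken lexicographically by the pair itself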

Next, find which words the pair occurs in:

changes = replace_pair(most_frequent, sorted_vocab, indices)

Then, for each of those words, subtract the word's frequency from the counts of the other pairs it generates around the merged occurrence. For example, if the pair occurs in words2 with vocab[words2] = 3 and in words5 with vocab[words5] = 7, and words2 also generates pair1 and pair3 while words5 also generates pair6 and pair9, then stats[pair1] -= vocab[words2], stats[pair3] -= vocab[words2], stats[pair6] -= vocab[words5], stats[pair9] -= vocab[words5]. The code is:

for j, word, old_word, freq in changed:
    # find all instances of pair, and update frequency/indices around it
    i = 0
    while True:
        try:
            # find the index of the pair's first symbol in the old word
            i = old_word.index(first, i)
        except ValueError:
            break
        if i < len(old_word)-1 and old_word[i+1] == second:
            if i:
                prev = old_word[i-1:i+1]
                # for the pair immediately before the merged pair, subtract the word's frequency
                stats[prev] -= freq
                # and decrement the corresponding indices entry; j is the word index, prev the pair
                indices[prev][j] -= 1
            if i < len(old_word)-2:
                # don't double-count consecutive pairs
                if old_word[i+2] != first or i >= len(old_word)-3 or old_word[i+3] != second:
                    nex = old_word[i+1:i+3]
                    stats[nex] -= freq
                    indices[nex][j] -= 1
            i += 2
        else:
            i += 1

Then merge the pair inside the words that contain it. For example, if the pair is 天安 and the word is 天, 安, 门, 广, 场, the word becomes 天安, 门, 广, 场. Deriving pairs from this new word yields new pairs such as (天安, 门), which are added to stats, while the merged pair's own stats entry is set to 0 and its index entry is cleared.

stats[pair] = 0                    # set the merged pair's stats entry to 0
indices[pair] = defaultdict(int)   # and clear its indices entry
first, second = pair
new_pair = first + second

# ... still inside the loop over the changed words (word is the merged word, j its index):
i = 0
while True:
    # for the newly merged symbol: e.g. for a word made of 殊 . /w where the pair (., /w)
    # has the highest frequency, merging it gives the new word (殊, ./w); the resulting new
    # pair (殊, ./w) gets the old word's frequency added to stats and its indices incremented
    try:
        i = word.index(new_pair, i)
    except ValueError:
        break
    if i:
        prev = word[i-1:i+1]
        stats[prev] += freq
        indices[prev][j] += 1
    # don't double-count consecutive pairs
    if i < len(word)-1 and word[i+1] != new_pair:
        nex = word[i:i+2]
        stats[nex] += freq
        indices[nex][j] += 1
    i += 1

These two steps are performed by:

changes = replace_pair(most_frequent, sorted_vocab, indices)
# for the pairs generated by words that contain the merged pair, subtract the word
# frequency freq from stats and decrement the corresponding indices, then add the
# newly generated pairs to stats
update_pair_statistics(most_frequent, changes, stats, indices)
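As an illustration on the toy vocabulary, here is a sketch of the expected effect, assuming the pair ('天', '安') is the one being merged:

most_frequent = ('天', '安')                                   # suppose this pair is merged
changes = replace_pair(most_frequent, sorted_vocab, indices)   # ('天','安','门</w>') becomes ('天安','门</w>')
update_pair_statistics(most_frequent, changes, stats, indices)
stats[most_frequent] = 0
# afterwards: stats[('天', '安')] == 0, stats[('安', '门</w>')] drops from 1 to 0,
# and the new pair ('天安', '门</w>') appears in stats with frequency 1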

The number of merge operations (training steps) is args.symbols, but training stops early if stats is empty or if every remaining pair's frequency is below the minimum frequency:
big_stats = copy.deepcopy(stats)
# threshold is inspired by Zipfian assumption, but should only affect speed
# stats.values() are the pair frequencies
threshold = max(stats.values()) / 10
for i in range(args.symbols):
    if stats:
        most_frequent = max(stats, key=lambda x: (stats[x], x))

    # we probably missed the best pair because of pruning; go back to full statistics
    if not stats or (i and stats[most_frequent] < threshold):
        prune_stats(stats, big_stats, threshold)
        stats = copy.deepcopy(big_stats)
        most_frequent = max(stats, key=lambda x: (stats[x], x))
        # threshold is inspired by Zipfian assumption, but should only affect speed
        threshold = stats[most_frequent] * i/(i+10000.0)
        prune_stats(stats, big_stats, threshold)

    if stats[most_frequent] < args.min_frequency:
        sys.stderr.write('no pair has frequency >= {0}. Stopping\n'.format(args.min_frequency))
        break

    if args.verbose:
        sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))
    args.output.write('{0} {1}\n'.format(*most_frequent))
    changes = replace_pair(most_frequent, sorted_vocab, indices)
    # subtract freq from the stats of pairs affected by the merge, update their indices,
    # and add the newly generated pairs to stats
    update_pair_statistics(most_frequent, changes, stats, indices)
    stats[most_frequent] = 0   # set the merged pair's weight to 0
    if not i % 100:
        prune_stats(stats, big_stats, threshold)   # drop pairs whose weight is below threshold

Testing:

For an input sentence such as 我在北京天安门广场, split it into pairs of adjacent characters: 我在, 在北, 北京, 京天, 天安, 安门, 门广, 广场, 场/w. Look up each pair's weight in the model (pairs not in the model get weight inf), take the pair with the smallest weight, and merge it into a single token. Treating that token as one unit, re-split the whole sentence and repeat the process until none of the resulting pairs are in the model, at which point segmentation ends.
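The "weight" here is simply the rank at which a merge was learned: the training loop above writes one "first second" pair per line, and at test time apply_bpe.py turns that file into a dictionary from pair to line index, so earlier (more frequent) merges get smaller weights. A minimal sketch of that loading step (the file name is a placeholder):

# build the pair -> rank dictionary used by encode() below
bpe_codes = {}
with open('model_2014_bpe.codes', encoding='utf-8') as f:   # placeholder file name
    for rank, line in enumerate(f):
        if line.startswith('#'):        # skip a possible version/comment header
            continue
        first, second = line.split()
        bpe_codes.setdefault((first, second), rank)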

In the first pass, the pair with the smallest weight is 北京; merging it gives 我/在/北京/天/安/门/广/场. Splitting this again gives the pairs 我在, 在北京, 北京天, 天安, 安门, 门广, 广场. Looking these up again, none of them are in the model, so this is taken as the final result: 我 在 北京 天 安 门 广 场.

The segmentation code is as follows:

def encode(orig, bpe_codes, cache={}):
    """Encode word based on list of BPE merge operations, which are applied consecutively
    """
    if orig in cache:
        return cache[orig]

    word = tuple(orig) + ('</w>',)
    pairs = get_pairs(word)   # pairs of adjacent symbols

    while True:
        # look up each pair's weight in bpe_codes and take the pair with the smallest
        # weight; pairs not in bpe_codes get weight inf
        bigram = min(pairs, key=lambda pair: bpe_codes.get(pair, float('inf')))
        # keep merging the lowest-weight pair; stop once the lowest-weight pair is not
        # in bpe_codes, i.e. none of the remaining pairs are in bpe_codes
        if bigram not in bpe_codes:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
                new_word.extend(word[i:j])
                i = j
            except ValueError:
                new_word.extend(word[i:])
                break

            # merge the lowest-weight pair into a single new symbol
            if word[i] == first and i < len(word)-1 and word[i+1] == second:
                new_word.append(first+second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)

    # don't print end-of-word symbols
    if word[-1] == '</w>':
        word = word[:-1]
    elif word[-1].endswith('</w>'):
        word = word[:-1] + (word[-1].replace('</w>',''),)

    cache[orig] = word
    return word
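A usage sketch with a hypothetical merge table; get_pairs is the small helper from apply_bpe.py, reproduced here so the example is self-contained:

def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word (a tuple of symbols)."""
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

# hypothetical model that only learned the merge 北 + 京
bpe_codes = {('北', '京'): 0}
print(encode('我在北京天安门广场', bpe_codes))
# ('我', '在', '北京', '天', '安', '门', '广', '场')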

As this shows, the training data is critical for the BPE algorithm: 我在北京天安门广场 should ideally be segmented as 我 在 北京 天安门广场, but because the model learned from the training corpus contains no merges covering 天安门广场, the sentence is mis-segmented.