Character encoding error when loading a word2vec .bin vector file with gensim

Source: Internet. Editor: 程序博客网. Time: 2024/06/16 10:15

First, the problem I ran into.

I trained a set of word vectors and got a binary file, model.bin. To check how good the vectors in the bin file were, I loaded the model with gensim.

import gensim

# Load the model
model = gensim.models.KeyedVectors.load_word2vec_format('t8model.bin', binary=True)
print(model['word'])

Loading then failed with the following encoding error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

I looked it up and found this answer on Stack Overflow:

The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.

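This told me that the words stored in the model I was testing are not valid UTF-8. To check, I loaded an older model that had tested correctly before, and this time no encoding error appeared.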

That confirmed the cause of my problem: the words in the model are not UTF-8 encoded.
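The strict decoding described in the Stack Overflow answer can be reproduced with the standard library alone. This is only a minimal sketch: the bytes in `raw_word` and the helper `decode_word` are made up for illustration and are not taken from my model file.

```python
# Hypothetical raw bytes for one vocabulary word (an assumption, not read from
# the real model). 0xba cannot start a UTF-8 sequence.
raw_word = b"\xba\xc3"


def decode_word(raw: bytes, errors: str = "strict") -> str:
    """Decode a stored word as UTF-8 using the given error-handling mode."""
    return raw.decode("utf-8", errors=errors)


# Strict mode raises the same kind of error as in the traceback above.
strict_error = None
try:
    decode_word(raw_word)
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict:", exc)

# Lenient modes do not raise, but the original word is lost.
replaced = decode_word(raw_word, errors="replace")  # invalid bytes become U+FFFD
ignored = decode_word(raw_word, errors="ignore")    # invalid bytes are dropped
print("replace:", repr(replaced))
print("ignore:", repr(ignored))
```

gensim exposes the same choice through the `unicode_errors` argument of `load_word2vec_format` (default `'strict'`), so passing `unicode_errors='ignore'` or `'replace'` lets the file load. That only hides the symptom: the affected words come out damaged or empty, so the real fix is still to find out why the vocabulary was not written as UTF-8.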

The next step is to find out what caused this...
