Word Vector Source Code Walkthrough: (5.8) ngram2vec Source Code, counts2ppmi and Related Scripts

Source: Internet | Published by: 张逗张花 (Zhihu) | Edited by: 程序博客网 | Time: 2024/06/10 12:05

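Next we weight the co-occurrence matrix to obtain the PPMI matrix. The name counts2ppmi is not quite accurate: what this file actually produces is the PMI matrix. Possibly for the sake of uniformity, the toolkit writes PPMI everywhere it should say PMI. The counts2ppmi script in ngram2vec makes sensible use of scipy's sparse matrices: it builds the co-occurrence matrix from the counts file quickly and then weights it to get the PMI matrix. The code assumes that all (word, context, count) triples can be read into memory at once, so it may run out of memory.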

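import numpy as np
from scipy.sparse import csr_matrix, dok_matrix

# load_vocabulary is a helper defined elsewhere in the toolkit.
def read_counts_matrix(words_path, contexts_path, counts_path):
    wi, iw = load_vocabulary(words_path)  # read the center-word vocabulary
    ci, ic = load_vocabulary(contexts_path)  # read the context vocabulary
    counts_num = 0
    row = []  # row ids of the nonzero entries
    col = []  # column ids of the nonzero entries
    data = []  # values of the nonzero entries
    with open(counts_path) as f:
        print(str(counts_num // 1000**2) + "M counts processed.")
        for line in f:
            if counts_num % 1000**2 == 0:
                # "\x1b[1A" moves the cursor up one line, so the progress message is overwritten
                print("\x1b[1A" + str(counts_num // 1000**2) + "M counts processed.")
            word, context, count = line.strip().split()  # read one triple
            row.append(int(word))
            col.append(int(context))
            data.append(int(float(count)))
            counts_num += 1
    # Co-occurrence matrix in sparse (CSR) storage; the counts are already sorted, so this step is cheap
    counts = csr_matrix((data, (row, col)), shape=(len(wi), len(ci)), dtype=np.float32)
    return counts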
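
The memory concern mentioned above can be eased somewhat by converting the triples to sparse form in batches rather than collecting every triple in Python lists first. The following is only a minimal sketch of that idea and is not part of ngram2vec; the names read_counts_in_chunks, _add_block and chunk_size are hypothetical, and the sketch assumes the same "word_id context_id count" line format as the reader above. The final matrix still has to fit in memory.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def _add_block(counts, row, col, data):
    # Turn the buffered triples into a sparse block and add it to the total.
    block = coo_matrix((data, (row, col)), shape=counts.shape, dtype=np.float32)
    return counts + block.tocsr()


def read_counts_in_chunks(lines, n_words, n_contexts, chunk_size=1000000):
    """Accumulate "word_id context_id count" lines into a CSR matrix.

    At most `chunk_size` triples are held in Python lists at any time.
    """
    counts = csr_matrix((n_words, n_contexts), dtype=np.float32)
    row, col, data = [], [], []
    for line in lines:
        word, context, count = line.split()
        row.append(int(word))
        col.append(int(context))
        data.append(float(count))
        if len(data) >= chunk_size:
            counts = _add_block(counts, row, col, data)
            row, col, data = [], [], []
    if data:  # flush the last, partially filled buffer
        counts = _add_block(counts, row, col, data)
    return counts


# Tiny in-memory stand-in for a counts file; the pair (0, 1) appears twice.
sample = io.StringIO("0 1 2\n1 0 3\n0 1 4\n")
m = read_counts_in_chunks(sample, n_words=2, n_contexts=2, chunk_size=2)
print(m.toarray())
```

Repeated (word, context) pairs are summed, both inside a batch and across batches, which matches what csr_matrix does when it is given duplicate coordinates.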

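The rest of the script, which computes the PMI matrix, is the same as in hyperwords. Note that calc_pmi does not take a logarithm: for each nonzero cell it returns count(w, c) * sum_total / (sum_w * sum_c), which is the quantity inside the log of PMI.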

def calc_pmi(counts, cds):
    sum_w = np.array(counts.sum(axis=1))[:, 0]  # row sums: total count of each word
    sum_c = np.array(counts.sum(axis=0))[0, :]  # column sums: total count of each context
    if cds != 1:
        sum_c = sum_c ** cds  # context distribution smoothing
    sum_total = sum_c.sum()
    sum_w = np.reciprocal(sum_w)
    sum_c = np.reciprocal(sum_c)

    pmi = csr_matrix(counts)
    pmi = multiply_by_rows(pmi, sum_w)  # divide each row by its word count
    pmi = multiply_by_columns(pmi, sum_c)  # divide each column by its (smoothed) context count
    pmi = pmi * sum_total
    return pmi

def multiply_by_rows(matrix, row_coefs):
    # Left-multiplying by a diagonal matrix scales row i by row_coefs[i]
    normalizer = dok_matrix((len(row_coefs), len(row_coefs)))
    normalizer.setdiag(row_coefs)
    return normalizer.tocsr().dot(matrix)

def multiply_by_columns(matrix, col_coefs):
    # Right-multiplying by a diagonal matrix scales column j by col_coefs[j]
    normalizer = dok_matrix((len(col_coefs), len(col_coefs)))
    normalizer.setdiag(col_coefs)
    return matrix.dot(normalizer.tocsr())
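
To make the weighting concrete, here is a small worked example on a 2x2 count matrix. It is a sketch rather than the toolkit's code: the function name exp_pmi is hypothetical, and scipy.sparse.diags is swapped in for the dok_matrix/setdiag construction used above. It assumes every row and every column contains at least one count, so no division by zero occurs.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def exp_pmi(counts, cds=1.0):
    """Return count(w,c) * total / (count(w) * count(c)**cds) for each nonzero cell."""
    sum_w = np.asarray(counts.sum(axis=1)).ravel()
    sum_c = np.asarray(counts.sum(axis=0)).ravel() ** cds
    total = sum_c.sum()
    # Multiplying by diagonal matrices on the left/right scales rows/columns.
    scaled = diags(1.0 / sum_w) @ counts @ diags(1.0 / sum_c)
    return (scaled * total).tocsr()


counts = csr_matrix(np.array([[4.0, 0.0],
                              [2.0, 2.0]]))
ratio = exp_pmi(counts)

# Cross-check against the same formula written with dense arrays (cds = 1).
dense = counts.toarray()
expected = dense * dense.sum() / np.outer(dense.sum(axis=1), dense.sum(axis=0))
assert np.allclose(ratio.toarray(), expected)

# Taking the log of the stored (nonzero) values gives the actual PMI.
pmi = ratio.copy()
pmi.data = np.log(pmi.data)
print(ratio.toarray())
print(pmi.toarray())
```

Here the row sums are [4, 4], the column sums are [6, 2] and the total is 8, so the cell (1, 1) becomes 2 * 8 / (4 * 2) = 2 and its PMI is log 2.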

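The code that goes from PPMI to SVD is again essentially the same as in hyperwords. The name ppmi2svd is accurate here: the SVD is run on the PPMI matrix, which the representations package obtains by applying some simple processing to the stored PMI matrix.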

import numpy as np
from docopt import docopt
from sparsesvd import sparsesvd

# PositiveExplicit comes from the toolkit's representations package.
def main():
    args = docopt("""
    Usage:
        ppmi2svd.py [options] <ppmi> <output>

    Options:
        --dim NUM    Dimensionality of eigenvectors [default: 300]
        --neg NUM    Number of negative samples; subtracts its log from PMI [default: 1]
    """)

    ppmi_path = args['<ppmi>']
    output_path = args['<output>']
    dim = int(args['--dim'])
    neg = int(args['--neg'])

    # PPMI matrix: the PositiveExplicit class applies simple processing to the stored PMI matrix
    explicit = PositiveExplicit(ppmi_path, normalize=False, neg=neg)

    # Truncated SVD of the sparse PPMI matrix, keeping dim singular values
    ut, s, vt = sparsesvd(explicit.m.tocsc(), dim)

    np.save(output_path + '.ut.npy', ut)
    np.save(output_path + '.s.npy', s)
    np.save(output_path + '.vt.npy', vt)
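
The "simple processing" can be pictured roughly as follows. This is a sketch under stated assumptions, not the toolkit's implementation: it assumes PositiveExplicit behaves like its hyperwords counterpart (take the log of the stored values, subtract log(neg), clip negatives to zero), the function name to_shifted_ppmi is hypothetical, and scipy.sparse.linalg.svds stands in for sparsesvd so that the example only depends on scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def to_shifted_ppmi(ratio, neg=1):
    """Map stored exp(PMI) values to max(PMI - log(neg), 0)."""
    m = csr_matrix(ratio, dtype=np.float64, copy=True)
    m.data = np.log(m.data) - np.log(neg)
    m.data[m.data < 0] = 0  # keep only positive associations
    m.eliminate_zeros()     # drop the cells that were clipped to zero
    return m


# Made-up exp(PMI) values for three words and three contexts.
ratio = csr_matrix(np.array([[4.0 / 3.0, 0.0, 0.5],
                             [2.0 / 3.0, 2.0, 1.0],
                             [1.5, 0.25, 3.0]]))
ppmi = to_shifted_ppmi(ratio, neg=1)

# Truncated SVD; svds requires k to be smaller than both matrix dimensions.
u, s, vt = svds(ppmi, k=2)
print(ppmi.toarray())
print(u.shape, s.shape, vt.shape)
```

One difference to keep in mind: svds returns the left factor as u with one row per word, whereas sparsesvd returns it already transposed, which is why the script above saves it as ut.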
