Paper Reproduction | pointer-generator

Source: Internet. Editor: 程序博客网. Date: 2024/05/02 20:20

Paper code: https://github.com/becxer/pointer-generator/

I. Data (CNN, Daily Mail)
Data processing (code): https://github.com/becxer/cnn-dailymail/

The goal is to convert the dataset into binary form.

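1. Download the data
Download the two stories files, one for CNN and one for Daily Mail. From mainland China the download requires a VPN or proxy.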

Some of the files contain examples whose article text is missing; the newer version of the code removes these examples.
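The filtering described above can be illustrated with a minimal sketch. It assumes the usual .story layout, in which the article text comes first and each summary sentence follows an "@highlight" marker line. The function names split_story and has_article are hypothetical and are not taken from the repository.

```python
def split_story(text):
    """Split raw .story text into (article, highlights).

    Assumption: every non-blank line before the first "@highlight"
    marker is article text, and every non-blank, non-marker line
    after it is a summary sentence.
    """
    article_lines, highlights = [], []
    in_highlights = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no content
        if line.startswith("@highlight"):
            in_highlights = True
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights


def has_article(text):
    """True if the story contains any article text at all."""
    article, _ = split_story(text)
    return bool(article)
```

A story for which has_article returns False is the kind of example that would be skipped before writing the binary files.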

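2. Download Stanford CoreNLP (the latest version at the time of writing was 3.8.0, but it did not work when I tried it; version 3.7.0 is required)

Environment: Linux

We need Stanford CoreNLP to tokenize the data.
Add the following line to your .bash_profile: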

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

Replace /path/to/ with the path to the directory where you saved stanford-corenlp-full-2016-10-31.
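If CLASSPATH points at a file that does not exist, Java cannot load the tokenizer class, so it can help to check the variable before going further. The following is a minimal sketch; the helper name missing_classpath_entries is hypothetical, and the check only tests whether each entry exists on disk (wildcard entries such as dir/* would be reported as missing).

```python
import os


def missing_classpath_entries(classpath):
    """Return the CLASSPATH entries that do not exist on disk."""
    entries = [e for e in classpath.split(os.pathsep) if e]
    return [e for e in entries if not os.path.exists(e)]


if __name__ == "__main__":
    classpath = os.environ.get("CLASSPATH", "")
    if not classpath:
        print("CLASSPATH is not set")
    for entry in missing_classpath_entries(classpath):
        print("CLASSPATH entry not found:", entry)
```

Run it in a fresh shell after editing .bash_profile; no output means every entry exists.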

Check:
Run the following command:

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see the following output:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.

3. Process into .bin and vocab files

Run:

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

Replace /path/to/cnn/stories with the path where you saved the cnn/stories directory, and do the same for dailymail.

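The script does the following:

(1) It creates two directories, cnn_stories_tokenized and dm_stories_tokenized, containing tokenized versions of cnn/stories and dailymail/stories. This may take some time. You may see several "Untokenizable:" warnings from the Stanford Tokenizer; these appear to be related to Unicode characters.

(2) For each of all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are lowercased and written to the binary files train.bin, val.bin and test.bin. These are placed in the newly created finished_files directory. This also takes some time.

(3) In addition, a vocab file is generated from the training data and is also placed in finished_files.

(4) Finally, train.bin, val.bin and test.bin are split into chunks of 1000 examples each. The chunk files are saved in finished_files/chunked, for example train_000.bin, train_001.bin, …, train_287.bin. You can use either the single files or the chunks as input to the model. (See notes.)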
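The chunking step can be sketched in a few lines of Python. This is not the script's actual code. It assumes each example in a .bin file is stored as an 8-byte length prefix followed by the serialized bytes; that record layout is an assumption, so check it against make_datafiles.py. The function names are hypothetical.

```python
import os
import struct

CHUNK_SIZE = 1000  # examples per chunk, as described above


def read_records(path):
    """Yield raw records from a length-prefixed binary file (assumed layout)."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # end of file
            (length,) = struct.unpack("q", header)
            yield f.read(length)


def write_records(path, records):
    """Write records as an 8-byte length followed by the record bytes."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack("q", len(rec)))
            f.write(rec)


def chunk_file(in_path, out_dir, set_name, chunk_size=CHUNK_SIZE):
    """Split in_path into files named like train_000.bin, train_001.bin, ..."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths, buf = [], []
    for rec in read_records(in_path):
        buf.append(rec)
        if len(buf) == chunk_size:
            path = os.path.join(out_dir, "%s_%03d.bin" % (set_name, len(chunk_paths)))
            write_records(path, buf)
            chunk_paths.append(path)
            buf = []
    if buf:  # final, possibly shorter chunk
        path = os.path.join(out_dir, "%s_%03d.bin" % (set_name, len(chunk_paths)))
        write_records(path, buf)
        chunk_paths.append(path)
    return chunk_paths
```

For example, chunk_file("finished_files/train.bin", "finished_files/chunked", "train") would write one file per 1000 examples under these assumptions. Small chunks let a reader shuffle and stream from many files without loading one large file.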
