Python code that collects the SQuAD vocabulary into a word2id mapping and converts every word to its id
Source: Internet  Editor: 程序博客网  Date: 2024/06/16 20:04
import json
import collections

# Load the SQuAD v1.1 training set.
with open("train-v1.1.json", encoding="utf-8") as json_file:
    data = json.load(json_file)

# Pass 1: collect every whitespace-separated token from titles,
# contexts, questions and answer texts.
all_words = []
for paragraphs_title in data["data"]:
    all_words.extend(paragraphs_title["title"].split())
    paragraphs = paragraphs_title["paragraphs"]
    for context_qas in paragraphs:
        all_words.extend(context_qas["context"].split())
        qas = context_qas["qas"]
        for answers_question in qas:
            answers = answers_question["answers"]
            all_words.extend(answers_question["question"].split())
            if len(answers) > 1:
                # Report any question that has more than one answer.
                print(answers)
            for answerstart_text in answers:
                all_words.extend(answerstart_text["text"].split())

# Build word_to_id: sort by descending frequency, then alphabetically,
# so the most frequent word gets id 0.
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
words, _ = list(zip(*count_pairs))
word_to_id = dict(zip(words, range(len(words))))

# Pass 2: convert the data set to ids. The result is nested as
# data_vec[article][paragraph] = [context_vec, [[question_vec, answer_vec], ...]]
data_vec = []
for paragraphs_title in data["data"]:
    title = paragraphs_title["title"]
    paragraphs = paragraphs_title["paragraphs"]
    article_vec = []
    data_vec.append(article_vec)
    for context_qas in paragraphs:
        paragraphs_vec = []
        article_vec.append(paragraphs_vec)
        context_vec = []
        questions_answers = []
        paragraphs_vec.append(context_vec)
        paragraphs_vec.append(questions_answers)
        for word in context_qas["context"].split():
            context_vec.append(word_to_id[word])
        qas = context_qas["qas"]
        for answers_question in qas:
            question_answer = []
            questions_answers.append(question_answer)
            question_vec = []
            answer_vec = []
            question_answer.append(question_vec)
            question_answer.append(answer_vec)
            answers = answers_question["answers"]
            # Only the first answer of each question is converted.
            for word in answers[0]["text"].split():
                answer_vec.append(word_to_id[word])
            for word in answers_question["question"].split():
                question_vec.append(word_to_id[word])
print("!")  # marks the end of the run
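
The vocabulary step above can be tried without downloading train-v1.1.json. The following is only a minimal sketch: the `sample` record is a made-up miniature in the SQuAD v1.1 layout, and the helper names `collect_tokens` and `build_vocab` and the reverse mapping `id_to_word` are hypothetical additions that do not appear in the original script.

```python
import collections


def collect_tokens(dataset):
    """Gather whitespace-separated tokens from titles, contexts, questions and answers."""
    tokens = []
    for article in dataset["data"]:
        tokens.extend(article["title"].split())
        for paragraph in article["paragraphs"]:
            tokens.extend(paragraph["context"].split())
            for qa in paragraph["qas"]:
                tokens.extend(qa["question"].split())
                for answer in qa["answers"]:
                    tokens.extend(answer["text"].split())
    return tokens


def build_vocab(tokens):
    """Map each distinct token to an id; the most frequent token gets id 0."""
    counter = collections.Counter(tokens)
    # Descending count first, then alphabetical order so ties are deterministic.
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    return {word: idx for idx, (word, _) in enumerate(count_pairs)}


# Hypothetical toy record in the SQuAD v1.1 layout (not real data).
sample = {
    "data": [{
        "title": "Toy_Article",
        "paragraphs": [{
            "context": "the cat sat on the mat",
            "qas": [{
                "question": "Where did the cat sit?",
                "answers": [{"answer_start": 12, "text": "on the mat"}],
            }],
        }],
    }]
}

word_to_id = build_vocab(collect_tokens(sample))
# Reverse mapping, handy for checking that the conversion round-trips.
id_to_word = {idx: word for word, idx in word_to_id.items()}

context = sample["data"][0]["paragraphs"][0]["context"]
context_ids = [word_to_id[w] for w in context.split()]
decoded = " ".join(id_to_word[i] for i in context_ids)
print(context_ids)
print(decoded)
```

Because tokens are split on whitespace only, punctuation stays attached to words ("sit?" is its own vocabulary entry), and any word absent from the vocabulary would raise a KeyError on lookup. The same holds for the script above.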