[Graduation Project, Day 06] Corpus Processing: Approach

Source: Internet; Editor: 程序博客网; Time: 2024/05/16 07:50

Corpus Processing: Approach

At the document level, collect all word-count and word-context information, that is, extract the context information from the corpus.

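1. Load the stop-word dictionary as a set of word IDs:

a) Read in the object temp and load the word ID information needed for the stop-word dictionary:

HashMap<String, Integer>: langOneWordID

b) E:\Workspaces\extractLexicon1119\workfiles\langOneWordID

c) Stop-word dictionary:

HashSet<Integer>: langOneStopWords

--> Split the dictionary by line into String[] langOneWords and iterate over it:

lower-case each entry of langOneWords, then save its corresponding ID into langOneStopWords.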
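Step 1 can be sketched roughly as follows. This is only a minimal illustration, not the project's actual code: the class name StopWordLoader and the method toStopWordIds are hypothetical, and skipping stop words that have no entry in langOneWordID is an assumption the notes do not state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StopWordLoader {
    /**
     * Turns the raw stop-word text (one word per line) into a set of word IDs.
     * Assumption: words that are missing from langOneWordID are silently skipped.
     */
    static Set<Integer> toStopWordIds(String stopWordText, Map<String, Integer> langOneWordID) {
        Set<Integer> langOneStopWords = new HashSet<>();
        String[] langOneWords = stopWordText.split("\\R"); // split on any line break
        for (String word : langOneWords) {
            String key = word.trim().toLowerCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue; // ignore blank lines
            }
            Integer id = langOneWordID.get(key);
            if (id != null) {
                langOneStopWords.add(id);
            }
        }
        return langOneStopWords;
    }

    public static void main(String[] args) {
        // Toy data standing in for the real word-ID map and dictionary file.
        Map<String, Integer> langOneWordID = new HashMap<>();
        langOneWordID.put("the", 0);
        langOneWordID.put("of", 1);
        langOneWordID.put("corpus", 2);
        System.out.println(toStopWordIds("The\nOF\nunknown", langOneWordID));
    }
}
```

Storing IDs instead of strings means that every later stop-word check is a single integer lookup against the (wordID, posID) arrays.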

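2. Preprocess the language-one corpus

a) Process all of the annotation information to obtain the frequency of every word

i. Data structure: a list holding the (ID, count) information of every document in the corpus:

ArrayList<HashMap<Integer, Integer>>

ii. For the input corpus, fetch each document's information in turn and process it as follows:

1. Data structure: the (ID, count) pairs of the words in the current document:

HashMap<Integer, Integer> wordCountForCurrentDoc

2. Number of sentences in the document:

int countOfSent

(wordID, posID) pairs of the current document:

int[][] posInfo

3. Count the sentences [marker: posInfo[i][1] == 5 ?? (not yet confirmed)] and remove the stop words [iterate over posInfo and store every non-stop word in the HashMap wordCountForCurrentDoc].

4. Add the current document's word-count information currentWordCount and its sentence-count information sentNumForDocs, then write the sentence counts to disk:

E:\Workspaces\extractLexicon1119\workfiles\langOne\SentCountForDocs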
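The per-document pass in step 2 might look like the sketch below. The class name DocWordCounter and its method names are hypothetical. Treating POS ID 5 as the sentence marker is an assumption taken from the unconfirmed note in step 3, and so is counting every non-stop token, including the sentence-marker tokens themselves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

final class DocWordCounter {
    // Assumption: POS ID 5 marks a sentence boundary (flagged as uncertain in the notes).
    static final int SENT_END_POS_ID = 5;

    /** Counts the tokens whose POS ID equals the assumed sentence marker. */
    static int countSentences(int[][] posInfo) {
        int countOfSent = 0;
        for (int[] token : posInfo) {
            if (token[1] == SENT_END_POS_ID) {
                countOfSent++;
            }
        }
        return countOfSent;
    }

    /** Builds the (wordID, count) map for one document, skipping stop words. */
    static HashMap<Integer, Integer> countWords(int[][] posInfo, Set<Integer> stopWords) {
        HashMap<Integer, Integer> wordCountForCurrentDoc = new HashMap<>();
        for (int[] token : posInfo) {
            int wordID = token[0];
            if (stopWords.contains(wordID)) {
                continue; // stop words are dropped from the counts
            }
            wordCountForCurrentDoc.merge(wordID, 1, Integer::sum);
        }
        return wordCountForCurrentDoc;
    }

    public static void main(String[] args) {
        // Each row is (wordID, posID); the values are made up for illustration.
        int[][] posInfo = {{1, 1}, {2, 2}, {9, 5}, {1, 3}, {3, 1}, {9, 5}};
        Set<Integer> stopWords = new HashSet<>(Arrays.asList(2));
        System.out.println(countSentences(posInfo));
        System.out.println(countWords(posInfo, stopWords));
    }
}
```

Calling both methods once per document and appending the results to the corpus-level lists yields the ArrayList<HashMap<Integer, Integer>> described in item i.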

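3. Extract context information from the corpus

a) Load the annotation information for the whole corpus:

b) Data structure: the (wordID, posID) annotation of all documents:

ArrayList<int[][]>: currentPosOfAllDocs

Stop-word IDs:

HashSet<Integer> currentStopWords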

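c) Write the context information out with a BufferedWriter:

i. Data structure: context information of the current document, (wordID, (contextWordID, contextWordCount)):

HashMap<Integer, HashMap<Integer, Integer>>: wordContextForCurrentDoc

ii. Data structure: (wordID, posID) of the current document:

int[][] posInfo (looked up by docID), where posID = posInfo[i][1]

iii. Process each sentence of the current document to obtain its context information:

1. Add the context word pair word1-word2 (update the context information)

2. Data structure: (wordID, contextInfo_count)

HashMap<Integer, Integer>: contextInfo

3. Size of the context window: contextWindow = 10

4. If the word being processed is not in currentStopWords and its POS ID posID is any of 1, 2, 3, 4, iterate over that word and the contextWindow words after it and save them in contextInfo

5. startOfSentInDoc: the start of a sentence

6. Data structure: (wordID, (contextWord, wordCount))

HashMap<Integer, int[][]>: newWordContextForCurrentDoc

[wordID gives contextCount, which gives contextWord and in turn wordCount]

7. Output the context information of the current document

StringBuilder:

docID word:WORDID_contextInfo[][0]:COUNT_contextInfo[][1]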
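The sentence-by-sentence context extraction in step iii can be sketched as below. The class name ContextExtractor and the helper names are hypothetical, and several points are assumptions the notes leave open: POS ID 5 ends a sentence, the window never crosses a sentence boundary, only the words after the current word are recorded as context, and context words are not filtered by stop word or POS. Conversion to int[][] and the output line format are left out because the notes do not pin them down.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ContextExtractor {
    static final int CONTEXT_WINDOW = 10;  // contextWindow from the notes
    static final int SENT_END_POS_ID = 5;  // assumption: POS ID 5 ends a sentence

    /** Builds wordID -> (contextWordID -> count) for one document. */
    static HashMap<Integer, HashMap<Integer, Integer>> extract(
            int[][] posInfo, Set<Integer> currentStopWords) {
        HashMap<Integer, HashMap<Integer, Integer>> wordContextForCurrentDoc = new HashMap<>();
        int startOfSentInDoc = 0;
        for (int i = 0; i < posInfo.length; i++) {
            boolean isMarker = posInfo[i][1] == SENT_END_POS_ID;
            boolean isLastToken = i == posInfo.length - 1;
            if (!isMarker && !isLastToken) {
                continue;
            }
            // The marker token itself is excluded from the sentence.
            int endExclusive = isMarker ? i : i + 1;
            addSentence(posInfo, startOfSentInDoc, endExclusive,
                    currentStopWords, wordContextForCurrentDoc);
            startOfSentInDoc = i + 1;
        }
        return wordContextForCurrentDoc;
    }

    /** Adds the word pairs of one sentence, posInfo[start..end), to the map. */
    private static void addSentence(int[][] posInfo, int start, int end,
            Set<Integer> stopWords, Map<Integer, HashMap<Integer, Integer>> out) {
        for (int i = start; i < end; i++) {
            int wordID = posInfo[i][0];
            int posID = posInfo[i][1];
            boolean contentWord = posID >= 1 && posID <= 4;
            if (stopWords.contains(wordID) || !contentWord) {
                continue;
            }
            int last = Math.min(end, i + 1 + CONTEXT_WINDOW);
            for (int j = i + 1; j < last; j++) {
                HashMap<Integer, Integer> contextInfo =
                        out.computeIfAbsent(wordID, k -> new HashMap<>());
                contextInfo.merge(posInfo[j][0], 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        // Each row is (wordID, posID); the values are made up for illustration.
        int[][] posInfo = {{1, 1}, {2, 2}, {3, 6}, {1, 1}, {9, 5}, {4, 3}, {1, 1}};
        Set<Integer> stopWords = new HashSet<>(Arrays.asList(2));
        System.out.println(extract(posInfo, stopWords));
    }
}
```

Keeping the inner map as HashMap<Integer, Integer> during counting makes updates cheap; it can be flattened into the int[][] form of newWordContextForCurrentDoc once the document is finished.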

0 0