Stanford Sentiment Treebank dataset
datasetSentences.txt format: sentence index<TAB>sentence content
datasetSplit.txt format: sentence index,split label (1 = train, 2 = test, 3 = dev)
train has 8544 sentences, test has 2210, and dev has 1101
dictionary.txt format: sentence (or phrase)|phrase index
sentiment_labels.txt format: phrase index|sentiment value
There are 239232 sentences and phrases in total
Sentiment value to class mapping: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] correspond to the five sentiment classes
The goal is to process this into one sentence per sentiment score, split into training, validation, and test sets. The result differs slightly from the original splits: the training set is missing 100 sentences, the validation set 8, and the test set 10, because some sentences in datasetSentences.txt spell names with special characters that fail to match their entries in dictionary.txt. You can also patch those cases by hand.
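The five-interval mapping above can be sketched as a small helper. This is not part of the original script; `score_to_class` is a hypothetical name, and the implementation simply assumes the cut-offs listed above with half-open upper bounds:

```python
import math

def score_to_class(score):
    """Map a sentiment value in [0, 1] to a class index 0-4 using the
    cut-offs [0, .2], (.2, .4], (.4, .6], (.6, .8], (.8, 1.0]."""
    # ceil handles the half-open upper bounds (0.2 -> class 0, 0.21 -> class 1);
    # max() keeps an exact score of 0.0 in class 0
    return max(0, math.ceil(score * 5) - 1)
```

For example, a score of 0.7 falls in (0.6, 0.8] and maps to class 3.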
The Python code is as follows:
def delblankline(infile1, infile2, trainfile, validfile, testfile):
    # NOTE: I got the split labels the wrong way round below --
    # 2 is test and 3 is dev, so valid.txt actually holds the test split.
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    train = open(trainfile, 'w')
    valid = open(validfile, 'w')
    test = open(testfile, 'w')
    lines1 = info1.readlines()
    lines2 = info2.readlines()
    for i in range(1, len(lines1)):  # start at 1 to skip the header line
        # restore the brackets encoded as -LRB- / -RRB-
        t1 = lines1[i].replace("-LRB-", "(")
        t2 = t1.replace("-RRB-", ")")
        k = lines2[i].strip().split(",")
        t = t2.strip().split('\t')
        if k[1] == '1':
            train.write(t[1] + "\n")
        elif k[1] == '2':
            valid.write(t[1] + "\n")
        elif k[1] == '3':
            test.write(t[1] + "\n")
    print("end")
    info1.close()
    info2.close()
    train.close()
    valid.close()
    test.close()
def tag(infile1, infile2, outputfile3):
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    info3 = open(outputfile3, 'w')
    lines1 = info1.readlines()
    lines2 = info2.readlines()
    text = {}
    # dictionary.txt: phrase|index -> build an index-to-phrase map
    for i in range(0, len(lines1)):
        s = lines1[i].strip().split("|")
        text[s[1]] = s[0]
    # sentiment_labels.txt: index|sentiment value (start at 1 to skip the header)
    for j in range(1, len(lines2)):
        k = lines2[j].strip().split("|")
        if k[0] in text:  # dict.has_key() no longer exists in Python 3
            info3.write(text[k[0]] + "\n")
            info3.write(k[1] + "\n")
    print("end2d1")
    info1.close()
    info2.close()
    info3.close()
def tag1(infile0, infile1, infile2, infile3, infile4, infile5, infile6):
    info0 = open(infile0, 'r')
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    info3 = open(infile3, 'r')
    info4 = open(infile4, 'w')
    info5 = open(infile5, 'w')
    info6 = open(infile6, 'w')
    lines0 = info0.readlines()
    # sets make the membership tests below O(1) instead of scanning a list
    set1 = set(info1.readlines())
    set2 = set(info2.readlines())
    set3 = set(info3.readlines())
    # allsentimet.txt alternates: sentence line, then its sentiment value
    for i in range(0, len(lines0), 2):
        if lines0[i] in set1:
            info4.write(lines0[i])
            info4.write(lines0[i + 1])
        if lines0[i] in set2:
            info5.write(lines0[i])
            info5.write(lines0[i + 1])
        if lines0[i] in set3:
            info6.write(lines0[i])
            info6.write(lines0[i + 1])
    print("end3d1")
    info0.close()
    info1.close()
    info2.close()
    info3.close()
    info4.close()
    info5.close()
    info6.close()
delblankline("datasetSentences.txt","datasetSplit.txt","train.txt","valid.txt","test.txt")
tag("dictionary.txt","sentiment_labels.txt","allsentimet.txt")
### everything works fine up to this point
tag1("allsentimet.txt","train.txt","valid.txt","test.txt","train1.txt","valid1.txt","test1.txt")
After processing, the resulting data files are train1.txt, valid1.txt, and test1.txt.
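The generated files use the alternating layout produced by tag1 (a sentence on one line, its sentiment value on the next). A minimal sketch of reading them back into (sentence, score) pairs, assuming that layout; `load_pairs` is a hypothetical helper, not part of the script above:

```python
def load_pairs(path):
    """Read a sentence/score file where odd lines are sentences
    and even lines are their sentiment values."""
    with open(path, 'r') as f:
        lines = [ln.rstrip('\n') for ln in f]
    # pair line i (sentence) with line i+1 (score)
    return [(lines[i], float(lines[i + 1]))
            for i in range(0, len(lines) - 1, 2)]
```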