The Stanford Sentiment Treebank dataset



datasetSentences.txt  format: sentence index <tab> sentence text

datasetSplit.txt  format: sentence index, which split the sentence belongs to (1 = train, 2 = test, 3 = dev)

train has 8544 sentences, test has 2210, and dev has 1101


dictionary.txt  format: sentence (or phrase) | phrase index

sentiment_labels.txt  format: phrase index | sentiment value

There are 239232 sentences and phrases in total.

The sentiment value maps to the five sentiment classes via the intervals [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
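That bucketing rule can be sketched in a few lines; `score_to_class` here is an illustrative helper, not part of the dataset's own tooling. A score equal to a bin edge falls into the lower bin, matching the half-open intervals above:

```python
import bisect

# Upper edges of the first four bins; bisect_left puts a score that equals
# an edge into the lower bin, i.e. [0, 0.2], (0.2, 0.4], ..., (0.8, 1.0].
EDGES = [0.2, 0.4, 0.6, 0.8]
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def score_to_class(score):
    """Map a sentiment score in [0, 1] to a class index 0-4."""
    return bisect.bisect_left(EDGES, score)

print(score_to_class(0.15), LABELS[score_to_class(0.15)])  # 0 very negative
```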


The goal is to turn this into one sentence per sentiment score, split into a training, validation, and test set. The result differs slightly from the official splits: the training set loses about 100 sentences, the validation set 8, and the test set 10, because some sentences in datasetSentences.txt contain names with special characters that fail to match dictionary.txt. You could also patch those entries by hand.
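If you want to see which sentences fall through, a quick exact-match check is enough; `unmatched` is a hypothetical helper shown here with toy data in place of the real two files:

```python
def unmatched(sentences, phrases):
    """Return the sentences that have no exact entry among the dictionary phrases."""
    phrase_set = set(phrases)            # O(1) lookups instead of scanning a list
    return [s for s in sentences if s not in phrase_set]

# Toy illustration: with the real data you would pass the sentence column of
# datasetSentences.txt and the phrase column of dictionary.txt.
dict_phrases = ["A fine film .", "fine film"]
sents = ["A fine film .", "Mr. Cl\u00e9ment smiles ."]  # accented name won't match
print(unmatched(sents, dict_phrases))
```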

The Python code is as follows:

def delblankline(infile1, infile2, trainfile, validfile, testfile):
    # Split the sentences into train/valid/test according to datasetSplit.txt
    # (1 = train, 2 = test, 3 = dev).
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    train = open(trainfile, 'w')
    valid = open(validfile, 'w')
    test = open(testfile, 'w')
    lines1 = info1.readlines()
    lines2 = info2.readlines()
    for i in range(1, len(lines1)):  # skip the header line
        # restore the PTB bracket tokens to real parentheses
        t2 = lines1[i].replace("-LRB-", "(").replace("-RRB-", ")")
        k = lines2[i].strip().split(",")
        t = t2.strip().split('\t')
        if k[1] == '1':
            train.write(t[1] + "\n")
        elif k[1] == '2':    # 2 = test
            test.write(t[1] + "\n")
        elif k[1] == '3':    # 3 = dev
            valid.write(t[1] + "\n")
    print("end")
    info1.close()
    info2.close()
    train.close()
    valid.close()
    test.close()


def tag(infile1, infile2, outputfile3):
    # Join dictionary.txt (phrase|index) with sentiment_labels.txt
    # (index|sentiment); write phrase and sentiment on alternating lines.
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    info3 = open(outputfile3, 'w')
    lines1 = info1.readlines()
    lines2 = info2.readlines()
    text = {}
    for i in range(0, len(lines1)):
        s = lines1[i].strip().split("|")
        text[s[1]] = s[0]            # index -> phrase
    for j in range(1, len(lines2)):  # skip the header line
        k = lines2[j].strip().split("|")
        if k[0] in text:             # dict.has_key() no longer exists in Python 3
            info3.write(text[k[0]] + "\n")
            info3.write(k[1] + "\n")
    print("end2d1")
    info1.close()
    info2.close()
    info3.close()
           
def tag1(infile0, infile1, infile2, infile3, infile4, infile5, infile6):
    # Route each (sentence, sentiment) pair in allsentimet.txt to the split
    # whose sentence file contains the sentence.
    info0 = open(infile0, 'r')
    info1 = open(infile1, 'r')
    info2 = open(infile2, 'r')
    info3 = open(infile3, 'r')
    info4 = open(infile4, 'w')
    info5 = open(infile5, 'w')
    info6 = open(infile6, 'w')
    lines0 = info0.readlines()
    trainset = set(info1.readlines())  # sets make the membership tests O(1)
    validset = set(info2.readlines())
    testset = set(info3.readlines())
    for i in range(0, len(lines0), 2):
        if lines0[i] in trainset:
            info4.write(lines0[i])
            info4.write(lines0[i + 1])
        if lines0[i] in validset:
            info5.write(lines0[i])
            info5.write(lines0[i + 1])
        if lines0[i] in testset:
            info6.write(lines0[i])
            info6.write(lines0[i + 1])
    print("end3d1")
    info0.close()
    info1.close()
    info2.close()
    info3.close()
    info4.close()
    info5.close()
    info6.close()


delblankline("datasetSentences.txt","datasetSplit.txt","train.txt","valid.txt","test.txt")
tag("dictionary.txt","sentiment_labels.txt","allsentimet.txt")
### everything works up to this point
tag1("allsentimet.txt","train.txt","valid.txt","test.txt","train1.txt","valid1.txt","test1.txt")

The processed data ends up in train1.txt, valid1.txt, and test1.txt.
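Since those files alternate one sentence line with one score line, reading them back might look like this; `load_pairs` is a hypothetical helper name, not something the scripts above define:

```python
def load_pairs(path):
    """Read an alternating sentence/score file (the train1.txt layout)
    into a list of (sentence, score) tuples."""
    with open(path, encoding="utf-8") as f:
        lines = [ln.rstrip("\n") for ln in f]
    # even lines are sentences, odd lines are their sentiment scores
    return [(lines[i], float(lines[i + 1]))
            for i in range(0, len(lines) - 1, 2)]

# e.g. pairs = load_pairs("train1.txt")
```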
