36. Sample Labeling and Feature Extraction for WeChat Official Account Articles

Source: Internet. Posted by: excel统计不同数据个数. Editor: 程序博客网. Time: 2024/06/05 06:35

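Using the web-based sample labeling system built in the previous section, we label the samples by hand. We then walk through how to mine and analyze the labeled samples and, based on that analysis, extract the most representative features for the training that follows.

Please respect original work: when reposting, credit the source site www.shareditor.com and the original link.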

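Approaches to multi-class classification

Approach 1: train a series of binary classifiers and combine them into a multi-class classifier.

Approach 2: merge the parameters of all the decision surfaces into a single optimization problem.

We use Approach 1 and solve several binary classification problems separately.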
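Approach 1 is commonly called one-vs-rest. The combining step can be sketched roughly as follows, assuming each binary classifier returns a score where higher means "more likely to belong to this class". The helper names and keyword lists are hypothetical stand-ins for illustration, not the classifiers trained later in this series.

```python
def one_vs_rest_predict(text, scorers):
    """Return the label whose binary scorer gives the highest score.

    scorers maps a class label to a function text -> float.
    """
    return max(scorers, key=lambda label: scorers[label](text))


def keyword_scorer(keywords):
    """Hypothetical binary scorer: counts keyword occurrences in the text."""
    def score(text):
        return sum(text.count(k) for k in keywords)
    return score


# Toy scorers, one per category; the keywords are made up for illustration.
scorers = {
    "isTec": keyword_scorer(["code", "python", "server"]),
    "isSoup": keyword_scorer(["dream", "life", "heart"]),
    "isNews": keyword_scorer(["today", "announced", "report"]),
}

print(one_vs_rest_predict("python code runs on the server", scorers))
```

Because every category gets its own scorer, a new category can be added without retraining the others, which is one reason this approach is convenient for hand-labeled data.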

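Manual labeling

This step relies entirely on personal judgment, labeling the articles one by one. If an article is judged to be purely technical, isTec is set to yes; if it is judged to be "chicken soup" (feel-good inspirational) writing, isSoup is set to yes; the other two categories work the same way.

After nearly an hour of purely manual labeling, the final article count per category is: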

select sum(isTec), sum(isSoup), sum(isMR), sum(isNews) from CrawlPage;

sum(isTec)    sum(isSoup)    sum(isMR)    sum(isNews)
31            98             69           240

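Word segmentation and saving

Next I segment every article in these four categories into words. To make debugging easier, we save the intermediate segmentation result in the database, so repeated debugging runs do not have to segment the text again. To do this, we add the following field to the PHP CrawlPage entity: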

/**
 * @var text
 * @ORM\Column(name="segment", type="text", nullable=true)
 */
private $segment;

Run

php app/console doctrine:schema:update --force

after which the table has one additional column:

  `segment` longtext COLLATE utf8_unicode_ci,

Create our feature_extract.py with the following content:

# coding:utf-8
import sys

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import jieba
from jieba import analyse
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="myuser", passwd="mypasswd",
                       db="mydatabase", charset="utf8")


def get_segment():
    cursor = conn.cursor()
    sql = "select id, content from CrawlPage"
    cursor.execute(sql)
    # Note: this configures jieba.analyse (keyword extraction);
    # it does not filter the output of jieba.cut below
    jieba.analyse.set_stop_words("stopwords.txt")
    for result in cursor.fetchall():
        page_id = result[0]
        content = result[1]
        # Segment the article and join the tokens with spaces
        seg_list = jieba.cut(content)
        line = ""
        for word in seg_list:
            line = line + " " + word
        # Replace single quotes so they cannot break the SQL string literal
        line = line.replace('\'', ' ')
        sql = "update CrawlPage set segment='%s' where id=%d" % (line, page_id)
        try:
            cursor.execute(sql)
            conn.commit()
        except Exception as e:
            print(line)
            print(e)
            sys.exit(-1)
    conn.close()


if __name__ == '__main__':
    get_segment()


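Here we segment every article and store the segmented result in the segment column.

Note: to avoid SQL syntax problems, the single quotes (') in the article text have to be removed. I replace them with spaces, which also keeps the tokens easy to separate.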
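An alternative to stripping quotes is to let the database driver bind the values through a parameterized query. The sketch below uses Python's built-in sqlite3 with an in-memory table as a stand-in for the MySQL CrawlPage table (an assumption made so the example runs anywhere, with only the two columns the demo needs). With MySQLdb the placeholder would be %s rather than ?, passed the same way as a second argument to cursor.execute.

```python
import sqlite3

# In-memory stand-in for the CrawlPage table (assumption: only id and segment).
conn = sqlite3.connect(":memory:")
conn.execute("create table CrawlPage (id integer primary key, segment text)")
conn.execute("insert into CrawlPage (id) values (1)")

# Segmented text that still contains a single quote.
line = "it 's a test"

# The driver binds the values, so the quote needs no manual replacement.
conn.execute("update CrawlPage set segment = ? where id = ?", (line, 1))
conn.commit()

stored = conn.execute("select segment from CrawlPage where id = 1").fetchone()[0]
print(stored)
```

With bound parameters the stored text keeps its original punctuation, and the update no longer depends on what characters an article happens to contain.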

Computing tf-idf

Continue editing feature_extract.py and add the following:

def feature_extract():
    cursor = conn.cursor()
    category = {}
    category[0] = 'isTec'
    category[1] = 'isSoup'
    category[2] = 'isMR'
    category[3] = 'isMath'
    category[4] = 'isNews'
    corpus = []
    # Build one large document per category by concatenating its segmented articles
    for index in range(0, 5):
        sql = "select segment from CrawlPage where " + category[index] + "=1"
        cursor.execute(sql)
        line = ""
        for result in cursor.fetchall():
            segment = result[0]
            line = line + " " + segment
        corpus.append(line)
    conn.commit()
    conn.close()

    # Term counts per category, then tf-idf weights
    vectorizer = CountVectorizer()
    csr_mat = vectorizer.fit_transform(corpus)
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(csr_mat)
    # On scikit-learn 1.0 and later, use get_feature_names_out() instead
    word = vectorizer.get_feature_names()
    print(tfidf.toarray())


if __name__ == '__main__':
    # get_segment()
    feature_extract()

Running it prints:

[[ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.00670495  0.00101195  0.00453306 ...,  0.          0.          0.        ]
 [ 0.          0.00164081  0.         ...,  0.          0.          0.        ]
 [ 0.01350698  0.0035783   0.         ...,  0.0003562   0.0003562  0.00071241]]
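Each row of this matrix corresponds to one category and each column to one word in the vocabulary. To make the numbers less opaque, here is a small pure-Python sketch of the same calculation under scikit-learn's default settings as I understand them: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row. The function name tfidf_rows and the toy corpus are made up for illustration, and the sketch splits on whitespace only, whereas CountVectorizer's default tokenizer also lowercases text and drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(corpus):
    """Compute L2-normalized tf-idf rows for whitespace-separated documents."""
    counts = [Counter(doc.split()) for doc in corpus]
    vocab = sorted(set(w for c in counts for w in c))
    n_docs = len(corpus)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed idf, following TfidfTransformer's default formula.
    idf = {w: math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0 for w in vocab}
    rows = []
    for c in counts:
        raw = [c[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy corpus: one already-segmented "document" per category.
corpus = ["python code code server", "dream life heart life", "python news report"]
vocab, rows = tfidf_rows(corpus)
print(vocab)
for row in rows:
    print(["%.3f" % v for v in row])
```

A word that is frequent within one category but rare across the others ends up with a large weight in that category's row, which is what makes it a useful feature.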

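Feature extraction

We treat each category as its own binary classification problem, so we extract features separately for each of the 5 categories. The method is to take the n features with the largest tf-idf values in each category. First, we write out all of the features: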

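    # Requires "import io" and "import numpy as np" at the top of the file
    for index in range(0, 5):
        row = tfidf.toarray()[index]
        f = io.open("tfidf_%d" % index, "w", encoding="utf-8")
        # Visit features from the largest tf-idf weight to the smallest
        for i in np.argsort(-row):
            if row[i] > 0:
                f.write(u"%f %s\n" % (row[i], word[i]))
        f.close()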

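The output is already sorted by tf-idf from largest to smallest, so the first n lines of each of the 5 generated files give us the n features we need.

In the next section we will use the extracted features to run tests on test samples.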
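Reading the top n features back could look like the sketch below. The helper name top_n_features is hypothetical, and the demo writes a small made-up file in the same "weight word" format so that it runs without the real tfidf_* output files.

```python
import io
import os
import tempfile


def top_n_features(path, n):
    """Return the words from the first n lines of a 'weight word' file."""
    features = []
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            if len(features) >= n:
                break
            # Each line is "<weight> <word>"; keep only the word.
            parts = line.split(None, 1)
            if len(parts) == 2:
                features.append(parts[1].strip())
    return features


# Demo with made-up weights, already sorted from largest to smallest.
demo_path = os.path.join(tempfile.mkdtemp(), "tfidf_demo")
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"0.412000 python\n0.305000 server\n0.120000 code\n")

print(top_n_features(demo_path, 2))
```

Keeping n as a parameter makes it easy to try different feature counts per category later without regenerating the files.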

0 0