Kaggle_news_stock: Simple Text Feature Processing

Source: Internet  Editor: 程序博客网  Date: 2024/06/03 13:32

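Summary:

This is a DJIA movement prediction problem from https://www.kaggle.com/aaron7sun/stocknews, which in practice is a binary classification problem.

It is also a text classification problem: the features are text.

Basic approach: TF-IDF + SVM, a standard baseline for text classification.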


Data exploration

import pandas as pd
import numpy as np
from sklearn.svm import SVC

# Load the combined headline/DJIA table; the later snippets refer to it as df
df = pd.read_csv('Combined_News_DJIA.csv')
df.head(10)



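# Combine the Top* headline columns of each row (axis=1 works across the row).
# str(x.values) is the string form of the row's array, so the brackets around it end up in the text.
df['combined_news'] = df.filter(regex=('Top.*')).apply(lambda x: ''.join(str(x.values)), axis=1)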


Since we need all of the headlines, we concatenate the headline text of each sample into a single string.
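As a side note, here is a minimal sketch of the same idea that joins the headlines with spaces instead of going through str(x.values), so no array brackets end up in the text. The toy frame and its values are made up for illustration, and `headline_cols` is just a local helper name:

```python
import pandas as pd

# Toy frame standing in for Combined_News_DJIA.csv (values are made up).
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2015-01-02"],
    "Label": [0, 1],
    "Top1": ["Oil prices fall", "Markets rally"],
    "Top2": ["Storm hits coast", "New trade deal signed"],
})

# Select only the headline columns, then join each row's values with a space.
headline_cols = toy.filter(regex=r"^Top\d+$")
toy["combined_news"] = headline_cols.apply(
    lambda row: " ".join(str(v) for v in row.values), axis=1
)

print(toy["combined_news"].tolist())
```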

In a real setting the data has to be split. There is no separate test set, so we split the dataset ourselves by date:

train_df = df[df['Date'] < '2015-01-01']
test_df = df[df['Date'] > '2014-12-31']
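The string comparison above works because dates written as YYYY-MM-DD sort in chronological order. A minimal sketch of the same split with parsed dates, assuming that date format; the toy frame and the names `toy`, `cutoff`, `train_toy` and `test_toy` are made up for illustration:

```python
import pandas as pd

# Toy rows standing in for the real table (values are made up).
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [0, 1, 1, 0],
})

# Parse the strings so the comparison is on real timestamps.
toy["Date"] = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2015-01-01")

# Everything before the cutoff is training data, the rest is test data.
train_toy = toy[toy["Date"] < cutoff]
test_toy = toy[toy["Date"] >= cutoff]

print(len(train_toy), len(test_toy))
```

Splitting by time rather than at random keeps later headlines out of the training set, which matters for a forecasting task.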


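Next we need the concept of TF-IDF:

TF-IDF is a statistical measure of how important a word is to one document within a collection of documents or a corpus.

A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus. Variants of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.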
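To make the definition concrete, here is a small hand-rolled sketch using the textbook form tf(t, d) = count(t in d) / len(d) and idf(t) = log(N / df(t)). The three toy documents are made up, and sklearn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ from these:

```python
import math
from collections import Counter

# Three tiny tokenized "documents" (made up for illustration).
docs = [
    "stocks fall as oil prices fall".split(),
    "oil prices rise".split(),
    "stocks rally".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(N / number of documents containing the term);
    # the term must occur in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("fall", "oil"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document, "fall" appears twice and in no other document, while "oil" appears once and also in the second document, so "fall" gets the higher weight.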


We use the text feature extraction that sklearn provides:


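from sklearn.feature_extraction.text import TfidfVectorizer

def extractFeature(train_df, test_df):
    # Fit the vocabulary and IDF weights on the training text only,
    # then reuse them to transform the test text.
    feature_extraction = TfidfVectorizer()
    train_X = feature_extraction.fit_transform(train_df['combined_news'].values)
    test_X = feature_extraction.transform(test_df['combined_news'].values)
    return train_X, test_X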
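A quick sketch of what fit_transform and transform do here, on made-up sentences: the vocabulary is learned from the training text only, so a word that appears only in the test text (here "surge") gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration.
train_texts = ["oil prices fall", "stocks rally on trade deal"]
test_texts = ["oil stocks surge"]

vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train_texts)  # learns vocabulary + IDF
test_X = vectorizer.transform(test_texts)        # reuses them unchanged

print(train_X.shape, test_X.shape)
print("surge" in vectorizer.vocabulary_)
```

Both matrices have the same number of columns, which is what lets a model trained on train_X score test_X.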

Now we can start training and testing the model:

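from sklearn import metrics

# Feature matrices, plus the targets taken from the dataset's 'Label' column
train_X, test_X = extractFeature(train_df, test_df)
train_y = train_df['Label'].values
test_y = test_df['Label'].values

# Train. probability=True makes the binary SVC expose class probabilities instead of only the sign-based decision
clf = SVC(probability=True, kernel='rbf')
clf.fit(train_X, train_y)
predictions = clf.predict_proba(test_X)[:, 1]  # one column per class; keep the probability of class 1

# AUC of the predicted probabilities
metrics.roc_auc_score(test_y, predictions)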


As we can see, the score is fairly low. So how can we improve it? Clearly, feeding the raw text straight into TF-IDF is not enough.
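For reference, the baseline can also be condensed into a single sklearn pipeline. This is only a sketch on made-up sentences and labels, and it scores with decision_function rather than predict_proba (roc_auc_score only needs a ranking score, so either works); that swap is my assumption, not the setup used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up headlines and labels (1 = index up, 0 = index down).
train_texts = [
    "stocks rally on strong earnings",
    "markets surge after trade deal",
    "shares climb as oil steadies",
    "stocks plunge on weak earnings",
    "markets tumble amid war fears",
    "shares sink as oil crashes",
]
train_y = [1, 1, 1, 0, 0, 0]
test_texts = ["stocks surge on trade deal", "markets plunge amid fears"]
test_y = [1, 0]

# TF-IDF features feed straight into an RBF-kernel SVM.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))
model.fit(train_texts, train_y)

# Signed distance to the decision boundary, one score per test document.
scores = model.decision_function(test_texts)
auc = roc_auc_score(test_y, scores)
print(auc)
```

Wrapping both steps in one pipeline guarantees the vectorizer is fitted on the training text only, even under cross-validation.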



Feature engineering

# Lowercase, strip quote characters, then split on whitespace into a list of tokens
df['combined_news'] = df['combined_news'].str.lower().str.replace('"', '').str.replace("'", '').str.split()
  • Remove stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')

  • Remove numbers
import re

def hasNumbers(inputString):
    # True if the token contains at least one digit
    return bool(re.search(r'\d', inputString))
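We use nltk's stop word list, together with the digit check, to filter out these tokens:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # set lookup is faster than scanning a list

def check(word):
    # Keep a token only if it is not a stop word and contains no digit
    return word not in stop and not hasNumbers(word)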

This gives the following result:
['[', 'case', 'cancer', 'result', 'sheer', 'bad', 'luck', 'rather', 'unhealthy', 'lifestyles,', 'diet', 'even', 'inherited', 'genes,', 'new', 'research', 'suggests.', 'random', 'mutation', 'occur', 'dna', 'cell', 'divide', 'responsible', 'two', 'third', 'adult', 'cancer', 'across', 'wide', 'range', 'tissues.', 'iran', 'dismissed', 'united', 'state', 'effort', 'fight', 'islamic', 'state', 'ploy', 'advance', 'u.s.', 'policy', 'region:', 'reality', 'united', 'state', 'acting', 'eliminate', 'daesh.', 'even', 'interested', 'weakening', 'daesh,', 'interested', 'managing', 'poll:', 'one', 'german', 'would', 'join', 'anti-muslim', 'march', 'uk', 'royal', 'family', 'prince', 'andrew', 'named', 'u', 'lawsuit', 'underage', 'sex', 'allegation', 'asylum-seekers', 'refused', 'leave', 'bus', 'arrived', 'destination', 'rural', 'northern', 'sweden,', 'demanding', 'taken', 'back', 'malm', 'big', 'city.', 'pakistani', 'boat', 'blow', 'self', 'india', 'navy', 'chase.', 'four', 'people', 'board', 'vessel', 'near', 'pakistani', 'port', 'city', 'karachi', 'believed', 'killed', 'dramatic', 'episode', 'arabian', 'sea', 'new', 'year', 'eve,', 'according', 'india', 'defence', 'ministry.', 'sweden', 'hit', 'third', 'mosque', 'arson', 'attack', 'week', 'car', 'set', 'alight', 'french', 'new', 'year', 'salary', 'top', 'ceo', 'rose', 'twice', 'fast', 'average', 'canadian', 'since', 'recession:', 'study', 'norway', 'violated', 'equal-pay', 'law,', 'judge', 'says:', 'judge', 'find', 'consulate', 'employee', 'unjustly', 'paid', 'le', 'male', 'counterpart', 'imam', 'want', 'radical', 'recruiter', 'muslim', 'youth', 'canada', 'identified', 'dealt', 'saudi', 'arabia', 'beheaded', 'people', 'year', 'living', 'hell', 'slave', 'remote', 'south', 'korean', 'island', '-', 'slavery', 'thrives', 'chain', 'rural', 'island', 'south', 'korea', 'rugged', 'southwest', 'coast,', 'nurtured', 'long', 'history', 'exploitation', 'demand', 'trying', 'squeeze', 'living', 'sea.', 'world', 'richest', 'get', 'richer,', 
'adding', 'rental', 'car', 'stereo', 'infringe', 'copyright,', 'music', 'right', 'group', 'say', 'ukrainian', 'minister', 'threatens', 'tv', 'channel', 'closure', 'airing', 'russian', 'entertainer', 'palestinian', 'president', 'mahmoud', 'abbas', 'entered', 'serious', 'confrontation', 'yet', 'israel', 'signing', 'onto', 'international', 'criminal', 'court.', 'decision', 'wednesday', 'give', 'court', 'jurisdiction', 'crime', 'committed', 'palestinian', 'lands.', 'israeli', 'security', 'center', 'publishes', 'name', 'killed', 'terrorist', 'concealed', 'hamas', 'year', 'deadliest', 'year', 'yet', 'syria', 'four-year', 'conflict,', 'killed', 'secret', 'underground', 'complex', 'built', 'nazi', 'may', 'used', 'development', 'wmds,', 'including', 'nuclear', 'bomb,', 'uncovered', 'austria.', 'restriction', 'web', 'freedom', 'major', 'global', 'issue', 'austrian', 'journalist', 'erich', 'mchel', 'delivered', 'presentation', 'hamburg', 'annual', 'meeting', 'chaos', 'computer', 'club', 'monday', 'december', 'detailing', 'various', 'location', 'u', 'nsa', 'actively', 'collecting', 'processing', 'electronic', 'intelligence', 'vienna.', 'thousand', 'ukraine', 'nationalist', 'march', 'kiev', 'china', 'new', 'year', 'resolution:', 'harvesting', 'executed', 'prisoner', 'organ', 'authority', 'pull', 'plug', 'russia', 'last', 'politically', 'independent', 'tv', 'station]']
Since each entry is now a list, we join the processed tokens back into a single string:
df['combined_news'] = df['combined_news'].apply(lambda x: ' '.join(x))
[ case cancer result sheer bad luck rather unhealthy lifestyles, diet even inherited genes, new research suggests. random mutation occur dna cell divide responsible two third adult cancer across wide range tissues. iran dismissed united state effort fight islamic state ploy advance u.s. policy region: reality united state acting eliminate daesh. even interested weakening daesh, interested managing poll: one german would join anti-muslim march uk royal family prince andrew named u lawsuit underage sex allegation asylum-seekers refused leave bus arrived destination rural northern sweden, demanding taken back malm big city. pakistani boat blow self india navy chase. four people board vessel near pakistani port city karachi believed killed dramatic episode arabian sea new year eve, according india defence ministry. sweden hit third mosque arson attack week car set alight french new year salary top ceo rose twice fast average canadian since recession: study norway violated equal-pay law, judge says: judge find consulate employee unjustly paid le male counterpart imam want radical recruiter muslim youth canada identified dealt saudi arabia beheaded people year living hell slave remote south korean island - slavery thrives chain rural island south korea rugged southwest coast, nurtured long history exploitation demand trying squeeze living sea. world richest get richer, adding rental car stereo infringe copyright, music right group say ukrainian minister threatens tv channel closure airing russian entertainer palestinian president mahmoud abbas entered serious confrontation yet israel signing onto international criminal court. decision wednesday give court jurisdiction crime committed palestinian lands. israeli security center publishes name killed terrorist concealed hamas year deadliest year yet syria four-year conflict, killed secret underground complex built nazi may used development wmds, including nuclear bomb, uncovered austria. 
restriction web freedom major global issue austrian journalist erich mchel delivered presentation hamburg annual meeting chaos computer club monday december detailing various location u nsa actively collecting processing electronic intelligence vienna. thousand ukraine nationalist march kiev china new year resolution: harvesting executed prisoner organ authority pull plug russia last politically independent tv station]
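The cleaning steps above (lowercase, strip quotes, split, filter, join) can be sketched end to end as follows. To keep the example self-contained it uses a tiny hard-coded stop list as a stand-in for nltk's stopwords.words('english'); `STOP_WORDS`, `keep_token` and `clean_text` are hypothetical names, not part of the original code.

```python
import re

# Stand-in for nltk's English stop list (assumption, deliberately tiny).
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is"}
DIGIT_RE = re.compile(r"\d")

def keep_token(word):
    # Same rule as check(): no stop words, no tokens containing digits.
    return word not in STOP_WORDS and DIGIT_RE.search(word) is None

def clean_text(text):
    # Lowercase, drop quote characters, split on whitespace, then filter.
    tokens = text.lower().replace('"', "").replace("'", "").split()
    return [tok for tok in tokens if keep_token(tok)]

tokens = clean_text('The price of "oil" fell 5% in 2014')
print(tokens)
print(" ".join(tokens))  # back to a single string for TfidfVectorizer
```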

Finally, we can run feature extraction again on the cleaned text to obtain TF-IDF features.
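As an aside, part of this cleanup can also be pushed into TfidfVectorizer itself. The sketch below is an alternative, not the approach used above: it relies on sklearn's built-in English stop list (which differs from nltk's) and on a token pattern I chose so that only purely alphabetic tokens of two or more letters are kept. The sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration.
texts = [
    "The oil price fell 5 percent in 2014",
    "Stocks rally as the oil price steadies",
]

vectorizer = TfidfVectorizer(
    stop_words="english",                 # sklearn's own stop list
    token_pattern=r"(?u)\b[a-zA-Z]{2,}\b",  # letters only, so digits are dropped
)
X = vectorizer.fit_transform(texts)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)
```

Lowercasing is on by default, so "The" and "the" are treated as the same stop word.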

