Kaggle_news_stock: Simple Text Feature Processing
Source: Internet  Editor: 程序博客网  Date: 2024/06/03 13:32
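Abstract:
This is the DJIA movement prediction task from https://www.kaggle.com/aaron7sun/stocknews. It is a binary classification problem.
It is also a text classification problem, since the features are text.
The basic approach is TF-IDF + SVM, the usual baseline for text classification.
Data exploration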
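import pandas as pd
import numpy as np
from sklearn.svm import SVC

df = pd.read_csv('Combined_News_DJIA.csv')
df.head(10)
# Merge the Top* headline columns of each row into one string (row-wise values).
# Note: str(x.values) is the printed form of the row array, so the text keeps its surrounding '[' and ']'.
df['combined_news'] = df.filter(regex=('Top.*')).apply(lambda x: ''.join(str(x.values)), axis=1)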
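Since we need all of the headlines, we concatenate the text of every headline column for each sample.
In a real scenario we need held-out data. Because no test set is provided, we split the dataset ourselves here: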
train_df = df[df['Date'] < '2015-01-01']
test_df = df[df['Date'] > '2014-12-31']
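The filter above compares date strings directly. That works as long as the dates are stored in ISO `YYYY-MM-DD` form, because lexicographic order then matches chronological order. Here is a minimal sketch using plain Python tuples instead of a DataFrame; the sample rows and the name `CUTOFF` are made up for illustration:

```python
# Hypothetical (date, label) rows standing in for the DataFrame
rows = [
    ("2014-12-30", 0),
    ("2014-12-31", 1),
    ("2015-01-02", 1),
    ("2015-01-05", 0),
]

CUTOFF = "2015-01-01"

# ISO-formatted dates sort the same way as strings and as calendar dates,
# so a plain string comparison splits the rows chronologically.
train_rows = [r for r in rows if r[0] < CUTOFF]
test_rows = [r for r in rows if r[0] >= CUTOFF]

print(len(train_rows), len(test_rows))
```

For date-only strings, `>= '2015-01-01'` selects the same rows as `> '2014-12-31'`, so the two halves never overlap.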
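TF-IDF is a statistical measure of how important a word is to one document within a collection or corpus.
A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus. Variants of TF-IDF weighting are often used by search engines to score or rank how relevant a document is to a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to decide the order in which documents appear in search results.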
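To make the weighting concrete, here is a minimal pure-Python sketch of the textbook form, tf × log(N / df). The function name `tf_idf` and the toy sentences are illustrative assumptions. sklearn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (made up for illustration)
docs = [
    "the market rallied on strong earnings",
    "the market fell on weak earnings",
    "the oil market stayed flat",
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.4f}")
```

Words that occur in every document ("the", "market") get weight 0, while a word unique to one document ("strong") scores highest within it.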
We use the text feature extraction that sklearn provides:
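from sklearn.feature_extraction.text import TfidfVectorizer

def extractFeature(train_df, test_df):
    feature_extraction = TfidfVectorizer()
    # Fit the vocabulary and IDF weights on the training text only
    train_X = feature_extraction.fit_transform(train_df['combined_news'].values)
    # Reuse the fitted vocabulary for the test text
    test_X = feature_extraction.transform(test_df["combined_news"].values)
    return train_X, test_X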
Now we can train and test the model:
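from sklearn import metrics

# Features and labels. The Label column is assumed to be the target; the original snippet does not define train_y/test_y.
train_X, test_X = extractFeature(train_df, test_df)
train_y = train_df['Label'].values
test_y = test_df['Label'].values

# Training. Binary classification: probability=True keeps probability estimates instead of only the sign of the decision function.
clf = SVC(probability=True, kernel='rbf')
clf.fit(train_X, train_y)
predictions = clf.predict_proba(test_X)[:, 1]  # probability of the positive class
# AUC of the predicted probabilities
metrics.roc_auc_score(test_y, predictions)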
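`roc_auc_score` expects one score per sample, for example the predicted probability of the positive class. As a rough illustration of what the number means, the sketch below computes AUC directly as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The function name `roc_auc` and the toy values are assumptions for illustration, not part of the original pipeline.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Toy labels and scores (made up for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC near 0.5 means the scores rank positives and negatives no better than chance.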
We can see that the score is fairly low. How can we improve it? Clearly, feeding the raw text straight into TF-IDF is not enough.
Feature engineering
# Lowercase, strip quote characters, then split on whitespace
df['combined_news'].str.lower().str.replace('"', '').str.replace("'", '').str.split()
- Remove stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')
- Remove numbers
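import re
from nltk.corpus import stopwords

def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

We use nltk's stop-word list to filter these tokens out:

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))

def check(word):
    # Keep a word only if it is not a stop word and contains no digit
    if word in stop:
        return False
    elif hasNumbers(word):
        return False
    else:
        return True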
This gives the following result:
['[', 'case', 'cancer', 'result', 'sheer', 'bad', 'luck', 'rather', 'unhealthy', 'lifestyles,', 'diet', 'even', 'inherited', 'genes,', 'new', 'research', 'suggests.', 'random', 'mutation', 'occur', 'dna', 'cell', 'divide', 'responsible', 'two', 'third', 'adult', 'cancer', 'across', 'wide', 'range', 'tissues.', 'iran', 'dismissed', 'united', 'state', 'effort', 'fight', 'islamic', 'state', 'ploy', 'advance', 'u.s.', 'policy', 'region:', 'reality', 'united', 'state', 'acting', 'eliminate', 'daesh.', 'even', 'interested', 'weakening', 'daesh,', 'interested', 'managing', 'poll:', 'one', 'german', 'would', 'join', 'anti-muslim', 'march', 'uk', 'royal', 'family', 'prince', 'andrew', 'named', 'u', 'lawsuit', 'underage', 'sex', 'allegation', 'asylum-seekers', 'refused', 'leave', 'bus', 'arrived', 'destination', 'rural', 'northern', 'sweden,', 'demanding', 'taken', 'back', 'malm', 'big', 'city.', 'pakistani', 'boat', 'blow', 'self', 'india', 'navy', 'chase.', 'four', 'people', 'board', 'vessel', 'near', 'pakistani', 'port', 'city', 'karachi', 'believed', 'killed', 'dramatic', 'episode', 'arabian', 'sea', 'new', 'year', 'eve,', 'according', 'india', 'defence', 'ministry.', 'sweden', 'hit', 'third', 'mosque', 'arson', 'attack', 'week', 'car', 'set', 'alight', 'french', 'new', 'year', 'salary', 'top', 'ceo', 'rose', 'twice', 'fast', 'average', 'canadian', 'since', 'recession:', 'study', 'norway', 'violated', 'equal-pay', 'law,', 'judge', 'says:', 'judge', 'find', 'consulate', 'employee', 'unjustly', 'paid', 'le', 'male', 'counterpart', 'imam', 'want', 'radical', 'recruiter', 'muslim', 'youth', 'canada', 'identified', 'dealt', 'saudi', 'arabia', 'beheaded', 'people', 'year', 'living', 'hell', 'slave', 'remote', 'south', 'korean', 'island', '-', 'slavery', 'thrives', 'chain', 'rural', 'island', 'south', 'korea', 'rugged', 'southwest', 'coast,', 'nurtured', 'long', 'history', 'exploitation', 'demand', 'trying', 'squeeze', 'living', 'sea.', 'world', 'richest', 'get', 'richer,', 
'adding', 'rental', 'car', 'stereo', 'infringe', 'copyright,', 'music', 'right', 'group', 'say', 'ukrainian', 'minister', 'threatens', 'tv', 'channel', 'closure', 'airing', 'russian', 'entertainer', 'palestinian', 'president', 'mahmoud', 'abbas', 'entered', 'serious', 'confrontation', 'yet', 'israel', 'signing', 'onto', 'international', 'criminal', 'court.', 'decision', 'wednesday', 'give', 'court', 'jurisdiction', 'crime', 'committed', 'palestinian', 'lands.', 'israeli', 'security', 'center', 'publishes', 'name', 'killed', 'terrorist', 'concealed', 'hamas', 'year', 'deadliest', 'year', 'yet', 'syria', 'four-year', 'conflict,', 'killed', 'secret', 'underground', 'complex', 'built', 'nazi', 'may', 'used', 'development', 'wmds,', 'including', 'nuclear', 'bomb,', 'uncovered', 'austria.', 'restriction', 'web', 'freedom', 'major', 'global', 'issue', 'austrian', 'journalist', 'erich', 'mchel', 'delivered', 'presentation', 'hamburg', 'annual', 'meeting', 'chaos', 'computer', 'club', 'monday', 'december', 'detailing', 'various', 'location', 'u', 'nsa', 'actively', 'collecting', 'processing', 'electronic', 'intelligence', 'vienna.', 'thousand', 'ukraine', 'nationalist', 'march', 'kiev', 'china', 'new', 'year', 'resolution:', 'harvesting', 'executed', 'prisoner', 'organ', 'authority', 'pull', 'plug', 'russia', 'last', 'politically', 'independent', 'tv', 'station]']
However, this is a list, so we turn the filtered list back into a string:
df['combined_news'].apply(lambda x: ' '.join(x))
[ case cancer result sheer bad luck rather unhealthy lifestyles, diet even inherited genes, new research suggests. random mutation occur dna cell divide responsible two third adult cancer across wide range tissues. iran dismissed united state effort fight islamic state ploy advance u.s. policy region: reality united state acting eliminate daesh. even interested weakening daesh, interested managing poll: one german would join anti-muslim march uk royal family prince andrew named u lawsuit underage sex allegation asylum-seekers refused leave bus arrived destination rural northern sweden, demanding taken back malm big city. pakistani boat blow self india navy chase. four people board vessel near pakistani port city karachi believed killed dramatic episode arabian sea new year eve, according india defence ministry. sweden hit third mosque arson attack week car set alight french new year salary top ceo rose twice fast average canadian since recession: study norway violated equal-pay law, judge says: judge find consulate employee unjustly paid le male counterpart imam want radical recruiter muslim youth canada identified dealt saudi arabia beheaded people year living hell slave remote south korean island - slavery thrives chain rural island south korea rugged southwest coast, nurtured long history exploitation demand trying squeeze living sea. world richest get richer, adding rental car stereo infringe copyright, music right group say ukrainian minister threatens tv channel closure airing russian entertainer palestinian president mahmoud abbas entered serious confrontation yet israel signing onto international criminal court. decision wednesday give court jurisdiction crime committed palestinian lands. israeli security center publishes name killed terrorist concealed hamas year deadliest year yet syria four-year conflict, killed secret underground complex built nazi may used development wmds, including nuclear bomb, uncovered austria. restriction web freedom major global issue austrian journalist erich mchel delivered presentation hamburg annual meeting chaos computer club monday december detailing various location u nsa actively collecting processing electronic intelligence vienna. thousand ukraine nationalist march kiev china new year resolution: harvesting executed prisoner organ authority pull plug russia last politically independent tv station]
Finally, we can extract TF-IDF features again from the cleaned text.
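The cleaning steps (lowercase, strip quotes, split, drop stop words and tokens containing digits, join back into a string) can be sketched end to end in plain Python. The tiny `STOP_WORDS` set, the helper names `keep` and `clean`, and the sample sentence are stand-ins for illustration; the article itself uses nltk's English stop-word list and applies the steps to `df['combined_news']`.

```python
import re

# Tiny stand-in stop-word set (hypothetical); the article uses nltk's English list
STOP_WORDS = {"the", "of", "a", "in", "is", "to", "and", "than"}
DIGIT = re.compile(r"\d")

def keep(word):
    """Keep a token only if it is not a stop word and contains no digit."""
    return word not in STOP_WORDS and DIGIT.search(word) is None

def clean(text):
    """Lowercase, strip quote characters, split on whitespace, filter, and re-join."""
    tokens = text.lower().replace('"', "").replace("'", "").split()
    return " ".join(t for t in tokens if keep(t))

# Made-up headline for illustration
sample = 'Salary of "Top CEOs" rose 25% in 2014, faster than the average'
cleaned = clean(sample)
print(cleaned)  # salary top ceos rose faster average
```

With pandas, the same function could be applied row by row, for example `df['combined_news'].apply(clean)`, before the text is passed to `TfidfVectorizer` again.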