Predicting Financial Market Movements from Daily News, Version 1

Source: Internet  Editor: 程序博客网  Time: 2024/05/30 07:14

The data comes from an overseas website (Reddit) that is similar to China's Tieba forums.

### RedditNews.csv: Two columns. The first column is the date and the second is the news headline. Headlines are ranked from top to bottom by how hot they are, so there are 25 lines for each date.

### DJIA_table.csv: Downloaded directly from Yahoo Finance; see the web page for more info.

### Combined_News_DJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the remaining ones are news headlines ranging from "Top1" to "Top25".
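To make the relationship between the three files concrete, here is a minimal sketch of how a combined table of this shape could be assembled. It is not the dataset author's actual procedure. The column names `News` and `Adj Close`, the helper `combine`, the toy rows, and the rule that `Label` is 1 when the adjusted close did not fall versus the previous trading day are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for RedditNews.csv and DJIA_table.csv (hypothetical rows and values).
news = pd.DataFrame({
    "Date": ["2016-07-01"] * 3 + ["2016-06-30"] * 3,
    "News": ["a1", "a2", "a3", "b1", "b2", "b3"],  # already ordered hottest-first per date
})
djia = pd.DataFrame({
    "Date": ["2016-06-29", "2016-06-30", "2016-07-01"],
    "Adj Close": [100.0, 101.0, 99.5],
})


def combine(news, djia, top_n):
    """Pivot the ranked headlines into Top1..TopN columns and attach a Label."""
    news = news.copy()
    # Position within each date, starting at 1, taken from the existing row order.
    news["rank"] = news.groupby("Date").cumcount() + 1
    news = news[news["rank"] <= top_n]
    wide = news.pivot(index="Date", columns="rank", values="News")
    wide.columns = [f"Top{r}" for r in wide.columns]

    djia = djia.sort_values("Date").copy()
    prev_close = djia["Adj Close"].shift(1)
    # ASSUMPTION: Label is 1 when the close rose or stayed flat, otherwise 0.
    djia["Label"] = (djia["Adj Close"] >= prev_close).astype(int)
    djia = djia[prev_close.notna()]  # the first day has no previous close to compare with

    return djia[["Date", "Label"]].merge(wide.reset_index(), on="Date", how="inner")


combined = combine(news, djia, top_n=3)
print(combined)
```

With `top_n=25` on the real files, the result would have the 27 columns described above, provided the assumed column names match.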


    # Load packages
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    import pandas as pd
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score
    from datetime import date
    import os

    # Import the data
    os.chdir(r'D:/.../..../利用每日新闻预测金融市场变化')
    data = pd.read_csv('Combined_News_DJIA.csv')
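After loading, it can help to confirm that the frame really has the layout described for Combined_News_DJIA.csv. The following is a small sketch of such a check. The helper `check_layout` and the demo frame are hypothetical, and the binary 0/1 `Label` is an assumption inferred from how the label is used with `roc_auc_score` in this post.

```python
import pandas as pd


def check_layout(df, n_headlines=25):
    """Return a list of problems; an empty list means the layout looks as expected."""
    expected = ["Date", "Label"] + [f"Top{i}" for i in range(1, n_headlines + 1)]
    problems = []
    missing = [c for c in expected if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # ASSUMPTION: Label is binary (0 or 1).
    if "Label" in df.columns and not set(df["Label"].unique()) <= {0, 1}:
        problems.append("Label contains values other than 0 and 1")
    if "Date" in df.columns and pd.to_datetime(df["Date"], errors="coerce").isna().any():
        problems.append("unparseable dates")
    return problems


# Hypothetical two-row frame with the same 27-column layout.
demo = pd.DataFrame({"Date": ["2016-06-30", "2016-07-01"], "Label": [0, 1]})
for i in range(1, 26):
    demo[f"Top{i}"] = ["headline a", "headline b"]

print(check_layout(demo))
```

On the real data the call would be `check_layout(data)`.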

    # Merge the headline columns into one string per day
    # (join the individual headlines with spaces rather than stringifying the whole array)
    data["combined_news"] = data.filter(regex="Top.*").apply(
        lambda x: " ".join(str(v) for v in x.values), axis=1)

    # Split into training and test sets by date
    train = data[data["Date"] < "2015-01-01"]
    test = data[data["Date"] > "2014-12-31"]

    # Extract features
    feature_extraction = TfidfVectorizer()
    # Fit the TfidfVectorizer on the training text and transform it into TF-IDF features
    X_train = feature_extraction.fit_transform(train["combined_news"].values)
    # Reuse the fitted vocabulary on the test text
    X_test = feature_extraction.transform(test["combined_news"].values)
    y_train = train["Label"].values  # labels as a NumPy array
    y_test = test["Label"].values

    # Train the model
    clf = SVC(probability=True, kernel="rbf")
    clf.fit(X_train, y_train)

    # Predict class probabilities
    predictions = clf.predict_proba(X_test)

    # Evaluate with ROC-AUC
    print("ROC-AUC yields " + str(roc_auc_score(y_test, predictions[:, 1])))
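The same steps can be tried end to end without the CSV file. The sketch below runs them on a hypothetical eight-row frame with two headline columns. It differs from the code above in two ways, both chosen for illustration: dates are parsed with `pd.to_datetime` instead of being compared as strings, and the model is scored with `decision_function` rather than `predict_proba`, since ROC-AUC only needs a ranking score and this avoids fitting probability estimates on so few rows. The headlines and labels are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Hypothetical miniature of Combined_News_DJIA.csv.
toy = pd.DataFrame({
    "Date": ["2014-12-26", "2014-12-29", "2014-12-30", "2014-12-31",
             "2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07"],
    "Label": [1, 0, 1, 0, 1, 0, 1, 0],
    "Top1": ["stocks rally on strong jobs data", "markets crash amid bank fears",
             "rally continues as profits rise", "crash deepens on trade fears",
             "strong profits fuel stocks rally", "bank fears trigger market crash",
             "jobs data lifts stocks", "trade fears hit markets"],
    "Top2": ["growth beats forecasts", "factory output slumps",
             "exports rise sharply", "unemployment climbs",
             "growth forecasts rise", "output slumps again",
             "exports beat forecasts", "unemployment climbs further"],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Join the headline columns with spaces, one string per day.
headline_cols = toy.filter(regex=r"^Top\d+$").columns
toy["combined_news"] = toy[headline_cols].astype(str).apply(
    lambda row: " ".join(row.values), axis=1)

# Date-based split: everything before 2015 trains, the rest tests.
cutoff = pd.Timestamp("2015-01-01")
train = toy[toy["Date"] < cutoff]
test = toy[toy["Date"] >= cutoff]

# Fit TF-IDF on the training text only, then reuse its vocabulary on the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["combined_news"])
X_test = vectorizer.transform(test["combined_news"])

clf = SVC(kernel="rbf")
clf.fit(X_train, train["Label"].values)

# Signed distance to the decision boundary; larger means more likely Label 1.
scores = clf.decision_function(X_test)
auc = roc_auc_score(test["Label"].values, scores)
print("ROC-AUC on the toy data:", auc)
```

A score on eight invented rows says nothing about real predictive power. The point is only to show the shape of the pipeline and that the vectorizer is fitted on the training rows alone.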

