1.1 Stock Data Preprocessing Exercise
Source: Internet  Editor: 程序博客网  Time: 2024/06/15 08:22
Stage 1: Getting Started with Quantitative Investing Through a Simple Strategy
1.1 Stock Data Preprocessing Exercise
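Whatever aspect of the stock market we want to explore, we first need to work out how to obtain the data and how to preprocess it.
This section uses US stocks as an example to practice stock data preprocessing.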
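With the interface provided by Yahoo Finance, obtaining a company's stock data is fairly easy. The following code retrieves Apple's stock data from 2016 to the present.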
import pandas as pd
import pandas_datareader.data as web
import datetime

# date range: 1 January 2016 up to today
start = datetime.datetime(2016, 1, 1)
end = datetime.date.today()

# download Apple's daily quotes from Yahoo Finance
apple = web.DataReader("AAPL", "yahoo", start, end)
print(apple.head())
The resulting data looks like this:
                  Open        High         Low       Close   Adj Close    Volume
Date
2015-12-31  107.010002  107.029999  104.820000  105.260002  101.703697  40912300
2016-01-04  102.610001  105.370003  102.000000  105.349998  101.790649  67649400
2016-01-05  105.750000  105.849998  102.410004  102.709999   99.239845  55791000
2016-01-06  100.559998  102.370003   99.870003  100.699997   97.297760  68457400
2016-01-07   98.680000  100.129997   96.430000   96.449997   93.191338  81094400
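As you may have noticed, this code depends on the network and will not always run successfully: the connection to the remote server can fail. It is therefore natural to save the downloaded data locally and read it back when needed. The following code saves Apple's stock data to a .csv file and then reads it back.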
##### save data
apple.to_csv(path_or_buf='data_AAPL.csv')

##### read the data from .csv when needed
apple = pd.read_csv(filepath_or_buffer='data_AAPL.csv')
print(apple.head())
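Compare carefully how the data read back from the .csv file differs from the original (this sample starts on 2009-12-31, so it covers a different date range from the first listing):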
         Date       Open       High        Low      Close  Adj Close     Volume
0  2009-12-31  30.447144  30.478571  30.080000  30.104286  27.083506   88102700
1  2010-01-04  30.490000  30.642857  30.340000  30.572857  27.505054  123432400
2  2010-01-05  30.657143  30.798571  30.464285  30.625713  27.552608  150476200
3  2010-01-06  30.625713  30.747143  30.107143  30.138571  27.114347  138040000
4  2010-01-07  30.250000  30.285715  29.864286  30.082857  27.064222  119282800
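As you can see, the index column of the reloaded data has changed: the dates are now an ordinary column and the rows are indexed by integers. This is undesirable, because using time as the index makes subsequent data processing more convenient and more sensible.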
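To illustrate why a time index is convenient, here is a minimal sketch using made-up prices (the values and the variable names `prices`, `january`, `late`, and `monthly_mean` are assumptions for illustration, not data from this article). With a datetime index, rows can be selected by month or by date range directly.

```python
import pandas as pd

# Synthetic closing prices for illustration only, not real market data.
dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-02-01", "2016-02-02"])
prices = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]}, index=dates)

# Partial-string indexing: select every row in January 2016.
january = prices.loc["2016-01"]

# Slice by a date range (both endpoints are included for label slices).
late = prices.loc["2016-01-05":"2016-02-01"]

# Group the rows by calendar month and average the closing price.
monthly_mean = prices.groupby(prices.index.to_period("M"))["Close"].mean()
```

None of these selections work as written when the dates are stored as a plain string column with an integer index.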
The following code therefore modifies the data read from the .csv file to restore its original form.
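import time

date_list = []
for i in range(len(apple)):
    # parse the "YYYY-MM-DD" string and build a datetime from year, month, day
    date_str = apple['Date'][i]
    t = time.strptime(date_str, "%Y-%m-%d")
    temp_date = datetime.datetime(t[0], t[1], t[2])
    date_list.append(temp_date)

# replace the string column with a datetime column and use it as the index
apple['DateTime'] = pd.Series(date_list, apple.index)
del apple['Date']
apple = apple.set_index('DateTime')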
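As an alternative to the loop above (not the approach the article uses), pandas can do the same repair while reading the file. The sketch below assumes the file has a `Date` column in `YYYY-MM-DD` form; an in-memory string stands in for `data_AAPL.csv` so that the example runs without a download.

```python
import io
import pandas as pd

# Stand-in for data_AAPL.csv; the two rows are copied from the sample output.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2009-12-31,30.447144,30.478571,30.080000,30.104286,27.083506,88102700
2010-01-04,30.490000,30.642857,30.340000,30.572857,27.505054,123432400
"""

# index_col makes "Date" the index; parse_dates converts it to datetime values.
apple = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# Optional: use the same index name as the loop-based version.
apple.index.name = "DateTime"
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file name.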
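One more point to note: in the data from Yahoo, only the closing price comes with an adjusted value (Adj Close). However, the ratio of the adjusted close to the close makes it easy to compute adjusted opening, low, and high prices as well. This gives the following function: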
def ohlc_adjust(dat):
    return pd.DataFrame({"Open": dat["Open"] * dat["Adj Close"] / dat["Close"],
                         "High": dat["High"] * dat["Adj Close"] / dat["Close"],
                         "Low": dat["Low"] * dat["Adj Close"] / dat["Close"],
                         "Close": dat["Adj Close"]})
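A small worked example may make the scaling concrete. The numbers below are invented (an assumption for illustration): on the first day the adjusted close is half the close, so every price is halved; on the second day the two are equal, so nothing changes.

```python
import pandas as pd

def ohlc_adjust(dat):
    # Scale open/high/low by the same factor that maps Close to Adj Close.
    return pd.DataFrame({"Open": dat["Open"] * dat["Adj Close"] / dat["Close"],
                         "High": dat["High"] * dat["Adj Close"] / dat["Close"],
                         "Low": dat["Low"] * dat["Adj Close"] / dat["Close"],
                         "Close": dat["Adj Close"]})

# Invented quotes: day 1 has an adjustment ratio of 0.5, day 2 a ratio of 1.0.
raw = pd.DataFrame(
    {"Open": [98.0, 51.0], "High": [102.0, 52.0], "Low": [96.0, 50.0],
     "Close": [100.0, 51.0], "Adj Close": [50.0, 51.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05"]),
)

adjusted = ohlc_adjust(raw)
print(adjusted)
```

The result keeps the date index of the input and contains only the four adjusted price columns; Volume and the unadjusted Close are not carried over.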
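Finally, we combine the steps above so that the program can download, save, read, and repair the stock data of different companies in batches. This is done by three functions in stockdata_preProcess.py (the code is at the end of the article):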
downloadAndSaveData()
repairAndGetData()
ohlc_adjust()
Now we only need to provide a list of ticker symbols for the companies we are interested in to complete the preprocessing, for example:
listed_company_list = ["AAPL", "MSFT", "GOOG", "FB", "TWTR", "NFLX", "AMZN", "SNY", "NTDOY", "IBM", "HPQ"]
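Calling downloadAndSaveData(listed_company_list, start, end) automatically downloads the stock data for the companies in the list over the period from start to end. Because the network may fail, the code also includes a retry mechanism: up to 10 attempts per company, with a wait that grows after each failure.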
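The retry idea can be isolated from the download itself. The following is a minimal sketch, not the article's implementation: `fetch_with_retry`, `flaky_fetch`, and the `sleep` parameter are hypothetical names introduced here, and a fake fetcher replaces the network call so that the behavior can be observed without waiting.

```python
import time

def fetch_with_retry(fetch, max_tries=10, base_delay=5, sleep=time.sleep):
    """Call fetch() until it succeeds or max_tries attempts have failed.

    Returns the value from fetch(), or None if every attempt raised.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except Exception as exc:
            print("attempt", attempt, "failed:", exc)
            # Wait longer after each failure: 5s, 10s, 15s, ...
            sleep(base_delay * attempt)
    return None


# Demonstration with a fake fetcher that fails twice and then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "data"

# Record the requested delays instead of actually sleeping.
delays = []
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
```

Passing the sleep function as a parameter is a design choice made for this sketch: it keeps the waiting schedule visible and lets the example finish instantly.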
In a test run, all the required data was saved to the corresponding .csv files.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Full code:
import pandas as pd
import pandas_datareader.data as web
import datetime
import time
import os


# download the stock data
# parameter explanation:
# start & end : the time interval of the data we want to get (from start to end)
#   e.g. : start = datetime.datetime(2010, 1, 1)
#          end = datetime.date.today()
# listed_company_list : the list of listed companies that we are concerned about
#   e.g. : listed_company_list = ["AAPL", "MSFT", "GOOG", "FB", "TWTR", "NFLX", "AMZN", "YHOO", "SNY", "NTDOY", "IBM", "HPQ"]
def downloadAndSaveData(listed_company_list, start, end):
    downloadResult_list = []
    # downloadResult_list records whether each company's data was downloaded successfully
    for index in range(len(listed_company_list)):
        downloadResult_list.append(False)
    # start downloading data...
    for index in range(len(listed_company_list)):
        companyStr = listed_company_list[index]
        filename = "data_" + companyStr + ".csv"
        if os.path.exists(filename):
            # if the file already exists, we don't need to download it again
            print(companyStr + "'s data already exists ^_^")
            downloadResult_list[index] = True
            continue
        tryNumbers = 0
        max_tryNumbers = 10
        while tryNumbers < max_tryNumbers:
            try:
                print(companyStr + " data connecting start...")
                # try to get the data; this may throw an exception
                data = web.DataReader(companyStr, "yahoo", start, end)
                # save the data in .csv
                data.to_csv(path_or_buf=filename)
                print(companyStr + "'s data has been successfully saved in " + filename + " ^_^")
                downloadResult_list[index] = True
                time.sleep(10)
                break
            except Exception as e:
                print("error:", e)
                print("connecting failed, waiting to reconnect...")
                tryNumbers += 1
                time.sleep(5 * tryNumbers)
        if tryNumbers == max_tryNumbers:
            print("give up to get " + companyStr + "'s data -_-|")
    print("the result shows below:")
    for index in range(len(listed_company_list)):
        print(listed_company_list[index] + " : " + str(downloadResult_list[index]))
    return downloadResult_list


# get the data saved in the .csv files (downloaded and saved by downloadAndSaveData)
# and return the repaired data to the user
# why repair?
# some formats (data types) change when the data is read back from .csv:
# the attribute 'Date' should be the index of the dataframe, and its type has changed from datetime to string
# these changes cause trouble for some methods, so we repair the data before returning it
def repairAndGetData(listed_company_list):
    companyNumber = len(listed_company_list)
    DataSetList = []
    # traverse all the listed companies
    for c in range(companyNumber):
        cur_companyStr = listed_company_list[c]
        cur_fileName = "data_" + cur_companyStr + ".csv"
        cur_companyData = pd.read_csv(filepath_or_buffer=cur_fileName)
        # repair the current company's data:
        # change the type of attribute "Date" from string to datetime, and make it the index of the dataframe
        date_list = []
        for i in range(len(cur_companyData)):
            date_str = cur_companyData['Date'][i]
            t = time.strptime(date_str, "%Y-%m-%d")
            temp_date = datetime.datetime(t[0], t[1], t[2])
            date_list.append(temp_date)
        cur_companyData['DateTime'] = pd.Series(date_list, cur_companyData.index)
        del cur_companyData['Date']
        cur_companyData = cur_companyData.set_index('DateTime')
        # save the repaired data
        DataSetList.append(cur_companyData)
    # return all the repaired data in the original order
    return DataSetList


# adjust the ohlc prices ("Open", "High", "Low", "Close")
# normally the interface only provides the adjusted price of 'Close',
# but the other prices are easy to adjust using the ratio of 'Adj Close' to 'Close'
def ohlc_adjust(dat):
    return pd.DataFrame({"Open": dat["Open"] * dat["Adj Close"] / dat["Close"],
                         "High": dat["High"] * dat["Adj Close"] / dat["Close"],
                         "Low": dat["Low"] * dat["Adj Close"] / dat["Close"],
                         "Close": dat["Adj Close"]})