Time series analysis in Python


Source: http://sysmagazine.com/posts/207160/

Good afternoon, dear readers.
In today's article I will try to describe the process of analyzing time series in Python with the statsmodels module. This module provides a wide range of tools and methods for statistical analysis and econometrics. I will try to show the main stages of the analysis of such series, and at the end we will build an ARIMA model.
As an example, real data on the goods turnover of one of the warehouse complexes in the Moscow region are taken.

Loading and preliminary data handling


To begin with, let's load the data and take a look at it:

from pandas import read_csv, DataFrame
import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.table import SimpleTable
from sklearn.metrics import r2_score
import ml_metrics as metrics

dataset = read_csv('tovar_moving.csv', ';', index_col=['date_oper'],
                   parse_dates=['date_oper'], dayfirst=True)
dataset.head()

            Otgruzka  priemka
date_oper
2009-09-01    179667   276712
2009-09-02    177670   164999
2009-09-03    152112   189181
2009-09-04    142938   254581
2009-09-05    130741   192486

As you can see, in the read_csv() call, besides the parameters that set the columns and the index, we pass three more parameters for working with dates. Let's look at them in more detail.
parse_dates specifies the names of the columns to be converted to the DateTime type. Note that if such a column contains empty values, parsing will fail and the column will come back with type object. To avoid this, add the parameter keep_default_na=False.
The last parameter, dayfirst, tells the parsing function that the day comes first in the string, not the other way around. If this parameter is not set, the function may convert dates incorrectly and swap the month and day. For example, without it 2/1/2013 would be parsed as February 1, 2013 rather than January 2, 2013, which would be wrong.
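A minimal sketch of what dayfirst changes, using pd.to_datetime on a single ambiguous date string (the article's CSV uses day-first dates):

```python
import pandas as pd

# The same string is parsed differently depending on dayfirst
s = pd.to_datetime('2/1/2013', dayfirst=True)   # day first: 2 January 2013
t = pd.to_datetime('2/1/2013', dayfirst=False)  # month first: 1 February 2013

print(s.day, s.month)  # 2 1
print(t.day, t.month)  # 1 2
```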
Let's extract the time series of shipment (Otgruzka) values into a separate Series:

otg = dataset.Otgruzka
otg.head()

date_oper
2009-09-01    179667
2009-09-02    177670
2009-09-03    152112
2009-09-04    142938
2009-09-05    130741
Name: Otgruzka, dtype: int64
Now we have a time series and can move on to its analysis.

Analysis of a time series


To begin with, let's look at the plot of our series:

otg.plot(figsize=(12,6))

[Plot: daily shipment volumes]
From the plot it is clear that our series has a few spikes, which affect its variance. Besides, analyzing shipments for every single day is not quite correct, since, for example, at the end or beginning of a week there are days on which much more goods are shipped than on the rest. Therefore it makes sense to switch to a weekly interval and take the mean shipment value over it; this will rid us of the spikes and reduce the oscillations of our series. In pandas there is a convenient function for this, resample(); the aggregation period and an aggregate function are passed to it as parameters:

otg = otg.resample('W', how='mean')
otg.plot(figsize=(12,6))
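Note that the how= argument has since been removed from pandas; under the current API the same weekly averaging is written as resample('W').mean(). A sketch on a small synthetic daily series standing in for the shipment data:

```python
import pandas as pd
import numpy as np

# Two weeks of synthetic daily data (2009-09-01 is a Tuesday)
idx = pd.date_range('2009-09-01', periods=14, freq='D')
daily = pd.Series(np.arange(14, dtype=float), index=idx)

# Modern pandas: resample(...).mean() instead of resample(..., how='mean')
weekly = daily.resample('W').mean()
print(weekly)  # three week-ending-Sunday bins: means 2.5, 9.0, 13.0
```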

[Plot: weekly mean shipments]
As you can see, the new plot has no sharp spikes and shows a pronounced trend. From this we can conclude that the series is not stationary [1].

itog = otg.describe()
otg.hist()
itog

count       225
mean     270858.285365
std      118371.082975
min         872.857143
25%      180263.428571
50%      277898.714286
75%      355587.285714
max      552485.142857
dtype: float64
[Histogram of weekly shipments]

As the characteristics and the histogram show, the series is more or less homogeneous and has a fairly small variance, as indicated by the coefficient of variation V = σ/x̄, where σ is the standard deviation and x̄ is the arithmetic mean of the sample. In our case it equals:

print 'V = %f' % (itog['std']/itog['mean'])

V = 0.437022

Let's carry out the Jarque–Bera test for normality of the distribution, to confirm the assumption of homogeneity. For this purpose statsmodels provides the function jarque_bera(), which returns the values of this statistic:

row = [u'JB', u'p-value', u'skew', u'kurtosis']
jb_test = sm.stats.stattools.jarque_bera(otg)
a = np.vstack([jb_test])
itog = SimpleTable(a, row)
print itog

[Table: JB statistic, p-value, skew, kurtosis]

The value of the statistic indicates that the null hypothesis of normality is not rejected (p-value > 0.05), and hence our series is normally distributed.
The SimpleTable() function is used for formatting the output. In our case its input is an array of values (at most two-dimensional) and a list with the titles of the columns or rows.
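For reference, the Jarque–Bera statistic itself is easy to compute by hand from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4). A minimal sketch on synthetic normal data (not the article's series):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # synthetic sample drawn from a normal distribution

n = len(x)
m = x.mean()
s2 = ((x - m) ** 2).mean()
skew = ((x - m) ** 3).mean() / s2 ** 1.5   # sample skewness
kurt = ((x - m) ** 4).mean() / s2 ** 2     # sample kurtosis

# JB = n/6 * (S^2 + (K - 3)^2 / 4); small values support normality
jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
print(jb)
```

For truly normal data JB is approximately chi-squared with 2 degrees of freedom, so values far above ~6 cast doubt on normality.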
Many methods and models are based on the assumption that the series is stationary, but, as noted earlier, ours most likely is not. Therefore, to check stationarity, let's carry out the augmented Dickey–Fuller test for the presence of unit roots. The statsmodels module provides the adfuller() function for this:

test = sm.tsa.adfuller(otg)
print 'adf: ', test[0]
print 'p-value: ', test[1]
print 'Critical values: ', test[4]
if test[0] > test[4]['5%']:
    print 'unit roots present, the series is not stationary'
else:
    print 'no unit roots, the series is stationary'

adf:  -1.38835541357
p-value:  0.58784577297
Critical values:  {'5%': -2.8753374677799957, '1%': -3.4617274344627398, '10%': -2.5741240890815571}
unit roots present, the series is not stationary


The test carried out confirms the assumption that the series is not stationary. In many cases, taking the difference of the series makes it stationary. If, for example, the first differences of a series are stationary, it is called an integrated series of order one.
So, let's determine the order of integration of our series:

otg1diff = otg.diff(periods=1).dropna()

In the code above, the diff() function computes the difference of the original series against itself with the given period offset. The offset is passed as the periods parameter. Since the first value of the difference turns out to be undefined, we have to get rid of it, which is what the dropna() method is for.
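A minimal illustration of this on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 19])
d = s.diff(periods=1)   # first element is NaN: nothing to subtract from
d = d.dropna()          # drop the undefined first value
print(d.tolist())       # [2.0, 3.0, 4.0]
```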
Let's check the resulting series for stationarity:

test = sm.tsa.adfuller(otg1diff)
print 'adf: ', test[0]
print 'p-value: ', test[1]
print 'Critical values: ', test[4]
if test[0] > test[4]['5%']:
    print 'unit roots present, the series is not stationary'
else:
    print 'no unit roots, the series is stationary'

adf:  -5.95204224907
p-value:  2.13583392404e-07
Critical values:  {'5%': -2.8755379867788462, '1%': -3.4621857592784546, '10%': -2.574231080806213}
no unit roots, the series is stationary

As the code above shows, the resulting series of first differences is close to stationary. For complete confidence, let's split it into two intervals and make sure the means over the different intervals are equal:

m = otg1diff.index[len(otg1diff.index)/2+1]
r1 = sm.stats.DescrStatsW(otg1diff[m:])
r2 = sm.stats.DescrStatsW(otg1diff[:m])
print 'p-value: ', sm.stats.CompareMeans(r1,r2).ttest_ind()[1]

p-value: 0.693072039563

The high p-value lets us state that the null hypothesis of equal means is not rejected, which indicates the stationarity of the series. It remains to make sure there is no trend; for this we will plot our new series:

otg1diff.plot(figsize=(12,6))

[Plot: first differences of weekly shipments]
The trend is indeed absent, so the series of first differences is stationary, and our original series is an integrated series of order one.

Creation of model of a time series


For modeling we will use the ARIMA model, built for the series of first differences.
So, to construct the model we need to know its order, which consists of the parameters:
  1. p — the order of the AR component
  2. d — the order of integration of the series
  3. q — the order of the MA component


The parameter d is already known; it equals 1. It remains to determine p and q. To determine them we need to study the autocorrelation (ACF) and partial autocorrelation (PACF) functions of the series of first differences.
The ACF will help us determine q, since from its correlogram we can find the number of autocorrelation coefficients strongly different from 0 in the MA model.
The PACF will help us determine p, since from its correlogram we can find the maximum coefficient number strongly different from 0 in the AR model.
To build the corresponding correlograms, the statsmodels package provides the functions plot_acf() and plot_pacf(). They draw the ACF and PACF plots, with lag numbers on the x axis and the values of the corresponding function on the y axis. Note that the number of lags in these functions determines the number of significant coefficients. So, our functions look like this:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(otg1diff.values.squeeze(), lags=25, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(otg1diff, lags=25, ax=ax2)

[Plots: ACF and PACF correlograms of the first differences]
After studying the PACF correlogram we can conclude that p = 1, since on it only 1 lag is strongly different from zero. On the ACF correlogram we can see that q = 1, since after lag 1 the function values drop sharply.
So, now that all the parameters are known, we can build the model. But we will take not all the data for its construction, only a part; the data not included in the model we will leave for checking the accuracy of our model's forecast:

src_data_model = otg[:'2013-05-26']
model = sm.tsa.ARIMA(src_data_model, order=(1,1,1), freq='W').fit(full_output=False, disp=0)

The trend parameter is responsible for the presence of a constant in the model. Let's print the information on the resulting model:

print model.summary()

[Output of model.summary()]

As this information shows, all the coefficients in our model are significant, and we can proceed to evaluating the model.

Analysis and estimation of model


Let's check the residuals of this model for conformity to "white noise", and also analyze the correlogram of the residuals, since this can help us determine elements that are important to include in the regression and to forecast.
So, the first thing we will do is carry out the Ljung–Box Q-test to check the hypothesis that the residuals are random, i.e. that they are "white noise". This test is carried out on the residuals of the ARIMA model. Thus, we first need to get the residuals of the model and build the ACF for them, and then apply the test to the resulting coefficients. With statsmodels this can be done like so:

q_test = sm.tsa.stattools.acf(model.resid, qstat=True)  # the resid attribute stores the model residuals; qstat=True applies the test to the coefficients
print DataFrame({'Q-stat':q_test[1], 'p-value':q_test[2]})

Result

       Q-stat   p-value
0    0.531426  0.466008
1    3.073217  0.215109
2    3.644229  0.302532
3    3.906326  0.418832
4    4.701433  0.453393
5    5.433745  0.489500
6    5.444254  0.605916
7    5.445309  0.709091
8    5.900762  0.749808
9    6.004928  0.814849
10   6.155966  0.862758
11   6.299958  0.900213
12  12.731542  0.468755
13  14.707894  0.398410
14  20.720607  0.145996
15  23.197433  0.108558
16  23.949801  0.120805
17  24.119236  0.151160
18  25.616184  0.141243
19  26.035165  0.164654
20  28.969880  0.114727
21  28.973660  0.145614
22  29.017716  0.179723
23  32.114006  0.124191
24  32.284805  0.149936
25  33.123395  0.158548
26  33.129059  0.192844
27  33.760488  0.208870
28  38.421053  0.113255
29  38.724226  0.132028
30  38.973426  0.153863
31  38.978172  0.184613
32  39.318954  0.207819
33  39.382472  0.241623
34  39.423763  0.278615
35  40.083689  0.293860
36  43.849515  0.203755
37  45.704476  0.182576
38  47.132911  0.174117
39  47.365305  0.197305

The values of this statistic and the p-values indicate that the hypothesis of the randomness of the residuals is not rejected, and most likely this process represents "white noise".
Now let's calculate the coefficient of determination R^2, to understand what percentage of the observations this model describes:

pred = model.predict('2013-05-26','2014-12-31', typ='levels')
trn = otg['2013-05-26':]
r2 = r2_score(trn, pred[1:32])
print 'R^2: %1.2f' % r2

R^2: -0.03

Root mean square error (RMSE) [2] of our model:

metrics.rmse(trn,pred[1:32])

80919.057367642512

Mean absolute error (MAE) [2] of the forecast:

metrics.mae(trn,pred[1:32])

63092.763277651895
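The ml_metrics package used above is hard to find nowadays; both metrics are one-liners in numpy. A sketch of equivalents on hypothetical toy arrays (not the article's data):

```python
import numpy as np

# Toy stand-ins for the actual and predicted series
trn  = np.array([100.0, 200.0, 300.0])
pred = np.array([110.0, 190.0, 330.0])

rmse = np.sqrt(np.mean((trn - pred) ** 2))  # root mean square error
mae  = np.mean(np.abs(trn - pred))          # mean absolute error
print(rmse, mae)
```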

It remains to draw our forecast on the plot:

otg.plot(figsize=(12,6))
pred.plot(style='r--')

[Plot: actual series with the forecast overlaid]

Inference


As the plot shows, our model makes a not very good forecast. This is partly due to the spikes in the original data, which we did not fully remove, and partly to the ARIMA module of the statsmodels package, since it is quite new. The article is aimed more at showing how time series can be analyzed in Python. I would also like to note that the package considered today implements various methods of regression analysis very fully (I will try to show this in future articles).
On the whole, the statsmodels package is quite suitable for small investigations, but for serious scientific work it is still somewhat raw: some tests and statistics are missing from it.

Links


  1. I. I. Yeliseyeva. Econometrics
  2. Comparison of time series models