基于时间序列模型的预测

来源:互联网 发布:淘宝店简介怎么写 编辑:程序博客网 时间:2024/05/01 06:07

时间序列预测

自相关与偏自相关
读取数据
可视化数据
稳定时间序列
ARIMA模型
可视化时间序列预测
在需要考虑季节性上的模型使用SARIMA模型
SARIMA模型

Jupyter 设置

In [1]:
%%javascriptIPython.OutputArea.prototype._should_scroll = function(lines) {    return false;}

使用pandas 和 pyplot

In [2]:
import pandas as pd, numpy as npimport matplotlib.pyplot as plt#Jupyter 设置%matplotlib inline

读取数据及简单的预处理

载入数据集

In [3]:
df_fx_data = pd.read_csv('BOE-XUDLERD.csv')df_fx_data.head()
Out[3]:
 DateValue02017-06-090.894112017-06-080.891222017-06-070.887832017-06-060.887542017-06-050.8888

查看日期的数据类型:str

In [4]:
type(df_fx_data['Date'][0])
Out[4]:
str
In [5]:
df_fx_data['Date'] = pd.to_datetime(df_fx_data['Date'])df_fx_data.head()
Out[5]:
 DateValue02017-06-090.894112017-06-080.891222017-06-070.887832017-06-060.887542017-06-050.8888

现在日期类型为: pandas._libs.tslib.Timestamp

In [6]:
type(df_fx_data['Date'][0])
Out[6]:
pandas._libs.tslib.Timestamp

用变量Date作为索引

In [7]:
indexed_df = df_fx_data.set_index('Date')

时间序列的变量即为Value

In [8]:
ts = indexed_df['Value']ts.head(5)
Out[8]:
Date2017-06-09    0.89412017-06-08    0.89122017-06-07    0.88782017-06-06    0.88752017-06-05    0.8888Name: Value, dtype: float64

绘制折线图

In [9]:
plt.plot(ts)
Out[9]:
[<matplotlib.lines.Line2D at 0x23e7a80d470>]

重取样时间

以周为最小单位重取样时间,重取样方式为每周的均值

In [10]:
ts_week = ts.resample('W').mean()ts_week.plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e7a8472e8>

resample的规则:
年 月 日 周:A M D W 
时 分 秒:H T S 
每3年:3A 
每5小时:5H 

以每两年为最小单位重取样时间,重取样方式为每年的加总

In [11]:
ts.resample('2A').sum().plot()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e7a82b6a0>

自相关与偏自相关函数

自相关: lag=n时为 YtYt 和 YtnYt−n 的相关系数 
偏自相关: lag=n时 在不考虑YtYt 和 YtnYt−n与之间其他$Y_{t-m}\quad(m<n)$ 相关性的情况下="" $y_t$="" 和="" $y_{t-n}$的相关性

In [12]:
from statsmodels.graphics.tsaplots import plot_pacf, plot_acffrom matplotlib.ticker import MaxNLocatordef acf_pacf(series, lags=40, title=None, figsize=(10,6)):    #ACF and PACF plots    # 求自相关和偏自相关函数    # series 输入的时间序列    # lags 自相关和偏自相关函数的滞后取值范围    fig = plt.figure(figsize=figsize)    ax1 = fig.add_subplot(211)    ax1.set_xlabel('lags')    ax1.set_ylabel('Autocorrelation')    # x坐标为整数    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))    plot_acf(series.dropna(), lags=lags, ax=ax1, title='ACF of %s'%title)    ax2 = fig.add_subplot(212)    ax2.set_xlabel('lags')    ax2.set_ylabel('Partial Autocorrelation')    ax2.xaxis.set_major_locator(MaxNLocator(integer=True))    plot_pacf(series.dropna(), lags=lags, ax=ax2, title='PACF of %s'%title)    fig.tight_layout()
C:\Users\iota\Anaconda3\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.  from pandas.core import datetools

ts_week在lag数较高的区间自相关强并显著, 但lags=2以上的自相关很可能仅是由的lag=1的自相关累积下来的. PACF的lag=1 可以印证这一点.

In [13]:
acf_pacf(ts_week, lags=20, title='Weekly Euro Dollar Exchange Rate')

白噪声检验

一个白噪声序列
浅色部分为95%置信度区间

In [14]:
from random import gaussfrom random import seedfrom pandas import Seriesfrom pandas.tools.plotting import autocorrelation_plot# seed random number generatorseed(1)# create white noise serieswhitenoise = Series([gauss(0.0, 1.0) for i in range(1000)])acf_pacf(whitenoise, lags=40, title='White Noise Series')

汇率时间序列, 显然不是一个白噪声序列

In [15]:
acf_pacf(ts_week, lags=40, title='Weekly Euro Dollar Exchange Rate')

稳定时间序列

检查稳定性(Stationarity)

稳定性(Stationarity): 在稳定的时间序列中变量的统计属性,诸如均值(mean),方差(variance),自相关(autocorrelation),自协方差(autocovariance) 等不随时间改变。

adfuller(timeseries, autolag='AIC') autolag: 按某个度量自动选择测试使用的lag数量以使该度量最小化
AIC: Akaike Information Criterion, 估算两个回归模型之间的KL散度/相对熵; KL散度/相对熵用于衡量近似分布与真实分布的接近程度
p-value: 显著性概率

In [16]:
from statsmodels.tsa.stattools import adfullerdef test_stationarity(timeseries, window=10):    # 求移动平均和移动标准差    # window为选取的时间窗口个数    rolmean = timeseries.rolling(window=window, center=False).mean()    rolstd = timeseries.rolling(window=window, center=False).std()    # 将以周重取样后的时间序列、移动平均和移动标准差制成折线图    orig = plt.plot(timeseries, color='blue', label='Original')    mean = plt.plot(rolmean, color='red', label='Rolling Mean')    std = plt.plot(rolstd, color='black', label='Rolling Std')    plt.legend(loc='best')    plt.title('Rolling Mean & Standard Deviation')    plt.show(block=False)    # 用Augmented Dickey-Fuller检验测试时间序列稳定性:    print('Results of Augmented Dickey-Fuller Test:')    # 使用减小AIC的办法估算ADF测试所需的滞后数    dftest = adfuller(timeseries, autolag='AIC')    # 将ADF测试结果、显著性概率、所用的滞后数和所用的观测数打印出来    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value',                                              'Num Lags Used',                                              'Num Observations Used'])    for key, value in dftest[4].items():        dfoutput['Critical Value (%s)' % key] = value    print(dfoutput)
In [17]:
test_stationarity(ts_week)
Results of Augmented Dickey-Fuller Test:Test Statistic             -2.058902p-value                     0.261358Num Lags Used               2.000000Num Observations Used    2212.000000Critical Value (1%)        -3.433310Critical Value (10%)       -2.567466Critical Value (5%)        -2.862848dtype: float64

-2.217962 > Critical Value (5%), p-value > 0.05

移除趋势和季节性

seasonaldecompose: 按照滑动均值将维经过指数转化的时间序列分为趋势(trend), 季节性(seasonality)和残差(residual)三部分
有明显的趋势但季节性并不明显(数值变化范围小)
残差不是白噪音
对数转换: np.log(series)
使用第一差分去掉趋势: $Y' = Y
{t} - Y_{t-1}$
比较发现这里log再差分的平稳性较仅差分要好

In [18]:
from statsmodels.tsa.seasonal import seasonal_decomposefrom statsmodels.stats.diagnostic import acorr_ljungboxdef decompose_plot(series, title=''):    decomposition = seasonal_decompose(series)    trend = decomposition.trend    seasonal = decomposition.seasonal    residual = decomposition.resid    fig = decomposition.plot()    fig.set_size_inches(10,6)    fig.suptitle(title)    fig.tight_layout()    fig2 = acf_pacf(residual, title='Residuals', figsize=(10,4))# 每周汇率decompose_plot(ts_week, title='ts_week')
In [19]:
# 使用第一差分去掉趋势ts_week_diff = ts_week - ts_week.shift(1)test_stationarity(ts_week_diff.dropna())decompose_plot(ts_week_diff.dropna(), title='ts_week_diff')
Results of Augmented Dickey-Fuller Test:Test Statistic            -29.988468p-value                     0.000000Num Lags Used               1.000000Num Observations Used    2212.000000Critical Value (1%)        -3.433310Critical Value (10%)       -2.567466Critical Value (5%)        -2.862848dtype: float64
In [20]:
# 对数转换, 使用第一差分去掉趋势ts_week_log = np.log(ts_week)ts_week_log_diff = ts_week_log - ts_week_log.shift(1)test_stationarity(ts_week_log_diff.dropna())decompose_plot(ts_week_log_diff.dropna(), title='ts_week_log_diff')plt.show()
Results of Augmented Dickey-Fuller Test:Test Statistic            -36.417784p-value                     0.000000Num Lags Used               0.000000Num Observations Used    2213.000000Critical Value (1%)        -3.433308Critical Value (10%)       -2.567466Critical Value (5%)        -2.862847dtype: float64

建立模型

ARIMA(p, d, q)

找到合适的差分阶数 d

刚才通过一次差分已达到平稳 d=1 
也可以通过ACF和PACF来找到合适的d

In [21]:
acf_pacf(ts_week_log, title='ts_week_log')

序列在lag>1出有显著的正自相关, 需提高差分阶数
lag=1 时自相关为0或负值, 无需差分
lag=1 时自相关< -0.5, 则为过差分
因此ts_week_log 需要差分一次, ts_week_log_diff是差分一次后的ts_week_log

In [22]:
acf_pacf(ts_week_log_diff, title='ts_week_log_diff', lags=10)

序列在lag>1已无显著的正自相关, 无需差分

AR和MA项数 p, q

得到平稳的时间序列后需要决定p和q
参照上图

p: 如果PACF(lag<=k)显著且PACF(lag>k)均不显著, 考虑将p设为k
q: 如果ACF(lag<=k)显著且ACF(lag>k)均不显著, 考虑将q设为k

p: 如果差分后序列的PACF迅速衰减到0而ACF在lag较高处仍有显著的较强的相关, 和/或 ACF(lag=1) > 0, 考虑增加p
q: 如果差分后序列的ACF迅速衰减到0, 和/或 ACF(lag=1) < 0, 考虑增加q

绘出拟合后的ARIMA的拟合残差的ACF和PACF, 按上述规则调整p和/或q
下图为ARIMA(0,1,0)的拟合残差
可以考虑p=1, q=1

In [23]:
from statsmodels.tsa.arima_model import ARIMA# ARIMA的参数: order = p, d, q model = ARIMA(ts_week_log, order=(0, 1, 0))  # 已拟合的模型results_ARIMA = model.fit()  # 拟合的数据和原始数据的残差residuals = pd.DataFrame(results_ARIMA.resid)acf_pacf(residuals, lags=10, title='Residuals, p=0, d=1, q=0')

这里ARIMA(1,1,0), ARIMA(0,1,1) 和 ARIMA(1,1,1)都是可用的
ARIMA模型是通过优化AIC, HQIC, BIC等得出的, 一般来说AIC较小的模型拟合度较好, 但拟合度较好不能说明预测能力好

In [24]:
a110=ARIMA(ts_week_log, order=(1, 1, 0)).fit()a011=ARIMA(ts_week_log, order=(0, 1, 1)).fit()a111=ARIMA(ts_week_log, order=(1, 1, 1)).fit()acf_pacf(a110.resid, lags=10, title='Residuals, p=1, d=1, q=0, aic=%f'%a110.aic)acf_pacf(a011.resid, lags=10, title='Residuals, p=1, d=1, q=1, aic=%f'%a011.aic)acf_pacf(a111.resid, lags=10, title='Residuals, p=0, d=1, q=1, aic=%f'%a111.aic)

拟合ARIMA模型

In [25]:
from statsmodels.tsa.arima_model import ARIMA# ARIMA的参数: order = p, d, q model = ARIMA(ts_week_log, order=(1, 1, 1))  # 已拟合的模型results_ARIMA = model.fit()  # ARIMA的结果print(results_ARIMA.summary())# 比较拟合的数据和原始数据,# 注意fittedvalues返回的是d次差分后的序列, 因此和ts_week_log_diff比较plt.plot(ts_week_log_diff)plt.plot(results_ARIMA.fittedvalues, color='red')# 拟合的数据和原始数据的残差residuals = pd.DataFrame(results_ARIMA.resid)# 计算拟残差平方和(RSS)plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_week_log_diff)**2))acf_pacf(residuals, lags=10, title='Residuals')# 残差的核密度估计residuals.plot(kind='kde')print('\n\nResiduals Summary:\n', residuals.describe())plt.show()
                             ARIMA Model Results                              ==============================================================================Dep. Variable:                D.Value   No. Observations:                 2214Model:                 ARIMA(1, 1, 1)   Log Likelihood                6796.505Method:                       css-mle   S.D. of innovations              0.011Date:                Tue, 20 Jun 2017   AIC                         -13585.010Time:                        05:37:20   BIC                         -13562.200Sample:                    01-12-1975   HQIC                        -13576.678                         - 06-11-2017                                         =================================================================================                    coef    std err          z      P>|z|      [0.025      0.975]---------------------------------------------------------------------------------const          6.925e-05      0.000      0.223      0.824      -0.001       0.001ar.L1.D.Value     0.1447      0.087      1.666      0.096      -0.026       0.315ma.L1.D.Value     0.1125      0.087      1.286      0.199      -0.059       0.284                                    Roots                                    =============================================================================                 Real           Imaginary           Modulus         Frequency-----------------------------------------------------------------------------AR.1            6.9095           +0.0000j            6.9095            0.0000MA.1           -8.8887           +0.0000j            8.8887            0.5000-----------------------------------------------------------------------------Residuals Summary:                  0count  2214.000000mean      0.000001std       0.011237min      -0.06144625%      -0.00674050%      -0.00025175%       0.006910max       0.063923

ARIMA模型会使拟合和原始数据的残差变为白噪声

In [26]:
acf_pacf(results_ARIMA.fittedvalues-ts_week_log_diff, title='Residuals')

将拟合的值逆向转换并可视化

In [27]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)predictions_ARIMA_diff.head()
Out[27]:
Date1975-01-12    0.0000691975-01-19   -0.0024061975-01-26    0.0010111975-02-02   -0.0040281975-02-09   -0.001174Freq: W-SUN, dtype: float64

差分的逆向转换为累计和
指数的逆向转换为对数

In [28]:
# 一阶差分的逆向转换为累计和predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()# 对数函数的逆向转换为指数函数predictions_ARIMA_log = pd.Series(ts_week_log.iloc [0], index=ts_week_log.index)predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)

比较拟合的序列和原序列的均根差, 即为拟合错误

In [29]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)plt.plot(ts_week)plt.plot(predictions_ARIMA)plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts_week)**2)/len(ts_week)))
Out[29]:
<matplotlib.text.Text at 0x23e7bace5c0>

预测并校验模型

这种校验方法称为 one-step out-of-sample forecast
out-of-sample 是指预测时所用样本/观测值不在训练集内

将原序列分为训练和测试集
将训练集设为历史, 此时时间为t
迭代, 迭代次数为测试集的大小:
1. 在历史上拟合ARIMA模型, 参数为先前得到的p,d,q
2. 生成t+1的预测, 并将其加入历史

比较预测值和真实值

In [30]:
from sklearn.metrics import mean_squared_error# 我们的ARIMA模型针对的是序列ts_week_log# 因此使用的数据为ts_week_log# 因此须将预测做指数转换# 使用 n_test_obs 个点校验n_test_obs = 50size = int(len(ts_week_log) - n_test_obs)# 切分数据集train, test = ts_week_log[:size], ts_week_log[size:len(ts_week_log)]# 历史值history = [x for x in train]# 预测值predictions = list()# 置信区间confidence_intervals = list()print('Predicted vs Expected Values...')print('\n')for t in range(len(test)):    model = ARIMA(history, order=(1,1,1))    model_fit = model.fit()    output = model_fit.forecast(steps=1)    yhat = output[0]    predictions.append(float(yhat))    confidence_intervals.append(output[2])    obs = test[t]    history.append(obs)        # 须将预测做指数转换    print('predicted=%f, expected=%f' % (np.exp(yhat), np.exp(obs)))error = mean_squared_error(test, predictions)print('\n')predictions_series = pd.Series(predictions, index = test.index)print('Test MSE: %.6f' % error)
Predicted vs Expected Values...predicted=0.886215, expected=0.902880predicted=0.907080, expected=0.902180predicted=0.901661, expected=0.902660predicted=0.902891, expected=0.908120predicted=0.909561, expected=0.905840predicted=0.905181, expected=0.896540predicted=0.894294, expected=0.897600predicted=0.898145, expected=0.886780predicted=0.884036, expected=0.885600predicted=0.885608, expected=0.896075predicted=0.898827, expected=0.891100predicted=0.889625, expected=0.891240predicted=0.891471, expected=0.893720predicted=0.894386, expected=0.890820predicted=0.890071, expected=0.894120predicted=0.895091, expected=0.904860predicted=0.907586, expected=0.913620predicted=0.915668, expected=0.917360predicted=0.918185, expected=0.903520predicted=0.899997, expected=0.913160predicted=0.916041, expected=0.936860predicted=0.942785, expected=0.945060predicted=0.946632, expected=0.941900predicted=0.941006, expected=0.936860predicted=0.935742, expected=0.948440predicted=0.951627, expected=0.958640predicted=0.961011, expected=0.954200predicted=0.952895, expected=0.951600predicted=0.951158, expected=0.945180predicted=0.943669, expected=0.938440predicted=0.936958, expected=0.932520predicted=0.931239, expected=0.928520predicted=0.927704, expected=0.935940predicted=0.938014, expected=0.942780predicted=0.944393, expected=0.945760predicted=0.946430, expected=0.945500predicted=0.945439, expected=0.943900predicted=0.943574, expected=0.935900predicted=0.933967, expected=0.926700predicted=0.924628, expected=0.927140predicted=0.927547, expected=0.938880predicted=0.941948, expected=0.941850predicted=0.942353, expected=0.933150predicted=0.930947, expected=0.918800predicted=0.915453, expected=0.914425predicted=0.913740, expected=0.917740predicted=0.918738, expected=0.900400predicted=0.895946, expected=0.891920predicted=0.890302, expected=0.890350predicted=0.890181, expected=0.889880Test MSE: 0.000073
In [31]:
predictions_series = pd.Series(predictions, index = test.index)fig, ax = plt.subplots(figsize=(10, 6))ax.set(title='Spot Exchange Rate, Euro into USD', xlabel='Date', ylabel='Euro into USD')ax.plot(ts_week[-n_test_obs:], 'o', label='observed')# 须将预测做指数转换ax.plot(np.exp(predictions_series), 'g', label='rolling one-step out-of-sample forecast')legend = ax.legend(loc='upper left')

可以对季节性建模的 SARIMA 模型

In [32]:
%matplotlib inlineimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport datetimefrom dateutil.relativedelta import relativedeltaimport seaborn as snsimport statsmodels.api as sm  from statsmodels.tsa.stattools import acf  from statsmodels.tsa.stattools import pacffrom statsmodels.tsa.seasonal import seasonal_decomposefrom pandas import DatetimeIndex

Portland Oregon 月平均乘公车人数 (/100) January 1960 至 June 1969, 样本数=114

In [33]:
# 读取CSVdf = pd.read_csv('portland-oregon-average-monthly-.csv')print(df.head(), '\nindex type:\n', type(df.index))# 按'%Y-%m'格式转换CSV的日期, 为了方便处理, 将时间段转为timestampdf['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')# 索引并resample为月indexed_df = df.set_index('Month')ts = indexed_df['riders']ts = ts.resample('M').sum()print(ts.head(), '\nindex type:\n', type(ts.index))
     Month  riders0  1960-01     6481  1960-02     6462  1960-03     6393  1960-04     6544  1960-05     630 index type: <class 'pandas.core.indexes.range.RangeIndex'>Month1960-01-31    6481960-02-29    6461960-03-31    6391960-04-30    6541960-05-31    630Freq: M, Name: riders, dtype: int64 index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

可以看出数据有很明显的上升趋势和季节性(每12个月)

In [34]:
ts.plot(title='Monthly Num. of Ridership')test_stationarity(ts)decomposition = seasonal_decompose(ts)  fig = plt.figure()  fig = decomposition.plot()  fig.set_size_inches(10, 6)acf_pacf(decomposition.resid, title='Residuals')
Results of Augmented Dickey-Fuller Test:Test Statistic            -1.536597p-value                    0.515336Num Lags Used             12.000000Num Observations Used    101.000000Critical Value (1%)       -3.496818Critical Value (10%)      -2.582277Critical Value (5%)       -2.890611dtype: float64
<matplotlib.figure.Figure at 0x23e00b5e128>

平稳化序列

做季节差分, shift参数为12, 即12个时间窗口, 12个月

In [35]:
ts_sdiff = ts - ts.shift(12)test_stationarity(ts_sdiff.dropna())decomposition = seasonal_decompose(ts_sdiff.dropna(), freq=12)  fig = plt.figure()  fig = decomposition.plot()  fig.set_size_inches(10, 6)acf_pacf(decomposition.resid, title='Residuals')
Results of Augmented Dickey-Fuller Test:Test Statistic           -2.469741p-value                   0.123011Num Lags Used             3.000000Num Observations Used    98.000000Critical Value (1%)      -3.498910Critical Value (10%)     -2.582760Critical Value (5%)      -2.891516dtype: float64
<matplotlib.figure.Figure at 0x23e009d1ef0>

用差分去除季节差分后的趋势, 即季节差分的第一差分
注意第一差分后趋势的数量级
得到平稳的时间序列

In [36]:
ts_diff_of_sdiff = ts_sdiff - ts_sdiff.shift(1)test_stationarity(ts_diff_of_sdiff.dropna())decomposition = seasonal_decompose(ts_diff_of_sdiff.dropna(), freq=12)  fig = plt.figure()  fig = decomposition.plot()  fig.set_size_inches(10, 6)acf_pacf(decomposition.resid, title='Residuals')
Results of Augmented Dickey-Fuller Test:Test Statistic          -9.258520e+00p-value                  1.427874e-15Num Lags Used            0.000000e+00Num Observations Used    1.000000e+02Critical Value (1%)     -3.497501e+00Critical Value (10%)    -2.582435e+00Critical Value (5%)     -2.890906e+00dtype: float64
<matplotlib.figure.Figure at 0x23e04b76748>

指数转换不能使序列更平稳

In [37]:
ts_log = np.log(ts)ts_log_sdiff = ts_log - ts_log.shift(12)ts_diff_of_log_sdiff = ts_log_sdiff - ts_log_sdiff.shift(1)test_stationarity(ts_diff_of_log_sdiff.dropna())decomposition = seasonal_decompose(ts_diff_of_log_sdiff.dropna(), freq=12)  fig = plt.figure()  fig = decomposition.plot()  fig.set_size_inches(10, 6)acf_pacf(decomposition.resid, title='Residuals')
Results of Augmented Dickey-Fuller Test:Test Statistic          -8.882112e+00p-value                  1.309452e-14Num Lags Used            0.000000e+00Num Observations Used    1.000000e+02Critical Value (1%)     -3.497501e+00Critical Value (10%)    -2.582435e+00Critical Value (5%)     -2.890906e+00dtype: float64
<matplotlib.figure.Figure at 0x23e02c95358>

SARIMA的参数 p, d, q, P, D, Q

已经得知 d=1, D=1 得到平稳的时间序列后需要决定p和q

p: 如果PACF(lag<=k)显著且PACF(lag>k)均不显著, 考虑将p设为k
q: 如果ACF(lag<=k)显著且ACF(lag>k)均不显著, 考虑将q设为k

p: 如果差分后序列的PACF迅速衰减到0而ACF在lag较高处仍有显著的较强的相关, 和/或 ACF(lag=1) > 0, 考虑增加p
q: 如果差分后序列的ACF迅速衰减到0, 和/或 ACF(lag=1) < 0, 考虑增加q

P: 如果季节差分后序列的ACF(lag=季节周期)显著为正, 考虑增加P
Q: 如果季节差分后序列的ACF(lag=季节周期)显著为负, 考虑增加Q

绘出拟合后的ARIMA的拟合残差的ACF和PACF, 按上述规则调整p, q, P, Q
通常P, Q最多为1

这里考虑p=0, q=0
季节周期=12, 季节差分后的ACF(12)显著为负, 可以考虑P=0, Q=1

In [38]:
acf_pacf(ts_diff_of_sdiff, title='ts_diff_of_sdiff')

拟合SARIMA

In [39]:
from statsmodels.tsa.statespace.sarimax import SARIMAX# SARIMA的参数: order = p, d, q, seasonal_order=P, D, Q, season的周期=12model = SARIMAX(ts, order=(0,1,0),seasonal_order=(0,1,1,12))# 已拟合的模型fitted = model.fit()  # ARIMA的结果print(fitted.summary())# 比较拟合的数据和原始数据,ts.plot()fitted.fittedvalues.plot(color='red')# 拟合的数据和原始数据的残差residuals = pd.DataFrame(fitted.resid)acf_pacf(residuals, title='Residuals', figsize=(10,4))# 计算拟残差平方和(RSS)plt.title('RSS: %.4f'% sum((fitted.fittedvalues-ts)**2))# 残差的核密度估计residuals.plot(kind='kde')print('\n\nResiduals Summary:\n', residuals.describe())plt.show()
                                 Statespace Model Results                                 ==========================================================================================Dep. Variable:                             riders   No. Observations:                  114Model:             SARIMAX(0, 1, 0)x(0, 1, 1, 12)   Log Likelihood                -504.683Date:                            Tue, 20 Jun 2017   AIC                           1013.365Time:                                    05:37:32   BIC                           1018.838Sample:                                01-31-1960   HQIC                          1015.586                                     - 06-30-1969                                         Covariance Type:                              opg                                         ==============================================================================                 coef    std err          z      P>|z|      [0.025      0.975]------------------------------------------------------------------------------ma.S.L12      -0.6937      0.118     -5.867      0.000      -0.925      -0.462sigma2      1185.5644    183.539      6.459      0.000     825.834    1545.295===================================================================================Ljung-Box (Q):                       43.16   Jarque-Bera (JB):                 2.01Prob(Q):                              0.34   Prob(JB):                         0.37Heteroskedasticity (H):               1.49   Skew:                             0.28Prob(H) (two-sided):                  0.25   Kurtosis:                         3.39===================================================================================Warnings:[1] Covariance matrix calculated using the outer product of gradients (complex-step).Residuals Summary:                 0count  114.000000mean     0.920410std     72.659634min   -214.01800125%    -22.89838450%     -5.57603175%     18.514757max    648.000000

预测并校验模型

这种校验方法称为 one-step out-of-sample forecast
out-of-sample 是指预测时所用样本/观测值不在训练集内

将原序列分为训练和测试集
将训练集设为历史, 此时时间为t
迭代, 迭代次数为测试集的大小:
1. 在历史上拟合SARIMA模型, 参数为先前得到的p,d,q,P,D,Q
2. 生成t+1的预测, 并将其加入历史

比较预测值和真实值

In [40]:
from sklearn.metrics import mean_squared_error# 我们的ARIMA模型针对的是序列ts_week_log# 因此使用的数据为ts_week_log# 因此须将预测做指数转换# 使用 n_test_obs 个点校验n_test_obs = 20size = int(len(ts) - n_test_obs)# 切分数据集train, test = ts[:size], ts[size:len(ts)]# 历史值history = [x for x in train]# 预测值predictions = list()# 置信区间confidence_intervals = list()print('Predicted vs Expected Values...')print('\n')for t in range(len(test)):    model = SARIMAX(history, order=(0,1,0),seasonal_order=(0,1,1,12))    model_fit = model.fit()    output = model_fit.forecast(steps=1)    yhat = output[0]    predictions.append(float(yhat))    obs = test[t]    history.append(obs)        print('predicted=%f, expected=%f' % (yhat, obs))error = mean_squared_error(test, predictions)print('\n')predictions_series = pd.Series(predictions, index = test.index)print('Test MSE: %.6f' % error)predictions_series = pd.Series(predictions, index = test.index)fig, ax = plt.subplots(figsize=(10, 6))ax.set(title='Bus Riders', xlabel='Month', ylabel='100 Bus Riders')ax.plot(ts[-n_test_obs:], 'o', label='observed')ax.plot(predictions_series, 'g', label='rolling one-step out-of-sample forecast')legend = ax.legend(loc='upper left')
Predicted vs Expected Values...predicted=1475.509981, expected=1424.000000predicted=1367.194981, expected=1360.000000predicted=1441.147682, expected=1429.000000predicted=1461.899309, expected=1440.000000predicted=1406.502132, expected=1414.000000predicted=1438.593286, expected=1424.000000predicted=1396.225163, expected=1408.000000predicted=1386.002350, expected=1337.000000predicted=1299.644308, expected=1258.000000predicted=1214.892724, expected=1214.000000predicted=1324.750929, expected=1326.000000predicted=1353.201730, expected=1417.000000predicted=1424.747027, expected=1417.000000predicted=1357.550530, expected=1329.000000predicted=1407.819886, expected=1461.000000predicted=1483.069065, expected=1425.000000predicted=1392.212819, expected=1419.000000predicted=1441.571043, expected=1432.000000predicted=1408.266795, expected=1394.000000predicted=1357.971777, expected=1327.000000Test MSE: 1049.765808
In [ ]:
 

原创粉丝点击