Introduction

Time series differ from ordinary data series because they usually contain periodic patterns (weekly, yearly, …). Finding these patterns requires different types of analysis: instead of assuming that the order of the observations does not matter, we assume that it does matter, so older observations help predict new ones.

Time series \(X_t\) usually have two or three components: a trend component \(T_t\), possibly a seasonal component \(S_t\), and an irregular (random) component \(e_t\).

A time series with a seasonal component is called seasonal data (eg, sales per month). Some data series do not have a seasonal component (eg, a population's mortality average).
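
As a toy illustration (artificial data, invented purely for this example), we can build a monthly series by adding an explicit trend, a seasonal pattern and random noise:

set.seed(101)                                # arbitrary seed, for reproducibility only
n            <- 72
toy.trend    <- 0.05 * (1:n)                 # T_t: a slow upward drift
toy.seasonal <- 2 * sin(2*pi*(1:n)/12)       # S_t: a yearly cycle
toy.noise    <- rnorm(n, sd=0.5)             # e_t: the irregular component
toy <- toy.trend + toy.seasonal + toy.noise
plot(toy, type="l", main="toy series = trend + seasonal + irregular")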

R has a time series class named ts:

my.data <- round( sin( (0:47 %% 12)+1 ) + runif(48)  ,1) # some periodic data
# make a monthly time series from January 2009 to December 2014 (72 points; the 48 values are recycled)
my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12)
my.ts
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2009  1.6  1.0  0.2  0.0 -0.2  0.1  0.9  1.4  0.8 -0.4 -0.8  0.1
## 2010  1.8  1.2  0.3 -0.1 -0.2 -0.3  1.5  1.9  1.2  0.1 -0.6 -0.5
## 2011  1.3  1.1  0.7  0.2  0.0  0.7  1.3  1.7  0.6 -0.5 -0.2  0.1
## 2012  1.1  1.0  0.5  0.1 -0.9  0.6  1.1  2.0  0.5  0.4 -0.8 -0.2
## 2013  1.6  1.0  0.2  0.0 -0.2  0.1  0.9  1.4  0.8 -0.4 -0.8  0.1
## 2014  1.8  1.2  0.3 -0.1 -0.2 -0.3  1.5  1.9  1.2  0.1 -0.6 -0.5
plot(my.ts)

ts(my.data, end=c(2009,1), frequency=4) # ts is smart and counts back in time if we only give the 'end' parameter
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1997       1.6  1.0  0.2
## 1998  0.0 -0.2  0.1  0.9
## 1999  1.4  0.8 -0.4 -0.8
## 2000  0.1  1.8  1.2  0.3
## 2001 -0.1 -0.2 -0.3  1.5
## 2002  1.9  1.2  0.1 -0.6
## 2003 -0.5  1.3  1.1  0.7
## 2004  0.2  0.0  0.7  1.3
## 2005  1.7  0.6 -0.5 -0.2
## 2006  0.1  1.1  1.0  0.5
## 2007  0.1 -0.9  0.6  1.1
## 2008  2.0  0.5  0.4 -0.8
## 2009 -0.2

window extracts a subset of a time series:

plot( window(my.ts, start=c(2014, 6), end=c(2014, 12)) )

Other operations:

time(my.ts)                # creates the vector of times at which a time series was sampled.
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2009 2009.000 2009.083 2009.167 2009.250 2009.333 2009.417 2009.500
## 2010 2010.000 2010.083 2010.167 2010.250 2010.333 2010.417 2010.500
## 2011 2011.000 2011.083 2011.167 2011.250 2011.333 2011.417 2011.500
## 2012 2012.000 2012.083 2012.167 2012.250 2012.333 2012.417 2012.500
## 2013 2013.000 2013.083 2013.167 2013.250 2013.333 2013.417 2013.500
## 2014 2014.000 2014.083 2014.167 2014.250 2014.333 2014.417 2014.500
##           Aug      Sep      Oct      Nov      Dec
## 2009 2009.583 2009.667 2009.750 2009.833 2009.917
## 2010 2010.583 2010.667 2010.750 2010.833 2010.917
## 2011 2011.583 2011.667 2011.750 2011.833 2011.917
## 2012 2012.583 2012.667 2012.750 2012.833 2012.917
## 2013 2013.583 2013.667 2013.750 2013.833 2013.917
## 2014 2014.583 2014.667 2014.750 2014.833 2014.917
cycle(my.ts)               # gives the positions in the cycle of each observation.
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009   1   2   3   4   5   6   7   8   9  10  11  12
## 2010   1   2   3   4   5   6   7   8   9  10  11  12
## 2011   1   2   3   4   5   6   7   8   9  10  11  12
## 2012   1   2   3   4   5   6   7   8   9  10  11  12
## 2013   1   2   3   4   5   6   7   8   9  10  11  12
## 2014   1   2   3   4   5   6   7   8   9  10  11  12
ts1 <- lag(my.ts, k=12)    # lagged version of time series, shifted back k observations
plot(my.ts, lwd=2, main="comparison with next year")
points(ts1, type="l", col="red")

ds <- diff(my.ts, differences=1)  # difference the time series (here: first differences)
plot( ds )               

Preliminary Analysis of Time Series

tui <- read.csv("tui.csv", header=T, dec=",", sep=";")
head(tui)
##       date  open  high   low close volume
## 1 20020514 29.01 29.29 28.70 29.00 422606
## 2 20020513 28.81 29.15 28.71 28.93 265548
## 3 20020510 29.30 29.49 28.38 28.69 497579
## 4 20020509 29.26 29.78 28.82 29.01 342824
## 5 20020508 28.50 29.80 28.41 29.70 559890
## 6 20020507 29.50 29.65 27.85 28.21 716809
plot(tui[,5], type="l",lwd=2, col="red", xlab="time", ylab="closing values", main="Stock data of TUI AG", ylim=c(0,60) )

hist(diff(tui[,5]),prob=T,ylim=c(0,0.65),xlim=c(-4.5,4.5),col="grey", breaks=20, main="Histogram of the differences")
lines(density(diff(tui[,5])),lwd=2)
points(seq(-4.5,4.5,len=100),dnorm(seq(-4.5,4.5,len=100), mean(diff(tui[,5])), sd(diff(tui[,5]))), col="red", type="l",lwd=2)

The Kolmogorov-Smirnov test checks whether a sample – in this case the differences of the logged closing values of the time series – follows a specific distribution:

ds <- diff(log(tui[,5]))
ks.test(ds, "pnorm", mean(ds), sd(ds))          # the very small p-value rejects normality
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  ds
## D = 0.098003, p-value = 2.487e-05
## alternative hypothesis: two-sided
qqnorm(diff(tui[,5])); qqline(diff(tui[,5]), col="red")  # another normality check, this time visual

shapiro.test(ds)  # another normality test; again the p-value is below 0.05, rejecting normality
## 
##  Shapiro-Wilk normality test
## 
## data:  ds
## W = 0.80962, p-value < 2.2e-16

Linear Filtering

A common method to extract the trend component \(T_t\) from time series \(X_t\) is to apply filters, \[T_t = \sum_{i=-\infty}^{+\infty} \lambda_iX_{t+i}\]

A common choice of filter is the moving average: \[T_t = \frac{1}{2a+1} \sum_{i=-a}^a X_{t+i}\]

In R this is done with filter:

a <- 20; tui.ma1 <- filter(tui[,5], filter=rep(1/a,a)) # check also package TTR about functions SMA et al.
a <- 50; tui.ma2 <- filter(tui[,5], filter=rep(1/a,a)) 
ts.plot(tui[,5], tui.ma1, tui.ma2, col=1:3, lwd=c(1,2,2))

Function stl performs a seasonal decomposition of \(X_t\): it determines the trend \(T_t\) using loess regression (a local regression, roughly linear regression ‘plus’ k-nearest-neighbours), and then calculates the seasonal component \(S_t\) and the residuals \(e_t\) from the differences \(X_t-T_t\).

An example:

beer <- read.csv("beer.csv", header=T, dec=",", sep=";")
beer <- ts(beer[,1],start=1956,freq=12)
plot(stl(log(beer), s.window="periodic"))
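
As a quick sanity check (a sketch), the three components returned by stl add back up to the logged series exactly:

beer.stl <- stl(log(beer), s.window="periodic")
# trend + seasonal + remainder reproduces log(beer)
all.equal(as.numeric(log(beer)), as.numeric(rowSums(beer.stl$time.series))) # should be TRUE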

Using Linear Regression

Let’s say we want to fit the beer time series with the following model:

\[\log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t\]

logbeer <- log(beer)

t  <- seq(1956, 1995.2, len=length(beer))
t2 <- t^2
model <- lm(logbeer ~ t + t2)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)

But we are not considering the seasonal component. So let’s improve the model by adding the first Fourier harmonics, \(\cos(2\pi t/P)\) and \(\sin(2\pi t/P)\), where the period \(P\) is \(12\) given that the data is monthly:

\[\log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \beta \cos\left(\frac{2\pi t}{12}\right) + \gamma \sin\left(\frac{2\pi t}{12}\right) + e_t\]

cos.t <- cos(2*pi*t) # t is measured in years, so cos(2*pi*t) already has the 12-month period built in
sin.t <- sin(2*pi*t)
model <- lm(logbeer ~ t + t2 + cos.t + sin.t)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)

We can check which coefficients are significant:

summary(model)
## 
## Call:
## lm(formula = logbeer ~ t + t2 + cos.t + sin.t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33191 -0.08655 -0.00314  0.08177  0.34517 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.833e+03  1.841e+02 -20.815   <2e-16 ***
## t            3.868e+00  1.864e-01  20.751   <2e-16 ***
## t2          -9.748e-04  4.718e-05 -20.660   <2e-16 ***
## cos.t       -1.246e-02  7.669e-03  -1.624    0.105    
## sin.t       -1.078e-01  7.679e-03 -14.036   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1184 on 471 degrees of freedom
## Multiple R-squared:  0.8017, Adjusted R-squared:    0.8 
## F-statistic: 476.1 on 4 and 471 DF,  p-value: < 2.2e-16

The only coefficient that does not seem to differ significantly from zero is the cosine component.
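
As a quick follow-up (a sketch, not part of the original analysis), we could drop the cosine term and refit; the remaining coefficients should stay essentially unchanged:

model2 <- update(model, . ~ . - cos.t) # the same regression without the cosine harmonic
summary(model2)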

Exponential Smoothing

One way to estimate the next value of a time series \(x_t\) is \[\hat{x}_t = \lambda_1 x_{t-1} + \lambda_2 x_{t-2} + \ldots\]

It seems reasonable to weight recent observations more heavily than older ones. One possibility is to use geometric weights: \[\lambda_i = \alpha(1-\alpha)^{i-1};~~ 0 < \alpha < 1\]

This is called exponential smoothing and, in its basic form, should be used with time series that have no trend or seasonal components. The method has, however, been extended to Holt-Winters smoothing, which accepts time series with those components.
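
To make the geometric weighting concrete, here is a minimal hand-rolled sketch of basic exponential smoothing (the function name ses.sketch and the value alpha=0.3 are arbitrary choices for illustration). Unrolling the recursion shows that the smoothed level is a geometrically weighted average of the past observations, and the one-step forecast is just that level:

ses.sketch <- function(x, alpha) {
  s <- numeric(length(x))
  s[1] <- x[1]                              # initialise the level with the first observation
  for (i in 2:length(x))
    s[i] <- alpha*x[i] + (1-alpha)*s[i-1]   # new level = alpha * new value + (1-alpha) * old level
  s
}
smoothed <- ses.sketch(as.numeric(tui[,5]), alpha=0.3)
plot(tui[,5], type="l")
lines(smoothed, col="blue", lwd=2)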

Holt-Winters smoothing has three parameters, \(\alpha, \beta, \gamma\). In R’s HoltWinters, setting gamma=FALSE makes the function assume there is no seasonal component.

model <- HoltWinters(beer)
plot(beer)
lines(model$fitted[,"xhat"], col="red")

To predict the next values in the time series:

pred <- predict(model, n.ahead=24) # values for the next 24 months
plot(beer, xlim=c(1956,1997))
lines(pred, col="red", lty=2)

An example with the tui time series, which does not seem to have a seasonal component:

model <- HoltWinters(tui[,5], gamma=FALSE)
plot(tui[,5], type="l",xlim=c(0,length(tui[,5])+36))
lines(model$fitted[,"xhat"], col="red")
pred <- predict(model, n.ahead=36)
lines(pred, col="red", lty=2)

Function forecast adds prediction intervals to the forecast:

library(forecast)
plot(forecast(model, h=40, level=c(75,95))) # the next 40 data points using 75% and 95% confidence intervals

There is also function ets that returns an exponential smoothing model:

fit <- ets(my.ts) # Automated forecasting using an exponential model
plot(forecast(fit, level=c(75,95))) # using 75% and 95% confidence intervals

Detrending

It’s possible to remove the trend component using decompose or stl:

report <- decompose(beer, type="multiplicative")
plot(beer / report$trend, main="signal without trend component") # multiplicative decomposition, so we divide by the trend

plot(report$seasonal, main="signal without trend and irregular components")

report <- stl(beer, s.window="periodic")
plot(beer - report$time.series[,"trend"], type="l", main="signal without trend component")

plot(report$time.series[,"seasonal"], type="l", main="signal without trend and irregular components")

Those functions only work when there is a detectable seasonal signal. It’s still possible to detrend using the residuals of a linear regression on time:

idxs  <- 1:nrow(tui)
trend <- lm(tui[,"close"] ~ idxs)
plot(tui[,5], type="l")
abline(trend, col="red", lwd=2)

detrended <- trend$residuals
plot(detrended, type="l")

We can also use Fourier Analysis to find some pattern in the seasonal (plus irregular) component:

library(GeneCycle)

f.data <- GeneCycle::periodogram(detrended)
harmonics <- 1:30   # find patterns up to one month
plot(f.data$freq[harmonics]*length(detrended), 
     f.data$spec[harmonics]/sum(f.data$spec), 
     xlab="Harmonics (Hz)", ylab="Amplitute Density", type="h")

Stationary Time Series

A stationary time series is one whose statistical properties, like mean, variance and autocorrelation, do not depend on time. This makes such series easier to predict, since the important statistical properties remain constant. Stationarity is usually required to build a time series model (like the ARIMA models we’ll see next).

There are techniques, like detrending, that can transform a time series into an approximately stationary one (which might not work well, of course). Then, after making predictions, the ‘stationarized’ series can be transformed back to the scale of the original series (see the diffinv sketch below).

Even after detrending (which removes the trend component), there might be some seasonal component preventing the time series from being stationary. In these cases, perhaps the statistics of the changes between seasons are constant. If that is the case, the time series is called difference-stationary, and we can employ another technique: differencing.

The first difference is given by \(X_n - X_{n-1}\); applying the operation repeatedly gives higher-order differences, which are computed by diff (argument differences).

d1 <- diff(detrended, diff=1) # first difference
plot(d1, type="l")

d2 <- diff(detrended, diff=2) # second difference
plot(d2, type="l")

A stationary time series can be modeled as an autoregressive model of order \(p\):

\[X_n = a_1 X_{n-1} + a_2 X_{n-2} + \cdots + a_p X_{n-p} + \epsilon_n\]

where \(\epsilon_n\) are iid with mean zero and constant variance.
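
A minimal sketch: simulate an AR(2) process with arbitrary coefficients (0.6 and -0.3, chosen only for illustration) and check that ar recovers them approximately:

set.seed(123)                                           # arbitrary seed
sim.ar <- arima.sim(model=list(ar=c(0.6, -0.3)), n=500) # a stationary AR(2) process
ar(sim.ar, order.max=2, aic=FALSE)$ar                   # estimates should be close to 0.6 and -0.3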

If the autoregressive process has a unit root then the time series is non-stationary: its moments depend on \(t\).

The (augmented) Dickey-Fuller test is a frequentist test of the null hypothesis that the series has a unit root:

library(fpp)
## Loading required package: fma
## Loading required package: tseries
## 
## Attaching package: 'fma'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     beer
## 
## The following objects are masked from 'package:MASS':
## 
##     cement, housing, petrol
## 
## Loading required package: expsmooth
## Loading required package: lmtest
# a non-stationary time series: non-constant mean and variance
rdata <- sapply(seq(.1,1,len=250), function(x) rnorm(1,10*x,2*x))
plot(rdata,type="l")

adf.test(rdata, alternative="stationary") # small p-values suggest data is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  rdata
## Dickey-Fuller = -6.5207, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

However, in the previous case, it does not seem to work (I must be doing it wrong).

ARIMA models

ARIMA means Autoregressive integrated moving average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can make a better predictive model by taking correlations in the data into account. ARIMA models include an explicit statistical model for the irregular component of a time series, that allows for non-zero autocorrelations in the irregular component.

Building an ARIMA model consists of three stages:

  1. Model Identification

  2. Parameter Estimation

  3. Diagnostic Checking

These stages are repeated until a ‘suitable’ model is identified.

ARIMA models have three parameters, ARIMA(\(p\),\(d\),\(q\)): parameter \(p\) concerns the AR part (the autoregression order), \(d\) the I part (the degree of differencing), and \(q\) the MA part (the moving-average order).
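
To connect the three parameters with actual data, arima.sim can also simulate an integrated process. A sketch with an arbitrary MA coefficient of -0.7 (the \(d=1\) makes the simulated series wander instead of reverting to a fixed mean):

set.seed(42)                                                       # arbitrary seed
sim.arima <- arima.sim(model=list(order=c(0,1,1), ma=-0.7), n=200) # an ARIMA(0,1,1) process
plot(sim.arima, main="simulated ARIMA(0,1,1)")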

For this example the data are the ages at death of 42 successive kings of England:

kings <- scan("kings.dat",skip=3)
plot(kings, type="l", main="Age of death of 42 successive kings of England")

ARIMA models are defined for stationary time series, ie, the statistical properties – mean and variance in our case – do not change over time. If you start off with a non-stationary time series, you will first need to ‘difference’ the time series until you obtain a stationary time series. If you have to difference the time series d times to obtain a stationary series, then you have an ARIMA(p,d,q) model, where d is the order of differencing used.
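
The forecast package (loaded earlier) also provides ndiffs, which estimates how many differences are needed to reach stationarity; a quick cross-check, whose answer may well differ from the manual choice made below:

library(forecast)
ndiffs(kings)   # estimated order of differencing, based on a unit-root test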

Box.test (the Box-Pierce test) checks whether there is significant evidence for non-zero autocorrelations at lags 1 up to the given lag. Small p-values (i.e., less than 0.05) indicate significant autocorrelation, i.e., the differenced series is not just white noise and still has structure that an ARMA model can capture.

kings.dif <- diff(kings, d=1)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 11.32, df = 10, p-value = 0.3332
kings.dif <- diff(kings, d=2)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 16.387, df = 10, p-value = 0.08907
kings.dif <- diff(kings, d=3)
Box.test(kings.dif, lag=10)
## 
##  Box-Pierce test
## 
## data:  kings.dif
## X-squared = 21.914, df = 10, p-value = 0.01555
plot(kings.dif, type="l")

# other tests
library(fpp)
adf.test(kings.dif, alternative="stationary") # small p-values suggest the data is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  kings.dif
## Dickey-Fuller = -8.3516, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(kings.dif) # small p-values suggest that the series is not stationary and a differencing is required
## 
##  KPSS Test for Level Stationarity
## 
## data:  kings.dif
## KPSS Level = 0.035441, Truncation lag parameter = 1, p-value = 0.1

So, in this case we are going to use \(d=3\), which means ARIMA(\(p\),3,\(q\)) models.

The next step is to find appropriate values for \(p\) and \(q\). For this we start by plotting a correlogram and a partial correlogram of the time series, which is done by acf and pacf. To get the values themselves, use the parameter plot=FALSE:

acf(kings.dif,  lag.max=20)

acf(kings.dif,  lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'kings.dif', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000 -0.656  0.186 -0.106  0.163 -0.034 -0.136  0.128 -0.020  0.025 
##     10     11     12     13     14     15     16     17     18     19 
## -0.147  0.187 -0.050 -0.098  0.071 -0.010  0.085 -0.160  0.082  0.024 
##     20 
## -0.015
pacf(kings.dif, lag.max=20)

pacf(kings.dif, lag.max=20, plot=FALSE)
## 
## Partial autocorrelations of series 'kings.dif', by lag
## 
##      1      2      3      4      5      6      7      8      9     10 
## -0.656 -0.430 -0.460 -0.270  0.063 -0.015 -0.005  0.030  0.104 -0.097 
##     11     12     13     14     15     16     17     18     19     20 
## -0.027  0.115 -0.045 -0.081 -0.132 -0.004  0.023 -0.008 -0.011 -0.029

Only the correlation at lag \(1\) exceeds the significance bounds. The partial correlogram shows that the partial autocorrelations at lags 1, 2 and 3 exceed the significance bounds, are negative, and are slowly decreasing in magnitude with increasing lag.

Since the correlogram is zero after lag 1, and the partial correlogram tails off to zero after lag 3, the following ARMA (autoregressive moving average) models are possible for the time series of third differences:

  1. an ARMA(3,0) model, ie, an autoregressive model of order \(p=3\), since the partial autocorrelogram cuts off after lag 3

  2. an ARMA(0,1) model, ie, a moving-average model of order \(q=1\), since the autocorrelogram cuts off after lag 1

  3. a mixed ARMA(\(p\),\(q\)) model, since both correlograms eventually tail off to zero

We use the principle of parsimony to decide which model is best: that is, we assume that the model with the fewest parameters is best. The ARMA(3,0) model has 3 parameters, the ARMA(0,1) model has 1 parameter, and the ARMA(p,q) model has at least 2 parameters. Therefore, the ARMA(0,1) model is taken as the best model.
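
Besides parsimony, we could compare the candidate models directly by their AIC on the third differences (a sketch; lower AIC is better):

AIC(arima(kings.dif, order=c(3,0,0)))  # the ARMA(3,0) candidate
AIC(arima(kings.dif, order=c(0,0,1)))  # the ARMA(0,1) candidate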

Finally, the chosen model will be an ARIMA(0,3,1).

There is an R function (surprise!) that finds an appropriate ARIMA model:

library(forecast)
report <- auto.arima(kings) # the parameter ic="bic" penalises the number of parameters
report
## Series: kings 
## ARIMA(0,1,1)                    
## 
## Coefficients:
##           ma1
##       -0.7218
## s.e.   0.1208
## 
## sigma^2 estimated as 230.4:  log likelihood=-170.06
## AIC=344.13   AICc=344.44   BIC=347.56

Knowing the parameters we use arima to fit this model to the original time series:

kings.arima <- arima(kings, order=c(0,1,1))
kings.arima
## 
## Call:
## arima(x = kings, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.7218
## s.e.   0.1208
## 
## sigma^2 estimated as 230.4:  log likelihood = -170.06,  aic = 344.13
kings.forecast <- forecast(kings.arima, h=5, level=c(.75, .90, .99))  # prediction-interval levels
kings.forecast
##    Point Forecast    Lo 75    Hi 75    Lo 90    Hi 90    Lo 99    Hi 99
## 43       67.75063 50.28814 85.21312 42.78149 92.71977 28.64913 106.8521
## 44       67.75063 49.62481 85.87645 41.83301 93.66825 27.16382 108.3374
## 45       67.75063 48.98491 86.51634 40.91804 94.58322 25.73098 109.7703
## 46       67.75063 48.36613 87.13513 40.03325 95.46801 24.34541 111.1558
## 47       67.75063 47.76649 87.73477 39.17585 96.32541 23.00272 112.4985
plot(kings.forecast)

It is a good idea to investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors.

acf(kings.forecast$residuals, lag=20) # seems ok

Box.test(kings.forecast$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  kings.forecast$residuals
## X-squared = 13.584, df = 20, p-value = 0.8509

Since the p-value for the Ljung-Box test is 0.85, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.

# investigate whether the forecast errors are normally distributed with mean zero and constant variance
plotForecastErrors <- function(forecasterrors)
  {
     # make a histogram of the forecast errors:
     mybinsize <- IQR(forecasterrors)/4
     mysd   <- sd(forecasterrors)
     mymin  <- min(forecasterrors) - mysd*5
     mymax  <- max(forecasterrors) + mysd*3
     # generate normally distributed data with mean 0 and standard deviation mysd
     mynorm <- rnorm(10000, mean=0, sd=mysd)
     mymin2 <- min(mynorm)
     mymax2 <- max(mynorm)
     if (mymin2 < mymin) { mymin <- mymin2 }
     if (mymax2 > mymax) { mymax <- mymax2 }
     # make a red histogram of the forecast errors, with the normally distributed data overlaid:
     mybins <- seq(mymin, mymax, mybinsize)
     hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
     # freq=FALSE ensures the area under the histogram = 1
     # generate normally distributed data with mean 0 and standard deviation mysd
     myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
     # plot the normal curve as a blue line on top of the histogram of forecast errors:
     points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
  }

plotForecastErrors(kings.forecast$residuals)

Autoregressive model

Find an autoregressive model for the time series, and predict future observations:

my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12) # using the sine data
ar.model <- ar(my.ts)       
pred <- predict(ar.model, n.ahead=12)
ts.plot(my.ts, pred$pred, lty=c(1:2), col=1:2)
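
ar chooses the order of the autoregression by AIC; we can inspect what it picked (a quick sketch):

ar.model$order   # the order p selected by AIC
ar.model$ar      # the fitted coefficients a_1 ... a_p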

In conclusion

The steps to make a time series analysis are:

  1. Visualize the time series to intuitively check for trends

  2. If needed, stationarize the time series using techniques like detrending and differencing

  3. Find appropriate parameters and build ARIMA models

  4. With the built model, make predictions of the future