Refs:

Introduction

Time series are different than usual dataseries because there usually contain periodic patterns (weekly, yearly…). To find these patterns it’s needed different types of analysis, since instead of assuming the sequence of observations does not matter, we are assuming that it matters, old observations help predict new ones.

Time series \(X_t\) usually have two or three components

A time series with seasonal component is called seasoned data (eg, sales per month). Some data series does not have a seasonal component (eg, a population mortality average).

R has a class time series named ts:

my.data <- round( sin( (0:47 %% 12)+1 ) + runif(48)  ,1) # some periodic data
# make a time series with 48 observations from January 2009 to December 2014
my.ts <- ts(my.data, start=c(2009, 1), end=c(2014, 12), frequency=12)
my.ts
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2009  1.6  1.0  0.2  0.0 -0.2  0.1  0.9  1.4  0.8 -0.4 -0.8  0.1
## 2010  1.8  1.2  0.3 -0.1 -0.2 -0.3  1.5  1.9  1.2  0.1 -0.6 -0.5
## 2011  1.3  1.1  0.7  0.2  0.0  0.7  1.3  1.7  0.6 -0.5 -0.2  0.1
## 2012  1.1  1.0  0.5  0.1 -0.9  0.6  1.1  2.0  0.5  0.4 -0.8 -0.2
## 2013  1.6  1.0  0.2  0.0 -0.2  0.1  0.9  1.4  0.8 -0.4 -0.8  0.1
## 2014  1.8  1.2  0.3 -0.1 -0.2 -0.3  1.5  1.9  1.2  0.1 -0.6 -0.5
plot(my.ts)

ts(my.data, end=c(2009,1), frequency=4) # ts is smart and go back in time if we just give the 'end' parameter
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1997       1.6  1.0  0.2
## 1998  0.0 -0.2  0.1  0.9
## 1999  1.4  0.8 -0.4 -0.8
## 2000  0.1  1.8  1.2  0.3
## 2001 -0.1 -0.2 -0.3  1.5
## 2002  1.9  1.2  0.1 -0.6
## 2003 -0.5  1.3  1.1  0.7
## 2004  0.2  0.0  0.7  1.3
## 2005  1.7  0.6 -0.5 -0.2
## 2006  0.1  1.1  1.0  0.5
## 2007  0.1 -0.9  0.6  1.1
## 2008  2.0  0.5  0.4 -0.8
## 2009 -0.2

window makes a subset of a timeseries:

plot( window(my.ts, start=c(2014, 6), end=c(2014, 12)) )

Other operations:

time(my.ts)                # creates the vector of times at which a time series was sampled.
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2009 2009.000 2009.083 2009.167 2009.250 2009.333 2009.417 2009.500
## 2010 2010.000 2010.083 2010.167 2010.250 2010.333 2010.417 2010.500
## 2011 2011.000 2011.083 2011.167 2011.250 2011.333 2011.417 2011.500
## 2012 2012.000 2012.083 2012.167 2012.250 2012.333 2012.417 2012.500
## 2013 2013.000 2013.083 2013.167 2013.250 2013.333 2013.417 2013.500
## 2014 2014.000 2014.083 2014.167 2014.250 2014.333 2014.417 2014.500
##           Aug      Sep      Oct      Nov      Dec
## 2009 2009.583 2009.667 2009.750 2009.833 2009.917
## 2010 2010.583 2010.667 2010.750 2010.833 2010.917
## 2011 2011.583 2011.667 2011.750 2011.833 2011.917
## 2012 2012.583 2012.667 2012.750 2012.833 2012.917
## 2013 2013.583 2013.667 2013.750 2013.833 2013.917
## 2014 2014.583 2014.667 2014.750 2014.833 2014.917
cycle(my.ts)               # gives the positions in the cycle of each observation.
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2009   1   2   3   4   5   6   7   8   9  10  11  12
## 2010   1   2   3   4   5   6   7   8   9  10  11  12
## 2011   1   2   3   4   5   6   7   8   9  10  11  12
## 2012   1   2   3   4   5   6   7   8   9  10  11  12
## 2013   1   2   3   4   5   6   7   8   9  10  11  12
## 2014   1   2   3   4   5   6   7   8   9  10  11  12
ts1 <- lag(my.ts, k=12)    # lagged version of time series, shifted back k observations
plot(my.ts, , lwd=2, main="comparisation with next year")
points(ts1, type="l", col="red")

ds <- diff(my.ts, d=1)     # difference vector the time series, d times
plot( ds )               

Preliminary Analysis of Time Series

tui <- read.csv("tui.csv", header=T, dec=",", sep=";")
head(tui)
##       date  open  high   low close volume
## 1 20020514 29.01 29.29 28.70 29.00 422606
## 2 20020513 28.81 29.15 28.71 28.93 265548
## 3 20020510 29.30 29.49 28.38 28.69 497579
## 4 20020509 29.26 29.78 28.82 29.01 342824
## 5 20020508 28.50 29.80 28.41 29.70 559890
## 6 20020507 29.50 29.65 27.85 28.21 716809
plot(tui[,5], type="l",lwd=2, col="red", xlab="time", ylab="closing values", main="Stock data of TUI AG", ylim=c(0,60) )

hist(diff(tui[,5]),prob=T,ylim=c(0,0.65),xlim=c(-4.5,4.5),col="grey", breaks=20, main=" histogram of the differences")
lines(density(diff(tui[,5])),lwd=2)
points(seq(-4.5,4.5,len=100),dnorm(seq(-4.5,4.5,len=100), mean(diff(tui[,5])), sd(diff(tui[,5]))), col="red", type="l",lwd=2)

The Kolgomorov-Smirnoff test checks is a sample – in this case the differences between consecutive values of the time series – follows a specific distribution:

ds <- diff(log(tui[,5]))
ks.test(ds, "pnorm", mean(ds), sd(ds))          # it seems so
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  ds
## D = 0.098003, p-value = 2.487e-05
## alternative hypothesis: two-sided
qqnorm(diff(tui[,5])); abline(0,1,col="red")  # another normality test, this time visual

shapiro.test(ds)  # test for normality (should fail for ds with p-value >= 0.05)
## 
##  Shapiro-Wilk normality test
## 
## data:  ds
## W = 0.80962, p-value < 2.2e-16

Linear Filtering

A common method to extract the trend component \(T_t\) from time series \(X_t\) is to apply filters, \[T_t = \sum_{i=-\infty}^{+\infty} \lambda_iX_{t+i}\]

A common method is moving averages: \[T_t = \frac{1}{2a+1} \sum_{-a}^a X_{t+i}\]

In R this is done with filter:

a <- 20; tui.ma1 <- filter(tui[,5], filter=rep(1/a,a)) # check also package TTR about functions SMA et al.
a <- 50; tui.ma2 <- filter(tui[,5], filter=rep(1/a,a)) 
ts.plot(tui[,5], tui.ma1, tui.ma2, col=1:3, lwd=c(1,2,2))

Function stl performs a seasonal decomposition of \(X_t\) by determining \(T_t\) using a loess regression (linear regression ‘plus’ k-nearest-neighbors), and then calculating the seasonal component \(S_t\) and residuals \(e_t\) from the differences \(X_t-T_t\).

An eg:

beer <- read.csv("beer.csv", header=T, dec=",", sep=";")
beer <- ts(beer[,1],start=1956,freq=12)
plot(stl(log(beer), s.window="periodic"))

Using Linear Regression

Let’s say we would want to fit the beer time series by the following model:

\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t\]

logbeer <- log(beer)

t  <- seq(1956, 1995.2, len=length(beer))
t2 <- t^2
model <- lm(logbeer ~ t + t2)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)

But we are not considering the seasonal component. So let’s improve the model adding the first Fourier harmonics, \(cos(2\pi t/P)\) and \(sin(2\pi t/P\) where \(P\) is \(12\) given the data is in months:

\[log(X_t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \beta \cos(\frac{2\pi}{12}) + \gamma \sin(\frac{2\pi}{12}) + e_t\]

cos.t <- cos(2*pi*t) # the period P=12 is already included in the time series
sin.t <- sin(2*pi*t)
model <- lm(logbeer ~ t + t2 + cos.t + sin.t)

plot(logbeer)
lines(t, model$fit, col="red", lwd=2)