Refs:

Markov Model

Consider a sequence \(x_1, x_2, \ldots, x_n\) of value from a certain domain. In this context we do not assume iid, ie, we assume that the next value is dependent of the previous values, like in time series. So, to predict \(x_{n+1}\) we would need to compute the following conditional:

\[p(x_{n+1} | x_1, x_2, \ldots, x_n)\]

which becomes exponentially complicated with \(n\).

A Markov Model is an assumption that \(x_{n+1}\) can be predicted (or approximated) just by knowing the previous value:

\[p(x_{n+1} | x_1, x_2, \ldots, x_n) \approx p(x_{n+1} | x_n)\]

This is also called a first-order markov chain. Higher orders would assume that we would need to know a certain number of previous values. This type of sequence is called a markov chain.

Graphically,

With this assumption, the joint distribution

\[p(x_1, x_2, \ldots, x_n) = p(x_1) p(x_2|x_1) \ldots p(x_n | x_1, x_2, \ldots, x_{n-1})\]

becomes much simpler

\[p(x_1, x_2, \ldots, x_n) = p(x_1) \prod_{t=2}^n p(x_t|x_{t-1})\]

So, what we need to define is the transition matrix that states the probabilities \(p(x_{t+1}=j|x_t=i)\) for all possible pairs of events.

The next R eg states a probability of tomorrow’s weather based on today’s weather:

A <- matrix(c(0.80, 0.05, 0.15,
              0.20, 0.60, 0.20,
              0.20, 0.30, 0.50), byrow=T, ncol=3)

rownames(A) <- c("sunny", "rainy", "foggy")
colnames(A) <- c("sunny", "rainy", "foggy")
A  # row is time T, col is time T+1
##       sunny rainy foggy
## sunny   0.8  0.05  0.15
## rainy   0.2  0.60  0.20
## foggy   0.2  0.30  0.50

We can ask questions like \(p(t_3 = \text{rainy} | t_1 = \text{foggy})\)?

In this case, we need to sum all the different probabilities for the different states at \(t_2\).

\[p(t_3 | t_1 = \text{foggy}) = \sum_{i=1}^3 p(t_3= \text{rainy}, t_2 = i | t_1 = \text{foggy}) = \\ \sum_{i=1}^3 p(t_3= \text{rainy} | t_2 = i) p(t_2 = i | t_1 = \text{foggy}) = \ldots = 0.34\]

However, this can quickly be computed with matrix multiplication:

initialState <- c(0, 0, 1)
initialState %*% A %*% A
##      sunny rainy foggy
## [1,]  0.32  0.34  0.34

Or using package markovchain (cf. vignette):

library(markovchain)

mcWeather <- new("markovchain", 
                 states = c("sunny", "rainy", "foggy"), 
                 byrow = T,
                 transitionMatrix = A, 
                 name = "Weather")

# p(state t3 | state t1 = foggy)?
initialState <- c(0, 0, 1)
after2Days <- initialState * (mcWeather^2)
after2Days
##      sunny rainy foggy
## [1,]  0.32  0.34  0.34
# another way to answer
conditionalDistribution(mcWeather^2, "foggy")
## sunny rainy foggy 
##  0.32  0.34  0.34

The package can do other nice stuff:

# sampling from a markov chain:
weathers <- rmarkovchain(n = 365, object = mcWeather, t0 = "sunny")
head(weathers)
## [1] "sunny" "sunny" "sunny" "sunny" "sunny" "sunny"
# fit a markov chain to a sequence:
fit <- markovchainFit(data = weathers, method = "mle") # MLE estimation
fit$estimate
## MLE Fit 
##  A  3 - dimensional discrete Markov Chain with following states 
##  foggy rainy sunny 
##  The transition matrix   (by rows)  is defined as follows 
##           foggy      rainy     sunny
## foggy 0.5981308 0.20560748 0.1962617
## rainy 0.2419355 0.53225806 0.2258065
## sunny 0.1435897 0.03589744 0.8205128
fit2 <- markovchainFit(data = weathers, method = "laplace", laplacian = 0.01) # Laplace smoothing
fit2$estimate
## Laplacian Smooth Fit 
##  A  3 - dimensional discrete Markov Chain with following states 
##  foggy rainy sunny 
##  The transition matrix   (by rows)  is defined as follows 
##           foggy      rainy     sunny
## foggy 0.5980566 0.20564328 0.1963001
## rainy 0.2419797 0.53216186 0.2258585
## sunny 0.1436189 0.03594319 0.8204379

Hidden Markov Model

A Hidden Markov Model (HMM) is an extension of Markov models where we introduce discrete variables \(z_i\) that we do not observe directly (called the latent or hidden states). We only observe some indirect evidence \(x_i\) that provide us with uncertain information about the states. States \(x_i\) can be discrete or continuous.

We still assume that the latent variables follows the Markov assumption, ie

\[p(z_1,\ldots,z_n) = p(z_1) \prod_{t=2}^n p(z_t|z_{t-1})\]

Now we must plug the observable variables. Again, we assume that the value of state \(x_i\) provides information only about \(z_i\), and is independent of all other states, ie, \(x_i \perp x_j\) and \(x_i \perp z_j\) where \(i \neq j\).

Graphically (shaded nodes represent observable variables) the HMM is described by:

In the HMM model, this is the joint distribution:

\[p(X,Z) = p(z_1) p(x_1|z_1) \prod_{t=2}^n p(z_t | z_{t-1}) p(x_t|z_t)\]

And, by Bayes Theorem:

\[p(Z|X) = \frac{p(X|Z) p(Z)}{p(X)} = \frac{\prod_{t=1}^n p(x_i|z_i) p(Z)}{p(X)}\]

where \(X=x_1,\ldots,x_n\) and \(Z=z_1,\ldots,z_n\). Most of these distributions must be given or be inferred from data.

The parameters can be divided as:

Let’s see an eg.

Assume that \(z_i \in \{-1,1\}\).

The transition probabilities are given by the next table

-1 +1
-1 0.800 0.200
+1 0.100 0.900

The emission probabilities are given by

\[x_i \sim \mathcal{N}(z_i,0.5^2)\]

And let’s assume that \(\pi(-1) = 1\), ie, it always starts at state -1

This next R snippet shows one possible set of latent values (which we will not know) and observable values produced by this model:

n <- 1000
t <- 1:n
sigma <- 0.3

# transition between value -1 and 1
Tr <- matrix(c(0.85,0.15,
               0.10,0.90), byrow=T, ncol=2)

set.seed(121)
mc <- new("markovchain", 
           states = c("-1", "+1"), 
           transitionMatrix = Tr,  byrow = T, name="eg")

zs <- as.numeric( rmarkovchain(n = n, object = mc, t0 = "-1") )
xs <- sapply(zs, function(zi) rnorm(1,zi,sigma))

plot(t[1:100], xs[1:100], pch=20)
points(t[1:100], zs[1:100], pch=18, col="red")
legend("bottomright", c("observable","latent"), col=1:2, pch=15, bty='n')

Usually, however, we only have the data and wish to infer the parameters.

Computation of the posterior marginals \(p(z_i)\) can be done using the forward-backward algorithm which is a dynamic programming application. The algorithm assumes that the transition, emission and the initial distributions are known.

The Viterbi algorithm is another dynamic programming technique to find the most likely sequence of hidden states that produce the known sequence of observable events.

For inference of the HMM’s parameters, we will use package depmixS4. This package is able to infer the previous parameters using the observable data xs, stating only that there are two states and assuming the emission distributions are gaussian ((cf. vignette).

library(depmixS4)

model     <- depmix(response=X ~ 1, data=data.frame(X=xs), nstates=2, family=gaussian())
fit_model <- fit(model, verbose=FALSE)
## converged at iteration 25 with logLik: -540.8507
summary(fit_model)
## Initial state probabilties model 
## pr1 pr2 
##   0   1 
## 
## 
## Transition matrix 
##             toS1       toS2
## fromS1 0.9113230 0.08867701
## fromS2 0.1410387 0.85896131
## 
## Response parameters 
## Resp 1 : gaussian 
##     Re1.(Intercept)    Re1.sd
## St1       1.0135439 0.3047032
## St2      -0.9980913 0.2816602
plot(ts(posterior(fit_model)[,1][1:100]))

Let’s recover the estimated parameters:

z_hat <- c(getmodel(fit_model, "response",1)@parameters$coefficients,
           getmodel(fit_model, "response",2)@parameters$coefficients)
z_hat
## (Intercept) (Intercept) 
##   1.0135439  -0.9980913
# we are assuming equal variance for both emission probabilities
sigma_hat <- mean(c(getmodel(fit_model, "response",1)@parameters$sd,
                    getmodel(fit_model, "response",2)@parameters$sd))
sigma_hat
## [1] 0.2931817
Tr_hat <- matrix( c(getmodel(fit_model, "transition",1)@parameters$coefficients,
                    getmodel(fit_model, "transition",2)@parameters$coefficients),
                  byrow=T, ncol=2)
Tr_hat <- round(Tr_hat,3)
Tr_hat
##       [,1]  [,2]
## [1,] 0.911 0.089
## [2,] 0.141 0.859
initial_state <- getmodel(fit_model, "prior")@parameters$coefficients
initial_state
## pr1 pr2 
##   0   1

The values found by the inference are quite near the real values.

Let’s make a resample from the fitted model, and compare with the actual data (we’ll use the same seed for the random generation):

# create a new markov chain object based on the inferred model
mc_hat <- new("markovchain", 
              states = c("z1", "z2"), 
              transitionMatrix = Tr_hat,  byrow = T, name="eg")

# create a sample from the estimated markov model object
set.seed(121)
zs_hat <- rmarkovchain(n = n, object = mc_hat, t0 = paste0("z",which.max(initial_state)))
zs_hat <- ifelse(zs_hat=="z1",z_hat[1],z_hat[2]) # convert from string to estimates
xs_hat <- sapply(zs_hat, function(zi) rnorm(1,zi,sigma_hat))

plot(t[1:100], xs_hat[1:100], pch=20, ylab="")
points(t[1:100], zs_hat[1:100], pch=18, col="red")
legend("bottomright", c("estimated observable","estimated latent"), col=1:2, pch=15, bty='n')