Here I explain the model and the EM derivations in my own terms. I have also added Matlab code that is optimised for speed.
The Model
We are going to work with discrete-state Markov chains, so let's call the cardinality of our discrete alphabet \(D\). A Markov model is defined by an initial state probability vector \(\pi\) and a state transition probability matrix \(A\):
\[
\begin{align}
\pi(i) &= p(x(1)=i)\\
A(i,j) &= p(x(t)=i \mid x(t-1)=j)
\end{align}
\]
The likelihood of observing a sequence \(x\) is written as
\[
\begin{align}
p(x \mid \pi,A) &= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{t=2}^{T} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{[x(t)=i][x(t-1)=j]}
\end{align}
\]
where \([\cdot]\) is the indicator function, which takes the value 1 if the expression inside it is true and 0 otherwise. I'm happier when I re-formulate this equation by getting rid of the product over the sequence length:
\[
\begin{align}
p(x \mid \pi, A) &= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{\sum_{t=2}^{T}[x(t)=i][x(t-1)=j]} \\
&= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{s(i,j)}
\end{align}
\]
where \(s(i,j) = \sum_{t=2}^{T}[x(t)=i][x(t-1)=j]\) is the number of transitions from state \(j\) to state \(i\). The counts \(s\), together with the initial state \(x(1)\), form the sufficient statistics of the sequence \(x\). Once we collect them at the beginning of the computation, the likelihood calculations become much shorter.
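Collecting these counts is essentially a one-liner in Matlab. Below is a minimal sketch for a single sequence; the variable names are illustrative and not necessarily the ones used in mixMarkovEM.m:

D  = 4;                          % alphabet size (assumed for this example)
xn = [1 2 2 3 1 4 4 2];          % an example sequence
% s(i,j) counts how often state j is immediately followed by state i
s  = accumarray([xn(2:end)' xn(1:end-1)'], 1, [D D]);
x1 = xn(1);                      % the initial state completes the sufficient statistics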
Our observations \(x=\{x_1,x_2,\ldots,x_N\}\) are generated by a mixture of Markov models, meaning that each sequence \(x_n\) is generated by one of \(K\) Markov models. The probability that \(x_n\) is generated by the \(k^{th}\) Markov model is denoted as
\[
p(z_n=k) = \alpha(k)
\]
From now on, we use subscripts as enumerators and the numbers in parentheses as array indices. So, \(x_n\) is the \(n^{th}\) sequence and \(x_n(t)\) is its \(t^{th}\) state. If there is no subscript on a symbol, it denotes the whole set: \( x = \{x_1,\ldots,x_N\} \). Furthermore, for ease of notation, we introduce \(\theta_k = \{\pi_k, A_k\}\) to denote the parameter set of a single Markov model.
In the mixture model, we calculate the likelihood of a sequence \(x\) given model parameters as
\[
\begin{align}
p(x|\theta) &= \sum_{k=1}^K p(z=k) p(x|z=k,\theta_k) \\
&= \sum_{k=1}^K \alpha(k) \left ( \prod_{i=1}^D \pi_k(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A_k(i,j)^{s(i,j)} \right)
\end{align}
\]
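As a quick illustration (not the vectorised code itself), this mixture likelihood of a single sequence can be evaluated directly from its sufficient statistics from the sketch above. Here Pi is assumed to be a \(D\times K\) matrix of initial distributions, A a \(D\times D\times K\) array of transition matrices and alpha the mixture weights; again, these names are mine, not necessarily those in mixMarkovEM.m:

lik = 0;
for k = 1:K
    likk = Pi(x1,k) * prod(prod(A(:,:,k).^s));   % pi_k(x(1)) * prod over i,j of A_k(i,j)^s(i,j)
    lik  = lik + alpha(k) * likk;                % weight by p(z = k) = alpha(k)
end

For long sequences this product underflows, so in practice the computation is better done in the log domain (see the likelihood section below).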
Expectation-Maximization Derivation
Given the observations \(x=\{x_1,x_2,\ldots,x_N\}\), we want to learn the parameters of Markov models \(\theta=\{\theta_1,\theta_2,\ldots,\theta_K\}\). Maximum likelihood estimation for this is
\[
\arg\!\max_\theta \log p(x \mid \theta) = \arg\!\max_\theta \log \sum_z p(z \mid \theta) p(x \mid z,\theta)
\]
This is not easy to maximize since there’s a summation over all possible latent state assignments (\(K^N\) possibilities) inside a logarithm. Therefore we need to maximize a tight lower bound for this:
\[
\begin{align}
L(\theta \mid \theta^{(t)}) &= \underbrace{\sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta)}_{E(\theta \mid \theta^{(t)})} \underbrace{- \sum_z p(z \mid x,\theta^{(t)}) \log p(z \mid x,\theta^{(t)})}_{H(\theta^{(t)})}
\end{align}
\]
where \(E(\theta \mid \theta^{(t)})\) is the energy part, \(H(\theta^{(t)})\) is the entropy part, and \(\theta^{(t)}\) is the maximum likelihood estimate of \(\theta\) at the \(t^{th}\) iteration. We will be able to maximize this bound easily to get the next estimate of the model parameters:
\[
\theta^{(t+1)} = \arg\!\max_{\theta} L(\theta \mid \theta^{(t)})
\]
The bound is then redefined using the new posterior \(p(z \mid x,\theta^{(t+1)})\), and the process continues until the likelihood converges.
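To see where the bound comes from, introduce an arbitrary distribution \(q(z)\) over the latent assignments and apply Jensen's inequality:
\[
\begin{align}
\log p(x \mid \theta) = \log \sum_z q(z) \frac{p(x,z \mid \theta)}{q(z)} \geq \sum_z q(z) \log p(x,z \mid \theta) - \sum_z q(z) \log q(z)
\end{align}
\]
Choosing \(q(z) = p(z \mid x, \theta^{(t)})\) gives exactly \(L(\theta \mid \theta^{(t)})\), and the bound is tight at \(\theta = \theta^{(t)}\) because there the ratio \(p(x,z \mid \theta^{(t)})/q(z) = p(x \mid \theta^{(t)})\) does not depend on \(z\).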
Expectation Step
The posterior \(p(z \mid x,\theta^{(t)})\) factorizes as \(\prod_{n} p(z_n \mid x_n,\theta^{(t)})\) since the observations are independent given \(\theta^{(t)}\). We calculate the posterior of each \(z_n\) as
\[
\begin{align}
p(z_n \mid x_n,\theta^{(t)}) &= \frac{p(x_n \mid z_n,\theta^{(t)})\,p(z_n)}{p(x_n \mid \theta^{(t)})} \propto p(z_n)\, p(x_n \mid z_n, \theta^{(t)})
\end{align}
\]
In the Matlab code, we calculate \(p(z_n=k \mid x_n, \theta^{(t)})\) for each \(k=1\ldots K\), and normalize to get the distribution \(p(z_n \mid x_n, \theta^{(t)})\).
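A sketch of this step, assuming the transition counts of all sequences are stacked in a \(D\times D\times N\) array S and the initial states in a vector X1 (again, my own illustrative names, not necessarily those in mixMarkovEM.m). Working in the log domain avoids the underflow issue mentioned in the comments below:

% E-step sketch: resp(n,k) = p(z_n = k | x_n, theta)
logResp = zeros(N, K);
for k = 1:K
    logA  = log(max(A(:,:,k), realmin));           % guard against log(0)
    logPi = log(max(Pi(:,k), realmin));
    for n = 1:N
        logResp(n,k) = log(alpha(k)) + logPi(X1(n)) + sum(sum(S(:,:,n) .* logA));
    end
end
m    = max(logResp, [], 2);                        % normalise each row with log-sum-exp
resp = exp(bsxfun(@minus, logResp, m));
resp = bsxfun(@rdivide, resp, sum(resp, 2));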
Maximization Step
We are going to maximize the bound \(L(\theta \mid \theta^{(t)})\) with respect to the \(\pi_k\)'s and the \(A_k(i,j)\)'s. The entropy term can be omitted since it is only a function of \(\theta^{(t)}\) and therefore constant. So, we only need to maximize the energy term:
\[
\begin{align}
\theta^{(t+1)} &= \arg\!\max_{\theta} E(\theta \mid \theta^{(t)}) \\
&= \arg\!\max_{\theta} \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta)
\end{align}
\]
Let’s write the energy explicitly. First the joint likelihood \(p(x,z \mid \theta)\).
\[
\begin{align}
\log p(x,z \mid \theta) &= \log p(x \mid z, \theta) + \log p(z \mid \theta) \\
&= \log \prod_{n=1}^N p(x_n \mid z_n, \theta) + \log \prod_{n=1}^N p(z_n \mid \theta) \\
&= \sum_{n=1}^N \log p(x_n \mid z_n, \theta) + \sum_{n=1}^N \log p(z_n \mid \theta) \\
&= \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \left( \prod_{i=1}^D \pi_k(i)^{[x_n(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A_k(i,j)^{s_n(i,j)} \right) \\
& \quad + \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \alpha(k) \\
&= \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \left ( \sum_{i=1}^D [x_n(1)=i] \log \pi_k(i) + \sum_{i=1}^D \sum_{j=1}^D s_n(i,j) \log A_k(i,j) \right)\\
& \quad +\sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \alpha(k)
\end{align}
\]
Now, let’s write the expectation of this joint likelihood.
\[
\begin{align}
E(\theta \mid \theta^{(t)}) &= \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta) \\
&= \sum_{n=1}^N \sum_{k=1}^K p(z_n=k \mid x_n, \theta^{(t)}) \left ( \sum_{i=1}^D [x_n(1)=i] \log \pi_k(i) + \sum_{i=1}^D \sum_{j=1}^D s_n(i,j) \log A_k(i,j) \right) \\
& \quad + \sum_{n=1}^N \sum_{k=1}^K p(z_n=k \mid x_n, \theta^{(t)}) \log \alpha(k)
\end{align}
\]
How did we come to that? It’s not easy to see immediately. Keep in mind that \(\sum_z\) is an abuse of notation:
\[
\begin{align}
\sum_z = \sum_{z_1=1}^{K}\sum_{z_2=1}^{K}\ldots \sum_{z_N=1}^{K}
\end{align}
\]
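Concretely, each term of the joint log-likelihood depends on only one \(z_n\) and has the form \(\sum_k [z_n=k]\, f_n(k)\) for some function \(f_n\). Because the posterior factorizes over the sequences, summing such a term over all assignments collapses every other sum to one:
\[
\begin{align}
\sum_z p(z \mid x,\theta^{(t)}) \sum_{k=1}^K [z_n=k]\, f_n(k)
&= \sum_{z_n=1}^{K} p(z_n \mid x_n,\theta^{(t)}) \sum_{k=1}^K [z_n=k]\, f_n(k) \prod_{m \neq n} \underbrace{\sum_{z_m=1}^{K} p(z_m \mid x_m,\theta^{(t)})}_{=1} \\
&= \sum_{k=1}^{K} p(z_n=k \mid x_n,\theta^{(t)})\, f_n(k)
\end{align}
\]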
Applying this identity to every term of the joint log-likelihood and summing over \(n\) yields the equation above. Now, let's maximize the energy with respect to \(\pi_k(i)\) to find \(\pi_k^{(t+1)}(i)\) (by using Lagrange multipliers):
\[
\begin{align}
\frac{\partial}{\partial \pi_k(i)} \left ( E(\theta \mid \theta^{(t)}) + \lambda( \sum_{j=1}^D \pi_k(j)-1) \right ) &=0 \\
\frac{\partial}{\partial \pi_k(i)} \left ( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \log \pi_k(i) + \lambda \pi_k(i) \right ) &=0 \\
\frac{1}{\pi_k(i)} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \right) + \lambda &=0 \\
-\frac{1}{\lambda} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \right) &= \pi_k(i)
\end{align}
\]
We calculate \(\lambda\) as:
\[
\begin{align}
1 &= \sum_{j=1}^D \pi_k(j) \\
&= \sum_{j=1}^D -\frac{1}{\lambda} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})[x_n(1)=j] \right) \\
\lambda &= -\sum_{j=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=j]
\end{align}
\]
Finally we arrive at:
\[
\pi_k^{(t+1)}(i) = \frac{ \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})[x_n(1)=i] }{\sum_{j=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)} )[x_n(1)=j]}
\]
This is easy to compute: we calculate the numerator for all \(i \in \{1,\ldots,D\}\) and normalize. Similarly, we can derive the following update formulas:
\[
\begin{align}
A_k^{(t+1)}(i,j) &= \frac{\sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) s_n(i,j) }{\sum_{m=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) s_n(m,j)} \\
\alpha^{(t+1)}(k) &= \frac{\sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})}{ \sum_{w=1}^K \sum_{n=1}^N p(z_n=w \mid x_n,\theta^{(t)}) }
\end{align}
\]
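A sketch of these updates, reusing the assumed names from the E-step sketch (resp for the \(N\times K\) responsibility matrix, S for the count tensor, X1 for the initial states):

% M-step sketch: closed-form updates from the responsibilities
alpha = sum(resp, 1)' / N;                            % mixture weights; denominator is N since each row of resp sums to 1
Pi = zeros(D, K);
A  = zeros(D, D, K);
for k = 1:K
    Pi(:,k) = accumarray(X1(:), resp(:,k), [D 1]);    % responsibility-weighted initial-state counts
    Pi(:,k) = Pi(:,k) / sum(Pi(:,k));
    for n = 1:N
        A(:,:,k) = A(:,:,k) + resp(n,k) * S(:,:,n);   % responsibility-weighted transition counts
    end
    colSum = sum(A(:,:,k), 1);
    colSum(colSum == 0) = 1;                          % avoid 0/0 for columns with no observations
    A(:,:,k) = bsxfun(@rdivide, A(:,:,k), colSum);    % normalise columns (j -> i convention)
end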
Calculating the likelihood
After every iteration, we should check the likelihood \(p(x \mid \theta)\) to see whether the process has converged to a solution. This likelihood has to be non-decreasing; if you see a decrease in the likelihood, you have definitely done something wrong.
We can calculate the likelihood at the \(t^{th}\) step as
\[
p(x \mid \theta^{(t)}) = \prod_{n=1}^N p(x_n \mid \theta^{(t)}) = \prod_{n=1}^N \sum_{k=1}^K p(x_n \mid z_n=k, \theta^{(t)}) p(z_n=k \mid \theta^{(t)})
\]
The logarithm of this quantity should be equal to \(L(\theta^{(t)} \mid \theta^{(t)})\), since the bound is tight at the current estimate. That is
\[
L(\theta^{(t)} \mid \theta^{(t)}) = \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta^{(t)}) - \sum_z p(z \mid x,\theta^{(t)}) \log p(z \mid x,\theta^{(t)})
\]
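In practice the product over sequences underflows very quickly, so it is safer to monitor the log-likelihood \(\log p(x \mid \theta^{(t)})\) and to compute each inner sum with the log-sum-exp trick. A sketch, reusing the unnormalised per-sequence, per-component log terms logResp from the E-step sketch above:

% logResp(n,k) = log alpha(k) + log p(x_n | z_n = k, theta)
m  = max(logResp, [], 2);
LL = sum(m + log(sum(exp(bsxfun(@minus, logResp, m)), 2)));   % log p(x | theta)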
The code and the demo
Here I present my Matlab implementation of the EM algorithm, which I believe runs quite fast. Instead of processing each full sequence to calculate the likelihoods at every iteration, I compute the sufficient statistics for each sequence at the beginning. Therefore every sequence, independent of its length, is represented by a \(D\times D\) matrix, where \(D\) is the dimension of the input space, and each likelihood is calculated with \(D^2\) multiplications. For calculating the sufficient statistics, I avoid for-loops by using the accumarray function. In the E and M steps, I also avoid for-loops by using matrix multiplications.
Sample Data: sample_data.mat
Code: mixMarkovEM.m
Executing the code with the sample data is straightforward; there is a help section at the top of the code.
Bon appétit.
Hi Baris, nice work. When I tried your program with the sample data, sometimes the output does not make sense. For example,
[alpha,pi,A,z,LL] = mixMarkovEM(x,4,100,'maxIter', 1000, 'verbose',true,'threshold', 1e-5);
all the alphas are NaN. Is there an underflow issue in the code?
Another question: how can I use your code to predict the next state based on the current sequence? For example, if I have 4,1,2,3 in hand, how do I get the next state? Thanks.
Hi James,
I had corrected this problem but forgot to upload it to my blog. It's fixed; now you can download the code again. The problem was this: sometimes a column of the transition matrix becomes zero, which happens when we have no observations for that column. When we try to normalize a column of zeros, 0/0 gives NaN. After the correction, the normalization function returns a column of zeros instead of a column of NaNs (correction at line 132). A much better solution would be to regularize the system by introducing Dirichlet priors for the transition matrices. That way, the transition matrix estimates should not end up with a column of zeros at all. I'll work on that.
Also, when using my sample data, your third input argument must be 4, not 100, because there are 4 mixture components.
You can calculate a distribution for the next state of a sequence x_n using the estimated transition matrices A and the estimated mixture labels z:
\[
\begin{align}
p(x_n(t+1)=i) &= \sum_{k=1}^K p(z_n=k)\, p(x_n(t+1)=i \mid x_n(t)=j, z_n=k) \\
&= \sum_{k=1}^K z(n,k)\, A(i,j,k)
\end{align}
\]
where \(j = x_n(t)\) is the last observed state.
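A sketch of that prediction, under the assumption that z is an \(N\times K\) matrix of posterior mixture probabilities and A the \(D\times D\times K\) array of transition matrices (if the code returns hard labels instead, replace z(n,k) with the indicator of the assigned component):

% predictive distribution over the next state of sequence n
jLast = xn(end);                             % last observed state, e.g. xn = [4 1 2 3] gives jLast = 3
pNext = zeros(D, 1);
for k = 1:K
    pNext = pNext + z(n,k) * A(:,jLast,k);   % column jLast of A_k is p(next state | current = jLast)
end
[~, iNext] = max(pNext);                     % most likely next state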
Hi
This is a very nice explanation. Your code is very useful for my work. How can I modify this code if the sequences are of different lengths? Thanks in advance.
Hi Punit,
I don't have Matlab currently installed on my computer, but I think the code should work with sequences of different lengths. At the beginning of the code there is a section for calculating the sufficient statistics; the S tensor holds the number of state changes for each sequence. If you need to modify anything, look at that computation. The rest should stay the same.
Cheers,
Baris.