**Abstract:**

In this paper, we develop a framework to estimate network flow length distributions in terms of the number of packets. We model the network flow length data as a three-way array with day-of-week, hour-of-day, and flow length as the dimensions along which we observe counts. In a high-speed network, only a sampled version of such an array can be observed, and reconstructing the true flow statistics from fewer observations becomes a computational problem. We formulate the sampling process as matrix multiplication, so any sampling method can be used in our framework as long as its sampling probabilities can be written in matrix form. We demonstrate our framework on a high-volume real-world data set collected from a mobile network provider, using random packet sampling and flow-based packet sampling. We show that modeling the network data as a tensor improves estimates of the true flow length histogram under both sampling methods.

**Abstract:**

Session Initiation Protocol (SIP), as one of the most common signaling mechanisms for Voice over Internet Protocol (VoIP) applications, is a popular target for flooding-based Distributed Denial of Service (DDoS) attacks. In this paper, we propose a DDoS attack detection framework based on the Bayesian multiple change model, which can detect different types of flooding attacks. Additionally, we propose a probabilistic SIP network simulation system that provides a test environment for network security tools.

Here is the poster we are going to present.

There are already serious numerical libraries written by professionals, and in this project we use them. Our C++ library simply makes the most frequently used operations, such as matrix multiplications, normalizations, divergences, and random sampling from common probability distributions, easier to code. We use GSL, CBLAS and LAPACK under the hood.

Our library is a header-only library, since usability is the main focus. We are developing it using powerful C++11 features like rvalue references for efficiently dealing with large memory blocks. You can download the current version of the library from its GitHub page. Below is a sample code using the library:

#include <pml.hpp>
#include <pml_random.hpp>

using namespace pml;

int main(){
  Matrix M = Uniform::rand(4,5);   // 4x5 uniform matrix
  Vector v = Poisson::rand(1, 4);  // Vector of length 4, entries
                                   // drawn from Poisson with mean 1
  Vector v2 = Dot(M, v);           // Matrix-Vector Dot Product
  Vector v3 = Normalize(v2);       // Normalize the result
  std::cout << v3;                 // Print vector v3 on screen
  v3.save("/tmp/v3.txt");          // Save v3 to text file.
  return 0;
}

\begin{align}

\tau &\sim \mathcal{G}(\tau; \alpha, \beta) \\

\mu|\tau &\sim \mathcal{N}(\mu; \mu_0, 1/(n_0 \tau)) \\

x_i &\sim \mathcal{N}(x_i; \mu, \tau^{-1}) \quad \text{for } i \in [1, N] \\

y &= \{x_i | x_i > T\}

\end{align}

where $\mathcal{G}$ and $\mathcal{N}$ are the Gamma and Normal distributions respectively:

\begin{align}

\mathcal{N}(x; \mu, \tau^{-1}) &= \sqrt{ \frac{\tau}{2\pi}} \exp \left( -\frac{\tau}{2} (x-\mu)^2 \right) \\

\mathcal{G}(w; \alpha, \beta) &= \frac{\beta^\alpha}{\Gamma(\alpha)} w^{(\alpha-1)} \exp(-w \beta)

\end{align}

Let’s create an example:
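The generative process above can be simulated in a few lines of plain Python (the hyperparameter values below are arbitrary choices for this sketch):

```python
import random

def sample_censored_gaussian(alpha, beta, mu0, n0, N, T, seed=0):
    """Draw (mu, tau) from the prior, sample x_1..x_N, keep only x_i > T."""
    rng = random.Random(seed)
    tau = rng.gammavariate(alpha, 1.0 / beta)        # tau ~ Gamma(alpha, beta), rate parameterization
    mu = rng.gauss(mu0, (1.0 / (n0 * tau)) ** 0.5)   # mu | tau ~ N(mu0, 1/(n0*tau))
    x = [rng.gauss(mu, tau ** -0.5) for _ in range(N)]
    y = [xi for xi in x if xi > T]                   # censoring: only samples above T are observed
    return mu, tau, x, y

mu, tau, x, y = sample_censored_gaussian(2.0, 1.0, 0.0, 1.0, N=1000, T=0.0)
```

Note that Python's `random.gammavariate` takes a scale parameter, so the rate $\beta$ is inverted.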

Given all data points, $x_{1:N}$, the likelihood of $\mu$ and $\tau$ is

\begin{align}

p(x_{1:N}| \mu, \tau) \propto \tau^{N/2} \exp \left( -\frac{\tau}{2} \sum_{n=1}^N (x_n - \mu)^2\right)

\end{align}

The posterior for $\mu$ and $\tau$ is proportional to:

\begin{align}

p(\mu, \tau | x_{1:N} ) \propto p(x_{1:N} | \mu, \tau) p(\mu | \tau) p(\tau)

\end{align}

For the derivation, see this technical report or, better, Murphy's report:

\begin{align}

\tau | x_{1:N} &\sim \mathcal{G} \left( \alpha + \frac{N}{2},\quad \beta + \frac{1}{2} \sum_{i=1}^N (x_i - \bar{x})^2 + \frac{N n_0}{2(N+n_0)}(\bar{x} - \mu_0)^2 \right) \\

\mu | \tau, x_{1:N} &\sim \mathcal{N} \left( \frac{N}{N+n_0} \bar{x} + \frac{n_0}{N+n_0} \mu_0,\quad \frac{1}{(N+n_0)\tau} \right)

\end{align}

where $\bar{x} = \sum_i x_i / N$ is the sample mean.
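For fully observed data, these conjugate updates are a few lines of arithmetic. A quick sketch (the data and hyperparameter values are made up):

```python
def gauss_gamma_posterior(x, alpha, beta, mu0, n0):
    """Posterior hyperparameters of (mu, tau) given fully observed x_1..x_N."""
    N = len(x)
    xbar = sum(x) / N
    ssq = sum((xi - xbar) ** 2 for xi in x)
    alpha_n = alpha + N / 2.0
    beta_n = beta + 0.5 * ssq + (N * n0) / (2.0 * (N + n0)) * (xbar - mu0) ** 2
    mu_n = (N * xbar + n0 * mu0) / (N + n0)  # posterior mean of mu
    n_n = N + n0                             # mu | tau, x ~ N(mu_n, 1/(n_n * tau))
    return alpha_n, beta_n, mu_n, n_n

a_n, b_n, m_n, n_n = gauss_gamma_posterior([1.0, 2.0, 3.0],
                                           alpha=1.0, beta=1.0, mu0=0.0, n0=1.0)
```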

But if we use the above formulas for $p(\mu,\tau|y)$, we won't be able to estimate the parameters of the censored Gaussian, because they ignore the censoring.

Our problem is to find the posterior distribution of the $\{\mu, \tau\}$ pair given the observations $y_{1:M}$. First, we need to write down the probability of a single observation $y_i$. The probability of observing $y_i$ is zero for $y_i \leq T$, and proportional to the Gaussian density otherwise:

\begin{align}

p(y_i;\mu, \tau) = \frac{1}{Z}\, \mathcal{N}(y_i; \mu, \tau^{-1})\, [y_i > T]

\end{align}

We can calculate the normalizing constant as

\begin{align}

\int_{-\infty}^\infty p(y_i;\mu, \tau)\, dy_i &= 1 \\

\frac{1}{Z} \int_{T}^\infty \mathcal{N}(y_i; \mu, \tau^{-1})\, dy_i &= 1 \\

\frac{1}{Z} \left[ 1-\int_{-\infty}^T \mathcal{N}(y_i; \mu, \tau^{-1})\, dy_i \right]&= 1 \\

\frac{1}{Z} \left[ 1- \frac{1}{2} \left[ 1 + \operatorname{erf} \left( -\mu\sqrt{\frac{\tau}{2}} \right) \right] \right]&= 1 \qquad (\text{taking } T=0 \text{ as in the example}) \\

Z &= \frac{1}{2} \left[1 - \operatorname{erf} \left( -\mu\sqrt{\frac{\tau}{2}} \right) \right]

\end{align}

We can calculate the posterior distribution for the model parameters as follows

\begin{align}

p(\mu, \tau | y_{1:M}) &\propto p(\mu) p(\tau)p(y_{1:M} | \mu, \tau) \\

&\propto p(\mu) p(\tau)\prod_{i=1}^M p(y_i | \mu, \tau)

\end{align}

We can maximize this posterior to get a point estimate for the model parameters. A closed-form solution for this maximization is difficult to obtain due to the Gaussian error function (erf) in the formula, but we can draw samples from the posterior distribution via the Metropolis-Hastings algorithm.

In order to sample $\{\mu, \tau\}$ pairs from the posterior $p(\mu, \tau | y_{1:M})$, we introduce the following proposal

\begin{align}

q(\mu', \tau'| \mu, \tau) = q_1(\tau'| \tau) q_2(\mu'| \mu,\tau')

\end{align}

such that

\begin{align}

q_1(\tau'| \tau) &= \mathcal{G}(\tau'; \tau, a_0 \beta) \\

q_2(\mu'| \mu,\tau' ) &= \mathcal{N}(\mu'; \mu, b_0 \tau')

\end{align}

Here we propose $\tau'$ from a Gamma distribution whose expectation is $\tau$. The variance of $\tau'$ is controlled by $a_0 \beta$: a larger $a_0$ results in longer jumps. Similarly, $\mu'$ is drawn from a Gaussian distribution centered at $\mu$, whose variance is controlled by $b_0 \tau'$. Again, a larger $b_0$ results in longer jumps. We calculate the acceptance ratio as follows:

\begin{align}

a &= \frac{p(\mu', \tau' | y_{1:M}) q(\mu, \tau | \mu', \tau') }{p(\mu, \tau | y_{1:M}) q(\mu', \tau' | \mu, \tau)}

\end{align}
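A minimal, stdlib-only sketch of this sampler. The exact parameterization of the proposals above is ambiguous, so as an assumption I take the Gamma proposal to have mean $\tau$ with shape $a_0$, and the Gaussian proposal to have variance $b_0\tau'$ as written; the hyperparameter values are arbitrary:

```python
import math, random

def log_norm_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_gamma_pdf(w, shape, rate):
    return shape * math.log(rate) - math.lgamma(shape) + (shape - 1) * math.log(w) - rate * w

def log_posterior(mu, tau, y, T, alpha, beta, mu0, n0):
    """Unnormalized log p(mu, tau | y) for the censored Gaussian model."""
    if tau <= 0:
        return float("-inf")
    lp = log_gamma_pdf(tau, alpha, beta)           # prior on tau
    lp += log_norm_pdf(mu, mu0, 1.0 / (n0 * tau))  # prior on mu | tau
    Z = 0.5 * (1.0 - math.erf((T - mu) * math.sqrt(tau / 2.0)))  # P(x > T)
    if Z <= 0.0:
        return float("-inf")
    for yi in y:
        lp += log_norm_pdf(yi, mu, 1.0 / tau) - math.log(Z)
    return lp

def mh_censored(y, T, alpha=2.0, beta=1.0, mu0=0.0, n0=1.0,
                a0=50.0, b0=0.1, n_iter=1000, seed=1):
    rng = random.Random(seed)
    mu, tau = mu0, alpha / beta
    lp = log_posterior(mu, tau, y, T, alpha, beta, mu0, n0)
    samples = []
    for _ in range(n_iter):
        tau_p = rng.gammavariate(a0, tau / a0)       # Gamma proposal with mean tau
        mu_p = rng.gauss(mu, math.sqrt(b0 * tau_p))  # Gaussian proposal centered at mu
        lp_p = log_posterior(mu_p, tau_p, y, T, alpha, beta, mu0, n0)
        # the proposal is not symmetric, so include the q terms in the ratio
        lq_fwd = log_gamma_pdf(tau_p, a0, a0 / tau) + log_norm_pdf(mu_p, mu, b0 * tau_p)
        lq_rev = log_gamma_pdf(tau, a0, a0 / tau_p) + log_norm_pdf(mu, mu_p, b0 * tau)
        if rng.random() < math.exp(min(0.0, lp_p - lp + lq_rev - lq_fwd)):
            mu, tau, lp = mu_p, tau_p, lp_p
        samples.append((mu, tau))
    return samples
```

Burn-in, thinning, and step-size tuning are left out for brevity.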

We get the following result:

The observations $y_{1:M}$ are generated by sampling from a Gaussian with parameters $\mu, \tau$ and accepting a drawn sample if it is above the threshold. This looks like rejection sampling for $p(y_i|\mu, \tau)$ using a Gaussian proposal $\mathcal{N}(y_i;\mu, \tau^{-1})$. The latent $x$ vector contains both the accepted and rejected samples. We can estimate the total number of proposed particles $N$, i.e. the length of $x$, depending on the parameters $\{\mu, \tau\}$ and the number of accepted samples $M$.

We call the event $x_i > T$ an *accept event* and $a_{\mu,\tau}$ the probability of an accept event under the parameters $\mu$ and $\tau$, which is calculated as

\begin{align}

a_{\mu,\tau} := P(x>T) = 1 - P(x\leq T) = \frac{1}{2} \left[ 1 - \operatorname{erf} \left( -\mu\sqrt{\frac{\tau}{2}} \right) \right]

\end{align}

For given $M \leq N$, the number of accept events is $M$, each occurring with probability $a_{\mu,\tau}$, and the number of reject events is $N-M$, each with probability $1-a_{\mu,\tau}$. Then, for $N$ trials, the probability of $M$ accepts is given by the Binomial distribution

\begin{align}

p(M|N,a_{\mu,\tau}) = \binom{N}{M} a_{\mu,\tau}^{M} (1-a_{\mu,\tau})^{N-M}

\end{align}

When we know the number of accept events $M$, the total number of trials is distributed as

\begin{align}

p(N|M,a_{\mu,\tau}) \propto p(M|N,a_{\mu,\tau}) p(N)

\end{align}

If we take $p(N)$ flat, we have $p(N|M,a_{\mu,\tau}) \propto p(M|N,a_{\mu,\tau})$.
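With a flat prior on $N$ (up to some cutoff $N_{\max}$, which is an arbitrary choice for this sketch), this posterior is easy to tabulate numerically:

```python
from math import comb

def posterior_N(M, a, N_max):
    """p(N | M, a) over N = M..N_max: flat prior times the Binomial likelihood."""
    Ns = range(M, N_max + 1)
    w = [comb(N, M) * a ** M * (1.0 - a) ** (N - M) for N in Ns]
    Z = sum(w)
    return {N: wi / Z for N, wi in zip(Ns, w)}

post = posterior_N(M=10, a=0.5, N_max=100)
n_map = max(post, key=post.get)  # most probable total number of trials
```

For $M=10$ and $a_{\mu,\tau}=0.5$ the posterior mode sits around $N \approx M/a = 20$, as expected.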

You can download this blog post as an IPython Notebook which includes a Python implementation (requires NumPy, SciPy and Matplotlib).

The digamma function is the logarithmic derivative of the gamma function, defined for the positive real numbers. When you are working with Beta and Dirichlet distributions, you see it frequently. Furthermore, if you want to estimate the parameters of a Dirichlet distribution, you need to take the inverse of the digamma function.

I was using the inverse digamma function (*invpsi*) of Paul Flacker (code is here). Since my experiments were taking a very long time, I profiled my Matlab code and found out that most of the execution time was spent in the *invpsi* function. I started to search for a better technique and found it in Minka's technical report. (This is an excellent report, by the way. Thanks.)

Our problem is to find $x$, given $y$, such that $y=\psi(x)$. The solution is

\begin{align}

x^* = \arg\!\min_x \left( \psi(x)-y \right)^2

\end{align}

The Newton iteration for this problem is

\begin{align}

x^{(n+1)} = x^{(n)} - \frac{\psi(x^{(n)})-y}{\psi'(x^{(n)})}

\end{align}

In order to minimize the number of Newton iterations, we need to find a good initial value for $x$. The $\psi(x)$ function can be approximated by

\begin{align}

\psi(x) \approx g(x) = \left \{ \begin{matrix} \log(x-1/2) & \text{if } x\geq 0.6 \\ -\frac{1}{x}+\psi(1) & \text{if } x<0.6 \end{matrix} \right .
\end{align}
Therefore we initialize $x$ as
\begin{align}
x^{(1)} = \left \{ \begin{matrix} \exp(y)+1/2 & \text{if } y\geq -2.22 \\ -1/(y-\psi(1)) & \text{if } y<-2.22 \end{matrix} \right .
\end{align}

Here is the Matlab implementation. I used only 3 Newton iterations, which is quite satisfactory.

function Y=invpsi(X)
%INVPSI - Inverse digamma (psi) function.
%
%   The digamma function is the derivative of the log gamma function. This
%   function calculates the value Y > 0 for a value X such that
%   digamma(Y) = X. The code takes 3 Newton steps after initializing Y
%   according to a good approximation of the digamma function.
%
%   Source: Thomas Minka, Estimating a Dirichlet distribution,
%           Technical Report 2012, Appendix C.
%
% Change History :
% Date           Time      Prog
% 16-Sep-2013    23:40 PM  Baris Kurt
% Bogazici University, Dept. of Computer Eng. 80815 Bebek Istanbul Turkey
% e-mail : bariskurt@gmail.com

% initial estimate
M = double(X >= -2.22);
Y = M .* (exp(X) + 0.5) + (1-M) .* -1./(X - psi(1));

% make 3 Newton iterations:
Y = Y - (psi(Y)-X)./psi(1,Y);
Y = Y - (psi(Y)-X)./psi(1,Y);
Y = Y - (psi(Y)-X)./psi(1,Y);
end

You can also download it as a script invpsi.m
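For completeness, here is a stdlib-only Python version of the same idea. Since Python's standard library has neither digamma nor trigamma, both are approximated below with the usual recurrence-plus-asymptotic-series trick (this helper is my addition, not part of Minka's report):

```python
import math

def digamma(x):
    """psi(x): shift x up to >= 6 by the recurrence, then an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def trigamma(x):
    """psi'(x), needed for the Newton step."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1/6.0 - f * (1/30.0 - f / 42.0))

def invpsi(y, n_steps=5):
    """Solve digamma(x) = y using Minka's initialization plus Newton steps."""
    if y >= -2.22:
        x = math.exp(y) + 0.5
    else:
        x = -1.0 / (y - digamma(1.0))  # digamma(1) = -(Euler-Mascheroni constant)
    for _ in range(n_steps):
        x -= (digamma(x) - y) / trigamma(x)
    return x
```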

Recently I've been working on learning the parameters of a mixture of Dirichlet distributions, and I needed a measure to check how well my algorithm works on synthetic data. I was advised to use the Kullback-Leibler divergence, but its derivation was a little difficult. Here is the derivation:
### Matlab Code

### Calculating the Geometric Mean

From \eqref{y1} to \eqref{y2}: $$ x^a \log x = \frac{d}{d a} x^{a} \nonumber $$
From \eqref{y3} to \eqref{y4}: $$ \frac{\Gamma'(x)}{\Gamma(x)} = \frac{d}{d x} \log \Gamma(x) = \psi(x) \nonumber $$
### Thanks

The Dirichlet distribution is a multivariate distribution with parameters $\alpha=[\alpha_1, \alpha_2, \ldots, \alpha_K]$ and the following probability density function

$$p(x;\alpha) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K x_k^{\alpha_k-1}$$

Kullback-Leibler divergence is defined as

$$KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx = \left < \log \frac{p(x)}{q(x)} \right>_{p(x)}$$

Let’s say we have two Dirichlet distributions \(p\) and \(q\), with parameters \(\alpha\) and \(\beta\) respectively. We write the KL divergence as

\begin{align}

KL(p||q) &= \left < \log \frac{p(x)}{q(x)} \right>_{p(x)} \\

&= \left < \log p(x) - \log q(x) \right>_{p(x)} \\

&= \left < \log \Gamma(\alpha_0) - \sum_{k=1}^K \log \Gamma(\alpha_k) + \sum (\alpha_k-1) \log x_k \right . \\
& \quad \left . -\log \Gamma(\beta_0) + \sum_{k=1}^K \log \Gamma(\beta_k) - \sum (\beta_k-1) \log x_k \right >_{p(x)} \\

& = \log \Gamma(\alpha_0) - \sum_{k=1}^K \log \Gamma(\alpha_k) -\log \Gamma(\beta_0) \\

& \quad + \sum_{k=1}^K \log \Gamma(\beta_k) + \sum_{k=1}^K (\alpha_k - \beta_k) \left<\log x_k \right>_{p(x)}

\end{align}

where \(\alpha_0 = \sum_{k=1}^K \alpha_k \) and similarly \(\beta_0 = \sum_{k=1}^K \beta_k \).

Here, the geometric mean \(\left<\log x_k \right>_{p(x)}\) is equal to \(\psi(\alpha_k)-\psi(\alpha_0)\), where \(\psi\) is the digamma function. The details of calculating the geometric mean will be given below. Finally we have

\begin{align*}

KL(p||q) &= \log \Gamma(\alpha_0) - \sum_{k=1}^K \log \Gamma(\alpha_k) -\log \Gamma(\beta_0) + \sum_{k=1}^K \log \Gamma(\beta_k) + \sum_{k=1}^K (\alpha_k - \beta_k) (\psi(\alpha_k)-\psi(\alpha_0))

\end{align*}

The Matlab code calculating the KL divergence is just a single expression. Given that *alpha* and *beta* are row vectors holding the parameters of the two Dirichlet distributions, the KL divergence is

D = gammaln(sum(alpha)) - gammaln(sum(beta)) - sum(gammaln(alpha)) + ...
    sum(gammaln(beta)) + (alpha - beta) * (psi(alpha) - psi(sum(alpha)))';

You can also download it as a script here.
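The same computation in stdlib-only Python, for reference (the digamma helper is a standard series approximation, since Python's `math` module does not provide one):

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x, then an asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def dirichlet_kl(alpha, beta):
    """KL(p || q) between Dirichlet(alpha) and Dirichlet(beta)."""
    a0, b0 = sum(alpha), sum(beta)
    kl = math.lgamma(a0) - math.lgamma(b0)
    kl -= sum(math.lgamma(a) for a in alpha)
    kl += sum(math.lgamma(b) for b in beta)
    kl += sum((a - b) * (digamma(a) - digamma(a0)) for a, b in zip(alpha, beta))
    return kl
```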

Since \(\sum_k x_k = 1\), \(x\) has \(K-1\) degrees of freedom. Therefore, we only need to take the integral over the first \(K-1\) components of \(x\):

\begin{align}

\left<\log x_k \right>_{p(x)} &= \int \log x_k \, \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \prod_{j}^K x_j^{\alpha_j-1} \, dx_{1:K-1}\\

&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \int \log x_k \prod_{j}^K x_j^{\alpha_j-1} \, dx_{1:K-1} \label{y1}\\

&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \int \frac{\partial}{\partial \alpha_k} \prod_{j}^K x_j^{\alpha_j-1} \, dx_{1:K-1} \label{y2}\\

&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \frac{\partial}{\partial \alpha_k} \int \prod_{j}^K x_j^{\alpha_j-1} \, dx_{1:K-1}\\

&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \frac{\partial}{\partial \alpha_k} \left ( \frac{\prod_{j}^K \Gamma(\alpha_j)}{\Gamma(\alpha_0)} \right ) \label{y3} \\

&= \frac{\partial}{\partial \alpha_k} \log \left (\frac{\prod_{j}^K \Gamma(\alpha_j)}{\Gamma(\alpha_0)} \right ) \label{y4}\\

&= \frac{\partial}{\partial \alpha_k} \log \Gamma(\alpha_k) – \frac{\partial}{\partial \alpha_k} \log \Gamma(\alpha_0)\\

&= \psi(\alpha_k)-\psi(\alpha_0)

\end{align}

We used the two properties listed above, in the "Calculating the Geometric Mean" section.

The remaining steps are straightforward.

I thank my colleagues Deniz Akyildiz and Hakan Guldas for the "fruitful discussions".

I also thank the Wikipedia writers of the pages on the Dirichlet distribution, the Beta distribution, and the Beta, Gamma and Digamma functions.

Here I present my Matlab implementation for learning Markov mixtures with the EM algorithm. David Barber's book explains the model and the EM derivation very nicely, and also provides a Matlab code package for it. Both the book and the codes are open; you are recommended to have a look at them. However, Barber's code is written for readability, not for performance. I've tried to write my code as efficiently as possible. Additionally, if you are uncomfortable with the Expectation-Maximization algorithm, you can read this wonderful tutorial. It's very detailed and contains additional side information to help you understand the method fully.
## The Model

## Expectation-Maximization Derivation

### Expectation Step

### Maximization Step

### Calculating the likelihood

## The code and the demo

Here I explain the model and the EM derivations in my own terms. I have also added a Matlab code which is optimised for speed.

We are going to work with discrete-state Markov chains, so let's call the cardinality of our discrete alphabet \(D\). A Markov model is defined by initial state probabilities \(\pi\) and state transition probabilities \(A\):

\[

\begin{align}

\pi(i) &= p(x(1)=i)\\

A(i,j) &= p(x(t)=i \mid x(t-1)=j)

\end{align}

\]

The likelihood of observing a sequence \(x\) is written as

\[

\begin{align}

p(x \mid \pi,A) &= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{t=2}^{T} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{[x(t)=i][x(t-1)=j]}

\end{align}

\]

where \([\cdot]\) is the indicator function, which takes the value 1 if the expression inside it is true and 0 otherwise. I'm happier when I re-formulate this equation by getting rid of the product over the sequence length:

\[

\begin{align}

p(x \mid \pi, A) &= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{\sum_{t=2}^{T}[x(t)=i][x(t-1)=j]} \\

&= \prod_{i=1}^D \pi(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A(i,j)^{s(i,j)}

\end{align}

\]

where \(s(i,j)\) counts the transitions into state \(i\) from state \(j\). The counts \(s\), together with the initial state \(x(1)\), form the sufficient statistics for the sequence \(x\). Once we collect them at the beginning of the computation, the likelihood calculations become much shorter.
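Collecting the sufficient statistics and evaluating the log-likelihood from them can be sketched in plain Python (states are assumed to be integers \(0,\ldots,D-1\); this mirrors the formulas, not the Matlab code):

```python
import math

def transition_counts(x, D):
    """s[i][j] = number of steps t with x(t)=i and x(t-1)=j."""
    s = [[0] * D for _ in range(D)]
    for t in range(1, len(x)):
        s[x[t]][x[t - 1]] += 1
    return s

def log_likelihood(x, pi, A):
    """log p(x | pi, A) using only the initial state and the transition counts."""
    D = len(pi)
    s = transition_counts(x, D)
    ll = math.log(pi[x[0]])
    for i in range(D):
        for j in range(D):
            if s[i][j]:
                ll += s[i][j] * math.log(A[i][j])  # A[i][j] = p(next=i | prev=j)
    return ll
```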

Our observations \(x=\{x_1,x_2,\ldots,x_N\}\) are generated by a mixture of Markov models, meaning that each sequence \(x_n\) is generated by one of \(K\) Markov models. The probability that \(x_n\) is generated by the \(k^{th}\) Markov model is denoted as

\[

p(z_n=k) = \alpha(k)

\]

From now on, we use the subscripts as enumerators, and the numbers in the parenthesis as array indices. So, \(x_n\) is the \(n^{th}\) sequence and \(x_n(t)\) is its \(t^{th}\) state. If there’s no subscript on a symbol, it denotes the whole set: \( x = \{x_1,\ldots,x_N\} \). Furthermore, we introduce \(\theta_k = \{\pi_k, A_k\}\) to denote the set of a Markov model parameters (for ease of notation).

In the mixture model, we calculate the likelihood of a sequence \(x\) given model parameters as

\[

\begin{align}

p(x|\theta) &= \sum_{k=1}^K p(z=k) p(x|z=k,\theta_k) \\

&= \sum_{k=1}^K \alpha(k) \left ( \prod_{i=1}^D \pi_k(i)^{[x(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A_k(i,j)^{s(i,j)} \right)

\end{align}

\]

Given the observations \(x=\{x_1,x_2,\ldots,x_N\}\), we want to learn the parameters of Markov models \(\theta=\{\theta_1,\theta_2,\ldots,\theta_K\}\). Maximum likelihood estimation for this is

\[

\arg\!\max_\theta \log p(x \mid \theta) = \arg\!\max_\theta \log \sum_z p(z \mid \theta) p(x \mid z,\theta)

\]

This is not easy to maximize since there’s a summation over all possible latent state assignments (\(K^N\) possibilities) inside a logarithm. Therefore we need to maximize a tight lower bound for this:

\[

\begin{align}

L(\theta \mid \theta^{(t)}) &= \underbrace{\sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta)}_{E(\theta \mid \theta^{(t)})} - \underbrace{\sum_z p(z \mid x,\theta^{(t)}) \log p(z \mid x,\theta^{(t)})}_{H(\theta^{(t)})}

\end{align}

\]

where \(E(\theta \mid \theta^{(t)})\) is the energy part, and \(H(\theta^{(t)})\) is the entropy part and \(\theta^{(t)}\) is the maximum likelihood estimation of \(\theta\) at the \(t^{th}\) iteration. We will be able to maximize this bound easily to get the next estimate of the model parameters:

\[

\theta^{(t+1)} = \arg\!\max_{\theta} L(\theta \mid \theta^{(t)})

\]

The bound will be redefined by using the new expectation \(p(z \mid x,\theta^{(t+1)})\). And the process will continue until the likelihood converges.

The expectation \(p(z \mid x,\theta^{(t)})\) can be factorized as \(\prod_{n} p(z_n \mid x_n,\theta^{(t)})\) since the observations are independent given \(\theta^{(t)}\). We calculate each expectation of \(z_n\) as

\[

\begin{align}

p(z_n \mid x_n,\theta^{(t)}) &= \frac{p(x_n \mid z_n,\theta^{(t)})p(z_n)}{p(x_n \mid \theta^{(t)})} \propto p(z_n) p(x_n \mid z_n, \theta^{(t)})

\end{align}

\]

In the Matlab code, we calculate \(p(z_n=k \mid x_n, \theta^{(t)})\) for each \(k=1\ldots K\), and normalize to get the distribution \(p(z_n \mid x_n, \theta^{(t)})\).

We are going to maximize the bound \(L(\theta \mid \theta^{(t)})\) wrt. \(\pi_k\)’s and \(A_k(i,j)\)’s. The entropy term can be omitted, since it’s only a function of \(\theta^{(t)}\), therefore constant. So, we only need to maximize the energy term:

\[

\begin{align}

\theta^{(t+1)} &= \arg\!\max_{\theta} E(\theta \mid \theta^{(t)}) \\

&= \arg\!\max_{\theta} \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta)

\end{align}

\]

Let's write the energy explicitly. First, the joint likelihood \(p(x,z \mid \theta)\):

\[

\begin{align}

\log p(x,z \mid \theta) &= \log p(x \mid z, \theta) + \log p(z \mid \theta) \\

&= \log \prod_{n=1}^N p(x_n \mid z_n, \theta) + \log \prod_{n=1}^N p(z_n \mid \theta) \\


&= \sum_{n=1}^N \log p(x_n \mid z_n, \theta) + \sum_{n=1}^N \log p(z_n \mid \theta) \\

&= \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \left( \prod_{i=1}^D \pi_k(i)^{[x_n(1)=i]} \prod_{i=1}^D \prod_{j=1}^D A_k(i,j)^{s_n(i,j)} \right) \\

& \quad + \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \alpha(k) \\

&= \sum_{n=1}^N \sum_{k=1}^K [z_n=k] \left ( \sum_{i=1}^D [x_n(1)=i] \log \pi_k(i) + \sum_{i=1}^D \sum_{j=1}^D s_n(i,j) \log A_k(i,j) \right)\\

& \quad +\sum_{n=1}^N \sum_{k=1}^K [z_n=k] \log \alpha(k)

\end{align}

\]

Now, let’s write the expectation of this joint likelihood.

\[

\begin{align}

E(\theta \mid \theta^{(t)}) &= \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta) \\

&= \sum_{n=1}^N \sum_{k=1}^K p(z_n=k \mid x_n, \theta^{(t)}) \left ( \sum_{i=1}^D [x_n(1)=i] \log \pi_k(i) + \sum_{i=1}^D \sum_{j=1}^D s_n(i,j) \log A_k(i,j) \right) \\

& \quad + \sum_{n=1}^N \sum_{k=1}^K p(z_n=k \mid x_n, \theta^{(t)}) \log \alpha(k)

\end{align}

\]

How did we come to that? It’s not easy to see immediately. Keep in mind that \(\sum_z\) is an abuse of notation:

\[

\begin{align}

\sum_z = \sum_{z_1=1}^{K}\sum_{z_2=1}^{K}\ldots \sum_{z_N=1}^{K}

\end{align}

\]

If you do the algebra, you will arrive at the above equation. Now, let's maximize the energy wrt. \(\pi_k(i)\) to find \(\pi_k(i)^{(t+1)}\) (by using Lagrange multipliers):

\[

\begin{align}

\frac{\partial}{\partial \pi_k(i)} \left ( E(\theta \mid \theta^{(t)}) + \lambda( \sum_{j=1}^D \pi_k(j)-1) \right ) &=0 \\

\frac{\partial}{\partial \pi_k(i)} \left ( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \log \pi_k(i) + \lambda \pi_k(i) \right ) &=0 \\

\frac{1}{\pi_k(i)} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \right) + \lambda &=0 \\

-\frac{1}{\lambda} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=i] \right) &= \pi_k(i)

\end{align}

\]

We calculate \(\lambda\) as:

\[

\begin{align}

1 &= \sum_{j=1}^D \pi_k(j) \\

&= \sum_{j=1}^D -\frac{1}{\lambda} \left( \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})[x_n(1)=j] \right) \\

\lambda &= - \sum_{j=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) [x_n(1)=j]

\end{align}

\]

Finally we arrive at:

\[

\pi_k^{(t+1)}(i) = \frac{ \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})[x_n(1)=i] }{\sum_{j=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)} )[x_n(1)=j]}

\]

This is easy to compute: we calculate the numerator for all \(i \in \{1,\ldots,D\}\) and normalize. Similarly, we can derive the following formulas:

\[

\begin{align}

A_k^{(t+1)}(i,j) &= \frac{\sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) s_n(i,j) }{\sum_{m=1}^D \sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)}) s_n(m,j)} \\

\alpha^{(t+1)}(k) &= \frac{\sum_{n=1}^N p(z_n=k \mid x_n,\theta^{(t)})}{ \sum_{w=1}^K \sum_{n=1}^N p(z_n=w \mid x_n,\theta^{(t)}) }

\end{align}

\]
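Putting the E and M steps together, here is a compact, unoptimized Python sketch of the whole EM loop (random initialization and plain lists; the Matlab implementation below vectorizes these steps instead). States are assumed to be integers \(0,\ldots,D-1\):

```python
import math, random

def markov_mixture_em(seqs, K, D, n_iter=30, seed=0):
    rng = random.Random(seed)
    # sufficient statistics: initial state and transition counts s[i][j] (j -> i)
    stats = []
    for x in seqs:
        s = [[0] * D for _ in range(D)]
        for t in range(1, len(x)):
            s[x[t]][x[t - 1]] += 1
        stats.append((x[0], s))

    def rand_dist(n):
        w = [rng.random() + 0.5 for _ in range(n)]
        z = sum(w)
        return [v / z for v in w]

    alpha = rand_dist(K)                                      # mixture weights
    pi = [rand_dist(D) for _ in range(K)]                     # initial-state distributions
    A = [[rand_dist(D) for _ in range(D)] for _ in range(K)]  # A[k][j][i] = p(i | j)
    hist = []
    for _ in range(n_iter):
        # E-step: responsibilities p(z_n = k | x_n, theta), via log-sum-exp
        R, ll = [], 0.0
        for x1, s in stats:
            lw = []
            for k in range(K):
                v = math.log(alpha[k]) + math.log(pi[k][x1])
                for i in range(D):
                    for j in range(D):
                        if s[i][j]:
                            v += s[i][j] * math.log(A[k][j][i])
                lw.append(v)
            m = max(lw)
            w = [math.exp(u - m) for u in lw]
            z = sum(w)
            ll += m + math.log(z)
            R.append([wi / z for wi in w])
        hist.append(ll)
        # M-step: closed-form updates from the responsibility-weighted counts
        totals = [sum(r[k] for r in R) for k in range(K)]
        alpha = [t / sum(totals) for t in totals]
        for k in range(K):
            num = [sum(R[n][k] for n, (x1, _) in enumerate(stats) if x1 == i)
                   for i in range(D)]
            zp = sum(num)
            pi[k] = [v / zp for v in num]
            for j in range(D):
                col = [sum(R[n][k] * s2[i][j] for n, (_, s2) in enumerate(stats))
                       for i in range(D)]
                zc = sum(col)
                if zc > 0:
                    A[k][j] = [c / zc for c in col]
    return alpha, pi, A, hist
```

The returned `hist` holds the log-likelihood before each M-step, so it should be non-decreasing across iterations.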

After every iteration, we should check the likelihood \(p(x \mid \theta)\) to see whether the process has converged to a solution. This likelihood has to be non-decreasing; if you see a decrease in the likelihood, you have certainly done something wrong.

We can calculate the likelihood at the \(t^{th}\) step as

\[

p(x \mid \theta^{(t)}) = \prod_{n=1}^N p(x_n \mid \theta^{(t)}) = \prod_{n=1}^N \sum_{k=1}^K p(x_n \mid z_n=k, \theta^{(t)}) p(z_n=k \mid \theta^{(t)})

\]

The logarithm of this likelihood should be equal to \(L(\theta^{(t)} \mid \theta^{(t)})\). That is

\[

L(\theta^{(t)} \mid \theta^{(t)}) = \sum_z p(z \mid x,\theta^{(t)}) \log p(x,z \mid \theta^{(t)}) - \sum_z p(z \mid x,\theta^{(t)}) \log p(z \mid x,\theta^{(t)})

\]

Here I present my Matlab implementation of the EM algorithm, which I believe runs quite fast. Instead of processing each sequence to calculate the likelihoods at each iteration, I calculate the sufficient statistics for each sequence at the beginning. Therefore, every sequence, independent of its length, is represented by a \(D\times D\) matrix, where \(D\) is the dimension of the input space, and each likelihood is calculated with \(D^2\) multiplications. For calculating the sufficient statistics, I avoid for-loops by using the accumarray function. Also, in the E and M steps, I avoid for-loops by using matrix multiplications.

Sample Data: sample_data.mat

Code: mixMarkovEM.m

Executing the code with the sample data is straightforward. There’s a nice help at the top of the code.

Bon Appétit.