Kullback-Leibler Divergence Between Two Dirichlet (and Beta) Distributions

By | September 10, 2013

Recently I’ve been working on learning parameters of a mixture of Dirichlet distributions, I needed a measure to check how good my algorithm works on synthetic data. I was advised to use Kullback-Leibler divergence, but its derivation was a little difficult. Here is the derivation:

Dirichlet distribution is a multivariate distribution with parameters $\alpha=[\alpha_1, \alpha_2, … , \alpha_K]$, with the following probability density function
$$p(x;\alpha) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K x_k^{\alpha_k-1}$$
Kullback-Leibler divergence is defined as
$$KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx = \left < \log \frac{p(x)}{q(x)} \right>_{p(x)}$$
Let’s say we have two Dirichlet distributions \(p\) and \(q\), with parameters \(\alpha\) and \(\beta\) respectively. We write the KL divergence as
KL(p||q) &= \left < \log \frac{p(x)}{q(x)} \right>_{p(x)} \\
&= \left < \log p(x) - \log q(x) \right>_{p(x)} \\
&= \left < \log \Gamma(\alpha_0) - \sum_{k=1}^K \log \Gamma(\alpha_k) + \sum (\alpha_k-1) \log x_k \right . \\ & \quad \left . -\log \Gamma(\beta_0) + \sum_{k=1}^K \log \Gamma(\beta_k) - \sum (\beta_k-1) \log x_k \right >_{p(x)} \\
& = \log \Gamma(\alpha_0) – \sum_{k=1}^K \log \Gamma(\alpha_k) -\log \Gamma(\beta_0) \\
& \quad + \sum_{k=1}^K \log \Gamma(\beta_k) + \sum_{k=1}^K (\alpha_k – \beta_k) \left<\log x_k \right>_{p(x)}
where \(\alpha_0 = \sum_{k=1}^K \alpha_k \) and similarly \(\beta_0 = \sum_{k=1}^K \beta_k \).

Here, the geometric mean \(\left<\log x_k \right>_{p(x)}\) is equal to \(\psi(\alpha_k)-\psi(\alpha_0)\), where \(\psi\) is the digamma function. The details of calculating the geometric mean will be given below. Finally we have
KL(p||q) &= \log \Gamma(\alpha_0) – \sum_{k=1}^K \log \Gamma(\alpha_k) -\log \Gamma(\beta_0) + \sum_{k=1}^K \log \Gamma(\beta_k) + \sum_{k=1}^K (\alpha_k – \beta_k) (\psi(\alpha_k)-\psi(\alpha_0))

Matlab Code

The matlab code calculating the KL divergence is just a single expression. Given that alpha and beta are row vectors representing the two Dirichlet distribution parameters, the KL divergence is

D = gammaln(sum(alpha)) – gammaln(sum(beta)) – sum(gammaln(alpha)) + …
sum(gammaln(beta)) + (alpha – beta) * (psi(alpha) – psi(sum(alpha)))’;

You can also download it as a script here.

Calculating the Geometric Mean

Since \(\sum x = 1\), \(x\) has \(K-1\) degree of freedom. Therefore, we only need to take the integral over the first \(K-1\) components of \(x\):

\left<\log x_k \right>_{p(x)} &= \int \log x_k \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \log x_j \prod_{j}^K x_j^{\alpha_j-1} dx_{1:K-1}\\
&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \int \log x_j \prod_{j}^K x_j^{\alpha_j-1} dx_{1:K-1} \label{y1}\\
&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \int \frac{\partial}{\partial \alpha_k} \prod_{j}^K x_j^{\alpha_j-1} dx_{1:K-1} \label{y2}\\
&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \frac{\partial}{\partial \alpha_k} \int \prod_{j}^K x_j^{\alpha_j-1} dx_{1:K-1}\\
&= \frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \frac{\partial}{\partial \alpha_k} \left ( \frac{\prod_{j}^K \Gamma(\alpha_j)}{\Gamma(\alpha_0)} \right ) \label{y3} \\
&= \frac{\partial}{\partial \alpha_k} \log \left (\frac{\Gamma(\alpha_0)}{\prod_{j}^K \Gamma(\alpha_j)} \right ) \label{y4}\\
&= \frac{\partial}{\partial \alpha_k} \log \Gamma(\alpha_k) – \frac{\partial}{\partial \alpha_k} \log \Gamma(\alpha_0)\\
&= \psi(\alpha_k)-\psi(\alpha_0)

We used the following properties:

  • From \eqref{y1} to \eqref{y2}: $$ \log a x^a = \frac{d}{d a} x^{a} \nonumber $$
  • From \eqref{y3} to \eqref{y4}:: $$ \frac{\Gamma'(x)}{\Gamma(x)} = \frac{d}{d x} \log \Gamma(x) = \psi(x) \nonumber $$
  • The remaining steps are straightforward.


    I thank my colleagues Deniz Akyildiz and Hakan Guldas for the “fruitful discussions” 🙂

    I also thanks Wikipedia writers of the pages of Dirichlet Distribution, Beta Distribution, Beta, Gamma and Digamma functions 🙂

    One thought on “Kullback-Leibler Divergence Between Two Dirichlet (and Beta) Distributions

    Leave a Reply

    Your email address will not be published. Required fields are marked *