An EM Exercise

statistics • learning • algorithms

The Expectation-Maximization (EM) algorithm is a popular method to obtain the Maximum Likelihood Estimate (MLE) when some of the data may be missing. See [Roche, 2012] for a nice tutorial on EM. The following problem is a nice exercise in working out the algorithm details to reinforce the concepts.

Problem

Let $X_1, \dots, X_n$ be i.i.d. from a normal distribution with unknown mean $\mu$ and variance $1$ . Suppose that negative values of $X_i$ are truncated at $0$ , so that instead of $X_i$ , we actually observe

$Y_i = \max(0, X_i), \quad i \in \{1,2, \dots, n\},$

from which we would like to estimate $\mu$ . By reordering, assume that $Y_1, \dots, Y_m > 0$ and $Y_{m+1} = \dots = Y_n = 0$ . Use the EM algorithm to estimate $\mu$ from $Y_1, \dots, Y_n$ .

EM Solution

Let $\phi(x)$ and $\Phi(x)$ denote the probability density function (pdf) and the cumulative distribution function (cdf), respectively, of the standard normal distribution . Let $\mathbf{x}$ denote the unobserved data and $\mathbf{y}$ denote the observed data. Then the joint pdf $f(\mathbf{y}, \mathbf{x} \vert \mu)$ of $(\mathbf{Y}, \mathbf{X})$ and the complete data likelihood is given by

$\begin{align} f(\mathbf{y}, \mathbf{x} \vert \mu) &= L(\mu \vert \mathbf{y},\mathbf{x}) \\ &= \prod_{i=1}^m \phi(y_i - \mu)\prod_{j=m+1}^n \phi(x_j - \mu). \end{align}$

We may obtain the pdf $g(\mathbf{y} \vert \mu)$ of the incomplete data, and hence the incomplete data likelihood, by integrating out $\mathbf{x}$ . Thus,

$\begin{align} g(\mathbf{y} \vert \mu) &= L(\mu \vert \mathbf{y}) \\ &= \prod_{i=1}^m \phi(y_i - \mu)\prod_{j=m+1}^n \int_{-\infty}^0 \phi(x_j - \mu) \, \mathrm{d}x_j \\ &= \prod_{i=1}^m \phi(y_i - \mu) \left[\Phi(-\mu)\right]^{n-m}. \end{align}$

From the above expression, we may obtain the conditional pdf of $\mathbf{X}$ given $\mathbf{y}$ and $\mu$ as

$\begin{align} k(\mathbf{x}\vert \mu, \mathbf{y}) &= \frac{f(\mathbf{y}, \mathbf{x} \vert \mu)}{g(\mathbf{y} \vert \mu)} \\ &= \left[\Phi(-\mu)\right]^{m-n} \prod_{j=m+1}^n \phi(x_j - \mu). \end{align}$

E-step

In the E-step, we compute the expected complete data log likelihood given the current estimate of the mean (say $\mu_r$ ) and the observed data, with the expectation taken over the unobserved data. In the M-step we shall find the value of the mean that maximizes this expected likelihood, denoted by $\mu_{r+1}$ , thereby forming a sequence that converges to the MLE.

Since the expectation in the E-step is used only to compute the value of $\mu$ that maximizes it, we may ignore the terms that do not contain $\mu$ . Thus

$\begin{align} \mathbb{E}\left[\log L(\mu \vert \mathbf{y},\mathbf{X}) \vert \mu_r, \mathbf{y}\right] &= \int_{\mathbf{x}} \log L(\mu \vert \mathbf{y},\mathbf{x}) k(\mathbf{x}\vert \mu_r, \mathbf{y})\, \mathrm{d}\mathbf{x}\\ &\propto -\left[ \sum_{i=1}^m {(y_i - \mu)}^2 + \frac{1}{\Phi(-\mu_r)}\sum_{j=m+1}^n \int_{-\infty}^0 {(x_j - \mu)}^2 \phi(x_j - \mu_r) \, \mathrm{d}x_j\right]. \end{align}$

To simplify the above expression, use the identity

$\int x^2 \phi(x) \, \mathrm{d}x = \Phi(x) - x\phi(x) + C,$

and a bit of algebra to yield

$\begin{align} \mathbb{E}\left[\log L(\mu \vert \mathbf{y},\mathbf{X}) \vert \mu_r, \mathbf{y}\right] &\propto -\left[ \sum_{i=1}^m {(y_i - \mu)}^2 + (n-m) {\left( \mu_r - \mu - \frac{\phi(\mu_r)}{\Phi(-\mu_r)}\right)}^2 \right]. \end{align}$

M-step

The estimate of $\mu$ that maximizes the above expression can be computed in multiple ways. One easy method is to notice that the functional form of the log likelihood has the same structure as that of the log likelihood of $f(\mathbf{y}, \mathbf{x} \vert \mu)$ , with each element of the vector $\mathbf{x}$ replaced by $\mu_r - \frac{\phi(\mu_r)}{\Phi(-\mu_r)}$ . Thus, using the expression for the MLE estimate of a normal distribution, we have

$\mu_{r+1} = \frac{1}{n} \left[\sum_{i=1}^m y_i + (n-m) \left( \mu_r - \frac{\phi(\mu_r)}{\Phi(-\mu_r)}\right)\right].$

This recursive formula allows us to compute the sequence $\{ \mu_1, \mu_2, \dots \}$ for some initial value $\mu_0$ .

Further Analysis

Let $\hat{\mu}$ denote the MLE estimate of the incomplete data likelihood. In other words,

$\hat{\mu} = \underset{\mu}{\text{arg max }} L(\mu \vert \mathbf{y}).$

Since $\hat{\mu}$ maximizes the log likelihood, the first derivative of the log likelihood evaluated at $\hat{\mu}$ is zero. Writing this down and rearranging the terms, we see that $\hat{\mu}$ satisfies the equation

$m \hat{\mu} = \sum_{i=1}^m y_i - (n-m)\frac{\phi(\hat{\mu})}{\Phi(-\hat{\mu})}.$

We can use the above expression to verify that the MLE estimate $\hat{\mu}$ is a fixed point of the recursion obtain in our M-step. Specifically, substituting $\mu_r = \hat{\mu}$ in the aforementioned recursion, we get

$\begin{align} \mu_{r+1} &= \frac{1}{n} \left[\sum_{i=1}^m y_i - (n-m)\frac{\phi(\hat{\mu})}{\Phi(-\hat{\mu})} + (n -m) \hat{\mu}\right] \\ &= \frac{1}{n} \left[m \hat{\mu} + (n-m) \hat{\mu} \right] \\ &= \hat{\mu}, \end{align}$

which proves that $\hat{\mu}$ is a fixed point of the recursion.

Convergence

See [Casella and Berger, 2002] for a proof that that the EM sequence converges in general to the incomplete data MLE. However, it is instructive to analyse the convergence of the EM sequence in this problem.

From the recursion for the EM sequence and the equation for the MLE, we see that

$\begin{align} (\hat{\mu} - \mu_{r+1}) = \frac{(n-m)}{n} \left[ (\hat{\mu} - \mu_r) + \frac{\phi(\mu_r)}{\Phi(-\mu_r)} - \frac{\phi(\hat{\mu})}{\Phi(-\hat{\mu})} \right]. \end{align}$

The ratio of the pdf and cdf in the above expression is known as the inverse Mills ratio. Define

$h(x) = \frac{\phi(x)}{\Phi(-x)}.$

The Mills ratio is fairly well studied in the literature. From [Sampord, 1953] we see that the inequality

$0 < h'(x) < 1,$

holds for $h'(x)$ , the derivative of $h(x)$ .

Assume that $\hat{\mu} > \mu_r$ . Then using the mean value theorem, there exists some $c \in (\mu_r, \hat{\mu})$ such that

$\begin{align} h(\hat{\mu}) - h(\mu_r) = h'(c) (\hat{\mu} - \mu_{r+1}). \end{align}$

This allows us the write the difference between the MLE estimate and the EM sequence as

$\begin{align} (\hat{\mu} - \mu_{r+1}) &= \frac{(n-m)}{n} (1 - h'(c)) (\hat{\mu} - \mu_r) \\ &< \frac{(n-m)}{n} (\hat{\mu} - \mu_r). \end{align}$

Thus, provided that at least one of the observations is not truncated, we obtain

$0 \le (\hat{\mu} - \mu_{r+1}) < (\hat{\mu} - \mu_r),$

which shows that $\mu_r \to \hat{\mu}$ , for any starting point $\mu_0 < \hat{\mu}$ . By a similar analysis it can be shown that

$0 \le ( \mu_{r+1} - \hat{\mu}) < (\mu_r - \hat{\mu}),$

for any starting point $\mu_0 > \hat{\mu}$ . Thus, $\mu_r \to \hat{\mu}$ for any starting point $\mu_0$ , thereby establishing that the EM sequence converges to the MLE estimate.

Nachikethas A. Jagadeesan