
Linear Regression: Maximum Likelihood Estimation (MLE)

Topic
Machine Learning
Difficulty
L9

Problem Details

Consider the linear regression setting. Given a training set $\mathcal{D} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$, assume that $y_1, \ldots, y_N$ are conditionally independent given their inputs. Our goal is to estimate the parameters $\theta$ of the linear regression model.

The maximum likelihood estimator is defined as

$$\theta_{\mathrm{MLE}} \in \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta).$$

Assuming a Gaussian noise model, derive a closed-form expression for $\theta_{\mathrm{MLE}}$.

Hint: work with the log-likelihood, and use the fact that maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.


Explanation

Assume the standard Gaussian noise model of classical linear regression:

$$y_n \mid x_n, \theta \sim \mathcal{N}(x_n^{\top}\theta, \sigma^2), \quad n = 1, \ldots, N.$$

Under conditional independence, the likelihood factorizes as

$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^{\top}\theta, \sigma^2).$$

Taking the logarithm gives the negative log-likelihood

$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - x_n^{\top}\theta)^2,$$

in which only the second term depends on $\theta$. Maximizing the log-likelihood is therefore equivalent to minimizing the sum of squared residuals:

$$\theta_{\mathrm{MLE}} \in \arg\min_{\theta} \sum_{n=1}^{N}(y_n - x_n^{\top}\theta)^2 = \arg\min_{\theta}\, \|y - X\theta\|^2,$$

where $X \in \mathbb{R}^{N \times D}$ stacks the inputs $x_n^{\top}$ as rows and $y := (y_1, \ldots, y_N)^{\top}$.
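To sanity-check this expansion numerically, here is a small sketch comparing the summed per-point Gaussian log-density (via scipy.stats.norm, an extra dependency not implied by the original) with the closed-form negative log-likelihood above; all data and names are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Small illustrative problem (sizes and parameters are hypothetical).
N, D = 5, 2
X = rng.normal(size=(N, D))
theta = rng.normal(size=D)
sigma = 0.3
y = X @ theta + sigma * rng.normal(size=N)

# log p(Y | X, theta) = sum_n log N(y_n | x_n^T theta, sigma^2)
log_lik = norm.logpdf(y, loc=X @ theta, scale=sigma).sum()

# Closed-form negative log-likelihood from the expansion above.
nll = 0.5 * N * np.log(2 * np.pi * sigma**2) \
    + ((y - X @ theta) ** 2).sum() / (2 * sigma**2)

print(np.allclose(log_lik, -nll))  # True
```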

Differentiate the objective and set the gradient to zero:

$$\nabla_{\theta}\, \|y - X\theta\|^2 = -2X^{\top}(y - X\theta) = 0 \;\Longrightarrow\; X^{\top}X\theta = X^{\top}y.$$

These are the normal equations. Since the Hessian $2X^{\top}X$ is positive semidefinite, the objective is convex, so any stationary point is a global minimum.

XXX^{\top}X 可逆,则

θMLE=(XX)1Xy.\boxed{\theta_{\mathrm{MLE}}=(X^{\top}X)^{-1}X^{\top}y}.
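A minimal NumPy sketch of this solution on synthetic data (the names `X`, `y`, `theta_mle` are illustrative, not from the original). In practice, solving the normal equations with a linear solver is cheaper and numerically more stable than forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 points in D = 3 dimensions with a known theta.
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # Gaussian noise model

# Closed form via the explicit inverse: (X^T X)^{-1} X^T y.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: solve the normal equations X^T X theta = X^T y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(theta_inv, theta_mle))  # True
print(theta_mle)                          # close to theta_true
```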

If $X^{\top}X$ is not invertible (e.g. when the columns of $X$ are linearly dependent), the normal equations still admit solutions, and the Moore–Penrose pseudoinverse yields the minimum-norm least-squares solution $\theta_{\mathrm{MLE}} = X^{+}y$.
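For the rank-deficient case, a short follow-up sketch (again with illustrative names): `np.linalg.pinv` and `np.linalg.lstsq` both return the minimum-norm least-squares solution on a design matrix where the explicit inverse would fail:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient design: the third column duplicates the first,
# so X^T X is singular and (X^T X)^{-1} does not exist.
N = 50
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
X = np.column_stack([x1, x2, x1])
y = 2.0 * x1 - 1.0 * x2 + 0.05 * rng.normal(size=N)

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse.
theta_pinv = np.linalg.pinv(X) @ y

# lstsq returns the same minimum-norm solution for rank-deficient systems.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_pinv, theta_lstsq))  # True
```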

