
Linear Regression: Maximum Likelihood Estimation (MLE)

Topic
Machine Learning
Difficulty
L9

Problem Details

Consider the linear regression setting. Given a training set $\mathcal{D} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$, assume that $y_1, \ldots, y_N$ are conditionally independent given their inputs. Our goal is to estimate the parameters $\theta$ of the linear regression model.

The maximum likelihood estimator is defined as

$$\theta_{\mathrm{MLE}} \in \arg\max_{\theta}\, p(\mathcal{Y} \mid \mathcal{X}, \theta).$$

Assuming a Gaussian noise model, derive a closed-form expression for $\theta_{\mathrm{MLE}}$.

Hint: work with the log-likelihood, and use the fact that maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.


Explanation

Assume the standard Gaussian noise model of classical linear regression:

$$y_n \mid x_n, \theta \sim \mathcal{N}(x_n^{\top}\theta, \sigma^2), \quad n = 1, \ldots, N.$$

Under conditional independence, the likelihood factorizes as

$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^{\top}\theta, \sigma^2).$$

Taking the logarithm gives the negative log-likelihood

$$-\log p(\mathcal{Y} \mid \mathcal{X}, \theta) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - x_n^{\top}\theta)^2,$$

in which only the second term depends on $\theta$. Maximizing the log-likelihood is therefore equivalent to minimizing the sum of squared residuals:

$$\theta_{\mathrm{MLE}} \in \arg\min_{\theta} \sum_{n=1}^{N}(y_n - x_n^{\top}\theta)^2 = \arg\min_{\theta}\, \|y - X\theta\|^2,$$

where $X \in \mathbb{R}^{N \times D}$ stacks the inputs $x_n^{\top}$ as rows and $y := (y_1, \ldots, y_N)^{\top}$.
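To sanity-check this expansion numerically, here is a small sketch comparing the summed per-point Gaussian log-density (via scipy.stats.norm, an extra dependency not implied by the original) with the closed-form negative log-likelihood above; all data and names are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Small illustrative problem (sizes and parameters are hypothetical).
N, D = 5, 2
X = rng.normal(size=(N, D))
theta = rng.normal(size=D)
sigma = 0.3
y = X @ theta + sigma * rng.normal(size=N)

# log p(Y | X, theta) = sum_n log N(y_n | x_n^T theta, sigma^2)
log_lik = norm.logpdf(y, loc=X @ theta, scale=sigma).sum()

# Closed-form negative log-likelihood from the expansion above.
nll = 0.5 * N * np.log(2 * np.pi * sigma**2) \
    + ((y - X @ theta) ** 2).sum() / (2 * sigma**2)

print(np.allclose(log_lik, -nll))  # True
```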

Differentiate the objective and set the gradient to zero:

$$\nabla_{\theta}\, \|y - X\theta\|^2 = -2X^{\top}(y - X\theta) = 0 \;\Longrightarrow\; X^{\top}X\theta = X^{\top}y.$$

These are the normal equations. Since the Hessian $2X^{\top}X$ is positive semidefinite, the objective is convex, so any stationary point is a global minimum.

XXX^{\top}X 可逆,则

θMLE=(XX)1Xy.\boxed{\theta_{\mathrm{MLE}}=(X^{\top}X)^{-1}X^{\top}y}.
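A minimal NumPy sketch of this solution on synthetic data (the names `X`, `y`, `theta_mle` are illustrative, not from the original). In practice, solving the normal equations with a linear solver is cheaper and numerically more stable than forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 points in D = 3 dimensions with a known theta.
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)  # Gaussian noise model

# Closed form via the explicit inverse: (X^T X)^{-1} X^T y.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: solve the normal equations X^T X theta = X^T y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(theta_inv, theta_mle))  # True
print(theta_mle)                          # close to theta_true
```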

If $X^{\top}X$ is not invertible (e.g. when the columns of $X$ are linearly dependent), the normal equations still admit solutions, and the Moore–Penrose pseudoinverse yields the minimum-norm least-squares solution $\theta_{\mathrm{MLE}} = X^{+}y$.
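For the rank-deficient case, a short follow-up sketch (again with illustrative names): `np.linalg.pinv` and `np.linalg.lstsq` both return the minimum-norm least-squares solution on a design matrix where the explicit inverse would fail:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-deficient design: the third column duplicates the first,
# so X^T X is singular and (X^T X)^{-1} does not exist.
N = 50
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
X = np.column_stack([x1, x2, x1])
y = 2.0 * x1 - 1.0 * x2 + 0.05 * rng.normal(size=N)

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse.
theta_pinv = np.linalg.pinv(X) @ y

# lstsq returns the same minimum-norm solution for rank-deficient systems.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_pinv, theta_lstsq))  # True
```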

