Linear Regression Metrics

Linear Regression: Coefficients, Variance, LASSO, Ridge

Topic: Statistics
Difficulty: L4
Source: QuantQuestion

Question

Explain OLS coefficients, their variance, $R^2$, adjusted $R^2$, and compare LASSO vs Ridge.

Explanation

OLS coefficients: Under the linear model $y=X\beta+\varepsilon$ (an intercept is handled by including a column of ones in $X$), the least-squares estimate is

$$\hat\beta=(X^TX)^{-1}X^Ty,$$

provided that $X^TX$ is invertible, i.e. the columns of $X$ are linearly independent.
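The formula above can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the coefficients `beta_true`, the noise level, the seed, and all variable names are assumptions made for the example, not part of the original question. The sketch solves the normal equations directly instead of forming the inverse, which is usually the numerically safer choice.

```python
import numpy as np

# Synthetic data (hypothetical): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])  # column of ones = intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq           :", beta_lstsq)
```

Both routes should agree to floating-point precision, and with this much data the estimates should land close to `beta_true`.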

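Coefficient variance: If the errors satisfy $\mathbb{E}[\varepsilon\mid X]=0$ and $\mathrm{Var}(\varepsilon\mid X)=\sigma^2 I$ (homoscedastic and mutually uncorrelated), then

$$\mathrm{Var}(\hat\beta\mid X)=\sigma^2 (X^TX)^{-1}.$$

In practice $\sigma^2$ is unknown and is estimated by the residual variance $\hat\sigma^2=\mathrm{SSE}/(n-p)$, where SSE is the sum of squared residuals, $n$ is the sample size, and $p$ is the number of columns of $X$. The standard error of $\hat\beta_j$ is then $\sqrt{\hat\sigma^2\,[(X^TX)^{-1}]_{jj}}$, the square root of the $j$-th diagonal element of the estimated covariance matrix.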
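As a rough illustration of how the standard errors are obtained, the sketch below estimates $\sigma^2$ from the residuals and reads the standard errors off the diagonal of the estimated covariance matrix. The synthetic data, the true noise level `sigma`, and the variable names are assumptions for the example only.

```python
import numpy as np

# Synthetic data (hypothetical) with known noise level sigma
rng = np.random.default_rng(1)
n, sigma = 500, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=sigma, size=n)

p = X.shape[1]                      # number of parameters, intercept included
XtX_inv = np.linalg.inv(X.T @ X)    # explicit inverse is needed for the covariance
beta_hat = XtX_inv @ X.T @ y

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)  # SSE / (n - p)
cov_beta = sigma2_hat * XtX_inv               # estimated Var(beta_hat | X)
std_err = np.sqrt(np.diag(cov_beta))          # one standard error per coefficient
t_stats = beta_hat / std_err                  # t-statistics for H0: beta_j = 0

print("sigma^2 estimate:", sigma2_hat)
print("standard errors :", std_err)
```

The t-statistics in the last step are a common follow-up use of the standard errors; they are included here only to show what the diagonal elements are typically used for.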

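$R^2$: Measures goodness of fit as the fraction of the variation in $y$ that the model explains,

$$R^2=1-\frac{\mathrm{SSE}}{\mathrm{SST}}=\frac{\mathrm{SSR}}{\mathrm{SST}},$$

where $\mathrm{SSE}=\sum_i (y_i-\hat y_i)^2$ is the sum of squared residuals, $\mathrm{SSR}=\sum_i (\hat y_i-\bar y)^2$ is the regression (explained) sum of squares, and $\mathrm{SST}=\sum_i (y_i-\bar y)^2$ is the total sum of squares. The second equality uses $\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE}$, which holds when the model includes an intercept. Adding predictors never decreases the in-sample $R^2$.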
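A small numerical check may make both claims concrete: the decomposition of SST into SSR and SSE, and the fact that an extra predictor cannot lower the in-sample $R^2$ even when it is pure noise. The helper `fit_and_decompose`, the synthetic data, and the noise predictor are hypothetical and exist only for this sketch.

```python
import numpy as np

def fit_and_decompose(X, y):
    """Fit OLS of y on the columns of X and return (SSE, SSR, SST)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return sse, ssr, sst

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])     # intercept + one real predictor
noise_col = rng.normal(size=n)                 # predictor unrelated to y
X_big = np.column_stack([X_small, noise_col])  # same model plus the noise column

sse_s, ssr_s, sst = fit_and_decompose(X_small, y)
sse_b, ssr_b, _ = fit_and_decompose(X_big, y)

r2_small = 1.0 - sse_s / sst
r2_big = 1.0 - sse_b / sst
print("R^2 without noise predictor:", r2_small)
print("R^2 with noise predictor   :", r2_big)
```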

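Adjusted $R^2$: Penalizes the inclusion of too many predictors by comparing variance estimates rather than raw sums of squares,

$$\bar R^2 = 1-\frac{\mathrm{SSE}/(n-p)}{\mathrm{SST}/(n-1)},$$

where $n$ is the sample size and $p$ is the number of parameters (including the intercept). Unlike $R^2$, it can decrease when an added predictor reduces SSE by too little to offset the lost degree of freedom.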
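The arithmetic can be checked with a short example. The numbers below are hypothetical and chosen only so that the extra predictor in model B reduces SSE very slightly; the function name `adjusted_r2` is likewise an illustrative choice.

```python
def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2 for n samples and p parameters (intercept included)."""
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

# Hypothetical numbers chosen only to illustrate the penalty
n, sst = 50, 100.0
sse_a, p_a = 40.0, 3   # model A: intercept + 2 predictors
sse_b, p_b = 39.8, 4   # model B: one more predictor, slightly smaller SSE

r2_a, r2_b = 1 - sse_a / sst, 1 - sse_b / sst
adj_a = adjusted_r2(sse_a, sst, n, p_a)
adj_b = adjusted_r2(sse_b, sst, n, p_b)

print(f"R^2:          A = {r2_a:.4f}, B = {r2_b:.4f}")
print(f"adjusted R^2: A = {adj_a:.4f}, B = {adj_b:.4f}")
```

Here $R^2$ rises from 0.600 to 0.602, while adjusted $R^2$ falls from about 0.583 to about 0.576, so the adjusted measure would favor the smaller model.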

Ridge vs LASSO:

Ridge (L2): $\min_\beta \|y-X\beta\|_2^2+\lambda\|\beta\|_2^2$. Shrinks all coefficients continuously toward zero, which reduces variance, but typically does not set any coefficient exactly to zero.

LASSO (L1): $\min_\beta \|y-X\beta\|_2^2+\lambda\|\beta\|_1$. Tends to produce sparse solutions (some coefficients are set exactly to zero), so it can also perform variable selection.
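The contrast between the two penalties can be seen in a small experiment. The sketch below is a minimal illustration, not a production solver: it uses the closed-form ridge solution and a plain cyclic coordinate-descent loop for LASSO, both written against the objectives exactly as stated above. It assumes centered data so that no intercept is fitted or penalized; the function names, the synthetic data, and the value `lam = 200.0` are hypothetical choices. Library implementations often scale the loss differently, so their penalty parameter is not directly comparable to $\lambda$ here.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of ||y - X b||_2^2 + lam * ||b||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for ||y - X b||_2^2 + lam * ||b||_1."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: what is left of y once all other features are used
            r_j = y - X @ b + X[:, j] * b[j]
            # One-dimensional LASSO solution; the threshold is lam / 2
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return b

# Synthetic data (hypothetical): only the first two of six features matter
rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)              # centered features, so no intercept is fitted
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()

lam = 200.0
b_ols = ridge(X, y, 0.0)         # lam = 0 recovers OLS
b_ridge = ridge(X, y, lam)
b_lasso = lasso(X, y, lam)

print("OLS  :", np.round(b_ols, 3))
print("ridge:", np.round(b_ridge, 3))
print("lasso:", np.round(b_lasso, 3))
```

With these settings the ridge coefficients should all be shrunk but nonzero, while LASSO should keep the two informative features and set the four irrelevant ones exactly to zero.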

Both methods accept some bias in exchange for lower variance (the bias-variance tradeoff). Here $\lambda\ge 0$ controls the strength of the penalty: $\lambda=0$ recovers OLS, and a larger $\lambda$ gives stronger shrinkage.
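A worked special case may help show why only LASSO produces exact zeros. Assume, purely for illustration, an orthonormal design ($X^TX=I$, which is not assumed elsewhere in this explanation), and write $\hat\beta^{\mathrm{OLS}}_j$ for the OLS estimate of the $j$-th coefficient. Both objectives then separate coordinate by coordinate, and minimizing each one-dimensional problem gives

$$
\hat\beta^{\mathrm{ridge}}_j=\frac{\hat\beta^{\mathrm{OLS}}_j}{1+\lambda},
\qquad
\hat\beta^{\mathrm{lasso}}_j=\operatorname{sign}\!\big(\hat\beta^{\mathrm{OLS}}_j\big)\,\max\!\Big(\big|\hat\beta^{\mathrm{OLS}}_j\big|-\tfrac{\lambda}{2},\,0\Big).
$$

Ridge rescales every coefficient by the same factor $1/(1+\lambda)$, so a nonzero OLS coefficient stays nonzero for any finite $\lambda$. LASSO instead subtracts a fixed amount $\lambda/2$ from the magnitude and sets to zero every coefficient with $|\hat\beta^{\mathrm{OLS}}_j|\le\lambda/2$.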