
Statistical Learning

Linear Regression

Page Structure

Core Concepts

  • Simple and multiple linear regression
  • OLS, Gauss-Markov assumptions and standard errors
  • Inference, model assessment, robust standard errors and model selection

Study Order

  1. State precisely what OLS minimizes and what each assumption guarantees.
  2. Separate prediction quality from coefficient interpretation.
  3. Treat heteroscedasticity, multicollinearity and omitted variables as common failure modes.

Overview

Linear regression forms the basis for models such as the Capital Asset Pricing Model (CAPM), factor models, and many trading strategies.

I. Simple and Multiple Linear Regression

Model Formulation

The core assumption is a linear relationship between a dependent variable $Y$ and one or more independent variables $X_i$:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$$

  • $Y$: Dependent variable (e.g., stock return)
  • $X_i$: Independent variables or predictors (e.g., market return, factors)
  • $\beta_0$: Intercept
  • $\beta_i$: Regression coefficients (slopes)
  • $\epsilon$: Error term, representing unmodeled variation (the fitted residual $y - \hat{y}$ is its sample counterpart)
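As a concrete instance of the model above, the CAPM mentioned in the overview can be written as a simple (one-predictor) regression. This is only an illustrative mapping; the symbols $R_{i,t}$, $R_{m,t}$, $R_{f,t}$ and $\alpha_i$ are notation assumed here and do not appear elsewhere on this page.

```latex
R_{i,t} - R_{f,t} = \alpha_i + \beta_i \,(R_{m,t} - R_{f,t}) + \epsilon_{i,t}
% Y       = R_{i,t} - R_{f,t}   (excess return of asset i)
% X_1     = R_{m,t} - R_{f,t}   (excess return of the market)
% \beta_0 = \alpha_i,  \beta_1 = \beta_i
```

Read this way, the slope is the asset's market beta and the intercept is its alpha.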

Ordinary Least Squares (OLS) Estimation

OLS finds the coefficients $\hat{\beta}$ that minimize the Residual Sum of Squares (RSS) over the $m$ observations:

$$RSS = \sum_{i=1}^m (y_i - \hat{y}_i)^2$$

Matrix Form (Multiple Regression): Given the data matrix $\mathbf{X}$ (including a column of ones for the intercept) and the response vector $\mathbf{y}$, the OLS estimator is:

$$\hat{\beta} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}$$

Under homoscedastic, uncorrelated errors, the variance-covariance matrix of the estimated coefficients is:

$$\mathrm{Var}(\hat{\beta} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\intercal \mathbf{X})^{-1}$$

where $\sigma^2$ is the variance of the error term, estimated by $\hat{\sigma}^2 = \frac{1}{m-p-1} \sum_{i=1}^m (y_i - \hat{y}_i)^2$, with $p$ the number of predictors.
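The matrix-form estimator and its standard errors can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a production routine: the sample size, true coefficients and noise level are all assumptions made for the example.

```python
import numpy as np

# Synthetic data: every number below is an illustrative assumption.
rng = np.random.default_rng(0)
m, p = 500, 2                                  # observations, predictors
X_raw = rng.normal(size=(m, p))
beta_true = np.array([0.5, 1.2, -0.7])         # intercept, then two slopes
X = np.column_stack([np.ones(m), X_raw])       # prepend the column of ones
y = X @ beta_true + rng.normal(scale=0.3, size=m)

# OLS estimator: solve the normal equations (X'X) beta = X'y.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Residual variance with m - p - 1 degrees of freedom.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (m - p - 1)

# Estimated covariance matrix and standard errors of the coefficients.
cov_beta = sigma2_hat * np.linalg.inv(XtX)
se_beta = np.sqrt(np.diag(cov_beta))

print("beta_hat:  ", beta_hat)
print("std errors:", se_beta)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse for the point estimate; for ill-conditioned designs, `np.linalg.lstsq` is usually the safer choice.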

II. The Gauss-Markov Theorem and OLS Assumptions

The OLS estimator $\hat{\beta}$ is the Best Linear Unbiased Estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators, if the following Gauss-Markov assumptions hold.

| Assumption | Description | Financial Implication (Violation) |
| --- | --- | --- |
| 1. Linearity | The model is linear in the parameters $\beta$. | Model misspecification (e.g., ignoring non-linear relationships). |
| 2. Strict Exogeneity | $\mathbb{E}[\epsilon_i \mid \mathbf{X}] = 0$. The errors have zero mean conditional on the predictors, which implies they are uncorrelated with them. | Endogeneity: a crucial violation in finance (e.g., simultaneity, omitted variable bias). Leads to biased and inconsistent estimators. |
| 3. No Perfect Multicollinearity | $\mathbf{X}^\intercal \mathbf{X}$ is invertible (i.e., no perfect linear relationship between predictors). | Perfect collinearity leaves the coefficients unidentified; near-collinearity inflates standard errors and makes coefficient estimates unstable. |
| 4. Homoscedasticity | $\mathrm{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2$. The error variance is constant across all observations. | Heteroscedasticity: common in finance (e.g., periods of large absolute returns often have high volatility). OLS remains unbiased but is no longer efficient, and the conventional standard errors are incorrect, leading to invalid inference. |
| 5. No Autocorrelation | $\mathrm{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0$ for $i \ne j$. Errors are uncorrelated across observations. | Autocorrelation: common in time series data (e.g., momentum strategies). OLS remains unbiased, but the conventional standard errors are incorrect. |

Note: The OLS estimator is BLUE under assumptions 1-5. If we add the assumption that $\epsilon \sim N(0, \sigma^2)$, the OLS estimator is also the Maximum Likelihood Estimator (MLE), and the $t$- and $F$-statistics have exact finite-sample distributions.
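Which assumption is responsible for which property can be seen from a short derivation. The sketch below writes the model as $\mathbf{y} = \mathbf{X}\beta + \epsilon$ and uses the OLS estimator $\hat{\beta} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}$; it is the standard textbook argument rather than anything specific to this page.

```latex
\begin{aligned}
\hat{\beta}
  &= (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal (\mathbf{X}\beta + \epsilon)
   = \beta + (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \epsilon
   && \text{(linearity, invertibility)} \\
\mathbb{E}[\hat{\beta} \mid \mathbf{X}]
  &= \beta + (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \,
     \mathbb{E}[\epsilon \mid \mathbf{X}] = \beta
   && \text{(strict exogeneity)} \\
\mathrm{Var}(\hat{\beta} \mid \mathbf{X})
  &= (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \,
     \mathrm{Var}(\epsilon \mid \mathbf{X}) \,
     \mathbf{X} (\mathbf{X}^\intercal \mathbf{X})^{-1}
   = \sigma^2 (\mathbf{X}^\intercal \mathbf{X})^{-1}
   && \text{(homoscedasticity, no autocorrelation)}
\end{aligned}
```

Unbiasedness uses only assumptions 1-3. Assumptions 4 and 5 enter solely through $\mathrm{Var}(\epsilon \mid \mathbf{X}) = \sigma^2 \mathbf{I}$, which is why their violation damages the standard errors but not the point estimate.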

III. Model Assessment and Inference

Here $TSS = \sum_{i=1}^m (y_i - \bar{y})^2$ is the total sum of squares.

| Term | Formula | Intuition and Relevance |
| --- | --- | --- |
| $R^2$ (Coefficient of Determination) | $1 - \frac{RSS}{TSS}$ | Proportion of the in-sample variance of $Y$ explained by $X$. In finance, a low $R^2$ is common and expected. |
| Adjusted $R^2$ | $1 - \frac{RSS/(m-p-1)}{TSS/(m-1)}$ | Penalizes the inclusion of irrelevant predictors; a better measure for comparing models with different numbers of predictors ($p$). |
| Standard Error (SE) of $\hat{\beta}_i$ | $\sqrt{\text{Var}(\hat{\beta}_i)}$ | Used to construct confidence intervals and perform hypothesis tests on individual coefficients. |
| $t$-statistic | $t = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}$ | Used to test the null hypothesis $H_0: \beta_i = 0$. Under $H_0$ and normal errors, it follows a $t$-distribution with $m-p-1$ degrees of freedom. |
| $F$-statistic | $F = \frac{(TSS - RSS)/p}{RSS/(m-p-1)}$ | Used to test the overall significance of the model, $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$. |
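The quantities in the table can be computed directly from a fitted model. The following is a sketch on synthetic data; the design, the true coefficients (one of which is set to zero on purpose) and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 200, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, p))])
beta_true = np.array([0.1, 0.8, 0.0, -0.5])    # the second slope is truly zero
y = X @ beta_true + rng.normal(size=m)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

rss = resid @ resid                            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
dof = m - p - 1                                # residual degrees of freedom

r2 = 1 - rss / tss
adj_r2 = 1 - (rss / dof) / (tss / (m - 1))

# Classical standard errors and t-statistics for H0: beta_i = 0.
sigma2_hat = rss / dof
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se

# F-statistic for H0: all slopes are zero.
f_stat = ((tss - rss) / p) / (rss / dof)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, F = {f_stat:.1f}")
print("t-statistics:", np.round(t_stats, 2))
```

Turning `t_stats` and `f_stat` into p-values requires the $t$ and $F$ distribution functions, for example from `scipy.stats`, which this sketch leaves out to stay dependency-free.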

IV. Dealing with Violations and Model Selection

Robust Standard Errors

When heteroscedasticity or autocorrelation (or both) is present, the conventional OLS standard errors are biased and inconsistent. Heteroscedasticity-Consistent (HC) standard errors (e.g., White's) correct for non-constant error variance, and Heteroscedasticity- and Autocorrelation-Consistent (HAC) standard errors (e.g., Newey-West) additionally correct for serial correlation. Both leave the coefficient estimates unchanged and adjust only the standard errors, allowing for asymptotically valid statistical inference.
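To make the correction concrete, here is a minimal sketch of White's HC0 "sandwich" estimator next to the classical formula. The data-generating process, in which the noise scale grows with $|x|$, is an assumption chosen so that the two sets of standard errors visibly differ; Newey-West, which adds weighted autocovariance terms to the middle matrix, is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1000
x = rng.normal(size=m)
X = np.column_stack([np.ones(m), x])
# Heteroscedastic errors: the noise scale grows with |x| (an assumed design).
eps = rng.normal(size=m) * (0.5 + np.abs(x))
y = 1.0 + 2.0 * x + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical covariance sigma^2 (X'X)^{-1}, valid only under homoscedasticity.
sigma2_hat = resid @ resid / (m - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# White (HC0) sandwich: (X'X)^{-1} [X' diag(e_i^2) X] (X'X)^{-1}.
meat = (X * resid[:, None] ** 2).T @ X
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("HC0 SE:      ", se_hc0)
```

In this design the classical slope standard error understates the true sampling variability, so the HC0 value comes out larger. In practice one would typically rely on a tested library implementation rather than hand-rolled code.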

Regularization Methods (Shrinkage)

These methods address multicollinearity and overfitting by adding a penalty term to the OLS objective function, shrinking the coefficients toward zero. This reduces the variance of the coefficient estimates at the cost of introducing some bias (the bias-variance tradeoff). The tuning parameter $\lambda \ge 0$ sets the penalty strength and is typically chosen by cross-validation; the intercept is usually left unpenalized.

| Method | Penalty Term | Objective Function | Effect |
| --- | --- | --- | --- |
| Ridge Regression | $\lambda \sum_{j=1}^p \beta_j^2$ (squared L2 norm) | $RSS + \lambda \sum_{j=1}^p \beta_j^2$ | Shrinks all coefficients toward zero without setting them exactly to zero; effective for multicollinearity. |
| Lasso Regression | $\lambda \sum_{j=1}^p \lvert \beta_j \rvert$ (L1 norm) | $RSS + \lambda \sum_{j=1}^p \lvert \beta_j \rvert$ | Shrinks some coefficients exactly to zero; performs feature selection and works well for sparse models. |
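Ridge regression has a closed-form solution, $(\mathbf{X}^\intercal \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\intercal \mathbf{y}$, which makes its effect on collinear predictors easy to see. The sketch below uses an assumed design with two nearly identical predictors and a hypothetical helper named `ridge`; the lasso has no closed form and needs an iterative solver, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
z = rng.normal(size=m)
# Two nearly collinear predictors plus one independent one (assumed design).
X = np.column_stack([
    z + 0.01 * rng.normal(size=m),
    z + 0.01 * rng.normal(size=m),
    rng.normal(size=m),
])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=m)

# Center the data so that the intercept is left unpenalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge slopes (X'X + lam I)^{-1} X'y on centered data."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

beta_ols = ridge(Xc, yc, 0.0)       # lam = 0 recovers OLS
beta_ridge = ridge(Xc, yc, 10.0)    # lam = 10 is an arbitrary choice

print("OLS slopes:  ", np.round(beta_ols, 3))
print("ridge slopes:", np.round(beta_ridge, 3))
```

With collinear columns the OLS slopes on the first two predictors can be far from their true values while their sum stays well determined; the ridge penalty pulls the pair back toward similar values.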

Bias-Variance Tradeoff

The expected prediction error (EPE) of a model $\hat{f}(x)$ at a point $x$ can be decomposed, for $Y = f(x) + \epsilon$ with $\mathrm{Var}(\epsilon) = \sigma^2$, as:

$$\mathrm{EPE}(x) = \mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \sigma^2 + \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big)$$

that is, irreducible noise plus squared bias plus variance.

  • Bias: Error from approximating a real-world function $f$ with a simpler model $\hat{f}$.
  • Variance: Error from the model being too sensitive to the particular training sample.
  • Tradeoff: More complex models (e.g., high-degree polynomials) have low bias but high variance and tend to overfit. Simpler models (e.g., a linear fit with few predictors) have higher bias but lower variance and tend to underfit. Regularization trades the two off through the penalty strength $\lambda$.
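The tradeoff can be checked numerically by refitting models of different complexity on many simulated training sets and looking at the spread of their predictions at one point. Everything in this sketch is assumed for illustration: the "true" function $\sin(2\pi x)$, the noise level, the evaluation point and the polynomial degrees.

```python
import numpy as np

rng = np.random.default_rng(4)

def f_true(x):
    return np.sin(2 * np.pi * x)        # assumed "real-world" function

n_train, n_sims, noise_sd = 30, 500, 0.3
x_train = np.linspace(0, 1, n_train)    # fixed design across simulations
x0 = 0.3                                # point at which predictions are compared

def bias2_and_variance(degree):
    """Monte Carlo estimate of squared bias and variance of f_hat(x0)."""
    preds = np.empty(n_sims)
    for s in range(n_sims):
        y_train = f_true(x_train) + rng.normal(scale=noise_sd, size=n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[s] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - f_true(x0)) ** 2
    return bias2, preds.var()

results = {d: bias2_and_variance(d) for d in (1, 3, 7)}
for d, (b2, v) in results.items():
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```

The straight-line fit should show the largest squared bias and the smallest variance, and the degree-7 fit the reverse; adding `noise_sd ** 2` to each pair gives the estimated EPE at `x0`.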

Supplementary Notes

Separate estimand and estimator

OLS estimates a population projection coefficient under a chosen feature set. Causal interpretation requires stronger assumptions than good in-sample fit.
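A small simulation shows the gap between a projection coefficient and a structural one. In this assumed setup, `x2` affects `y` and is correlated with `x1`; leaving it out changes what the slope on `x1` estimates, even though OLS is doing exactly what it is designed to do.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 50_000
x1 = rng.normal(size=m)
x2 = 0.6 * x1 + rng.normal(size=m)          # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=m)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(m)
beta_full = ols(np.column_stack([ones, x1, x2]), y)
beta_short = ols(np.column_stack([ones, x1]), y)    # x2 omitted

# Projection of y on x1 alone has slope 1.0 + 2.0 * 0.6 = 2.2.
print("slope on x1, full model: ", round(beta_full[1], 3))
print("slope on x1, x2 omitted: ", round(beta_short[1], 3))
```

Both regressions are valid projections; only the full one recovers the structural coefficient of 1.0 used to generate the data.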

Assumptions have different jobs

Linearity defines the model, exogeneity controls bias, homoscedasticity affects standard errors, and no perfect collinearity ensures identification.

Diagnostics are part of the answer

A professional regression answer covers residual structure, leverage points, omitted variables, and whether robust or clustered standard errors are needed.
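As one example of such a diagnostic, leverage can be read off the diagonal of the hat matrix $\mathbf{H} = \mathbf{X} (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal$. The sketch below plants a single extreme predictor value in otherwise well-behaved synthetic data; the threshold of twice the average leverage is a common rule of thumb, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 100
x = rng.normal(size=m)
x[0] = 8.0                                  # plant one extreme predictor value
X = np.column_stack([np.ones(m), x])
y = 0.5 + 1.5 * x + rng.normal(size=m)

# Leverage h_ii: diagonal of H = X (X'X)^{-1} X', without forming H itself.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Rule of thumb: flag observations with leverage above 2 * (p + 1) / m.
k = X.shape[1]                              # number of columns, p + 1
flagged = np.where(leverage > 2 * k / m)[0]

print("sum of leverages:", round(leverage.sum(), 6))
print("flagged observations:", flagged)
```

The leverages always sum to the number of columns of $\mathbf{X}$, which is a quick sanity check on the computation. High leverage alone does not make a point harmful; it becomes influential when it also has a large residual.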