Statistical Learning

Classification

Page Structure

Core Concepts

  • Discriminative versus generative classifiers
  • Logistic regression, discriminant analysis, k-NN and Naive Bayes
  • Confusion matrix, precision, recall, ROC and threshold choice

Learning Order

  1. Start from the decision boundary and the loss function.
  2. Tie metric choice to the business cost of false positives and false negatives.
  3. Explain when probability calibration matters more than raw classification accuracy.

Overview

Classification methods predict a discrete outcome, such as whether a stock price will go up or down, whether a company will default, or whether a trading signal will be positive or negative.

I. Core Classification Models

1. Logistic Regression (Discriminative Model)

Logistic Regression is a linear model used for binary classification. It models the probability of class membership by passing a linear combination of the predictors through the logistic (sigmoid) function, which maps it to a probability between 0 and 1.

  • Log-Odds: The model is linear in the log-odds (or logit):

$$\ln\left(\frac{\mathbb{P}(Y=1 \mid \mathbf{x})}{\mathbb{P}(Y=0 \mid \mathbf{x})}\right) = \beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta}$$

  • Estimation: The coefficients $\boldsymbol{\beta}$ are estimated by Maximum Likelihood Estimation (MLE). The likelihood equations have no closed-form solution, so the maximum is found numerically by an iterative optimizer.
  • Decision Boundary: The decision boundary is linear, defined by $\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta} = 0$, which is the set of points where $\mathbb{P}(Y=1 \mid \mathbf{x}) = 0.5$.
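The estimation step can be sketched in a few lines. This is a minimal illustration, assuming a single predictor and plain gradient ascent on the average log-likelihood; the function name `fit_logistic_1d`, the learning rate, the step count, and the toy data are all assumptions made for the example, and production libraries use faster second-order solvers.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, learning_rate=0.5, n_steps=5000):
    """Maximize the average log-likelihood of a one-predictor logistic model
    by gradient ascent. Returns (beta_0, beta_1)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the log-likelihood is built from residuals y - p(x).
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Hypothetical toy data: larger x makes class 1 more likely, with overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic_1d(xs, ys)
boundary = -b0 / b1  # point where beta_0 + beta_1 * x = 0, i.e. P(Y=1 | x) = 0.5
print(f"beta_0={b0:.3f}, beta_1={b1:.3f}, decision boundary at x={boundary:.3f}")
```

With one predictor the linear decision boundary collapses to a single cut-off value of x; with several predictors the same equation describes a hyperplane.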

2. Discriminant Analysis (Generative Model)

Discriminant Analysis models the distribution of the predictors $\mathbf{X}$ separately for each class $k$, $f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} \mid Y = k)$, and then uses Bayes' Theorem to obtain the posterior probability $\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x})$.
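Written out, the Bayes step takes the following form. The prior $\pi_k = \mathbb{P}(Y = k)$ and the number of classes $K$ are notation assumed here for the illustration:

$$\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{\pi_k \, f_k(\mathbf{x})}{\sum_{l=1}^{K} \pi_l \, f_l(\mathbf{x})}$$

The denominator is the same for every class, so the predicted class is simply the one with the largest $\pi_k f_k(\mathbf{x})$.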

  • Linear Discriminant Analysis (LDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a common covariance matrix $\boldsymbol{\Sigma}$ across all classes. This results in a linear decision boundary.
  • Quadratic Discriminant Analysis (QDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with its own covariance matrix $\boldsymbol{\Sigma}_k$ for each class. This results in a quadratic decision boundary.
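To make the LDA assumption concrete, here is a minimal sketch restricted to a single predictor, so the common covariance matrix reduces to one pooled variance. Under that assumption the class score is $\delta_k(x) = x\mu_k/\sigma^2 - \mu_k^2/(2\sigma^2) + \ln \pi_k$, which is linear in $x$. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean

def fit_lda_1d(xs, ys):
    """Estimate class priors, class means, and the pooled (shared) variance."""
    classes = sorted(set(ys))
    n = len(xs)
    priors = {k: ys.count(k) / n for k in classes}
    means = {k: mean(x for x, y in zip(xs, ys) if y == k) for k in classes}
    # Pooled variance: squared deviations from each point's own class mean.
    pooled_var = sum((x - means[y]) ** 2 for x, y in zip(xs, ys)) / (n - len(classes))
    return priors, means, pooled_var

def lda_predict(x, priors, means, var):
    """Return the class with the largest linear discriminant score."""
    scores = {
        k: x * means[k] / var - means[k] ** 2 / (2 * var) + math.log(priors[k])
        for k in priors
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: class 0 centred near 1.75, class 1 near 4.75.
xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

priors, means, pooled_var = fit_lda_1d(xs, ys)
print(lda_predict(3.0, priors, means, pooled_var))
print(lda_predict(3.5, priors, means, pooled_var))
```

With equal priors the boundary sits halfway between the two class means. A QDA version would estimate a separate variance per class, which adds a quadratic term in $x$ to the score.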

3. $k$-Nearest Neighbors ($k$-NN) (Non-Parametric Model)

$k$-NN is a non-parametric, instance-based learning algorithm. It classifies a new observation by finding the $k$ closest training observations (based on a distance metric such as Euclidean distance) and assigning the new observation to the most frequent class among those neighbors.

  • Key Parameter: $k$ (number of neighbors). A small $k$ gives a flexible fit with high variance (overfitting), while a large $k$ gives a smoother fit with high bias (underfitting).
  • Curse of Dimensionality: $k$-NN performance degrades rapidly as the number of features (dimensions) increases, because even the nearest neighbors tend to be far away in high-dimensional space. This is a common issue in high-dimensional financial data.
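The voting rule can be sketched as follows. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the function name `knn_predict`, the labels, and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["down", "down", "down", "up", "up", "up"]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
print(knn_predict(train_X, train_y, (0.5, 0.5), k=5))
```

In practice the features are usually standardized first, since the distance is otherwise dominated by whichever feature has the largest scale, and $k$ is chosen by cross-validation.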

4. Naive Bayes

Naive Bayes is a generative model that simplifies the estimation of $f_k(\mathbf{x})$ by making the strong assumption that the predictors are conditionally independent given the class $Y = k$, so that the joint density factorizes into a product of one-dimensional densities.

  • Advantage: Computationally efficient and performs surprisingly well in many real-world applications, especially text classification (e.g., sentiment analysis of news articles).
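The conditional-independence assumption means each feature contributes its own one-dimensional term to the class score. Below is a minimal sketch of a Gaussian variant, assuming each feature is normally distributed within each class; that distributional choice, the function names, and the toy data are assumptions for illustration (text classification typically uses multinomial or Bernoulli likelihoods instead).

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Store, per class, the prior and a (mean, variance) pair for every feature."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for col in zip(*rows):  # iterate over features, one column at a time
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
            stats.append((mu, var))
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the largest log prior plus summed per-feature log densities."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Hypothetical toy data: two features, two classes.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.4), (2.8, 0.6)]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.0)))
print(predict_gaussian_nb(model, (3.1, 0.5)))
```

Because the log densities are summed feature by feature, fitting needs only one mean and one variance per feature per class, which is where the computational efficiency comes from.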

II. Model Performance Metrics

In classification, accuracy alone is often insufficient, especially with imbalanced datasets (e.g., credit default prediction).

Confusion Matrix

A $2 \times 2$ table summarizing the model's performance on a test set.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
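The four cells can be tallied directly from paired actual and predicted labels. A minimal sketch, assuming binary labels with 1 as the positive class; the function name `confusion_counts` and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary classification problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
```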

Key Metrics

| Metric | Formula | Interpretation | Relevance in Finance |
|---|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness. | Can be misleading for imbalanced data (e.g., with a 1% default rate, a model that never predicts default still reaches 99% accuracy). |
| Precision | $\frac{TP}{TP + FP}$ | Of all predicted positives, how many were correct? | Important when the cost of a False Positive is high (e.g., a false trading signal). |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | Of all actual positives, how many were correctly identified? | Important when the cost of a False Negative is high (e.g., failing to predict a default). |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of Precision and Recall; a balanced measure. | Used to compare models when both FP and FN costs are significant. |
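The formulas translate directly into code. The sketch below uses hypothetical counts for 1,000 loans with a 1% default rate to show why accuracy alone misleads: a model that never predicts default scores 99% accuracy with zero recall. The function name and both sets of counts are made up for illustration, and undefined ratios are returned as `None` by choice.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.
    A ratio with a zero denominator is reported as None."""
    def ratio(num, den):
        return num / den if den else None

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical case A: 1,000 loans, 10 defaults, model never predicts default.
never_default = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Hypothetical case B: model flags 20 loans and catches 6 of the 10 defaults.
flagging_model = classification_metrics(tp=6, fp=14, fn=4, tn=976)

print(never_default)
print(flagging_model)
```

Case B has slightly lower accuracy than case A, yet it is the only one of the two that identifies any defaults, which is what precision, recall, and F1 bring out.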

ROC Curve and AUC

  • ROC (Receiver Operating Characteristic) Curve: Plots the True Positive Rate (Recall) against the False Positive Rate ($\frac{FP}{FP + TN}$) as the classification threshold is varied.
  • AUC (Area Under the Curve): The area under the ROC curve. It equals the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Interpretation: An AUC of 1.0 is a perfect classifier; 0.5 is no better than random guessing. Because AUC depends neither on a single threshold nor on the class ratio, it is more informative than accuracy on imbalanced datasets, although a precision-recall curve can be more revealing when positives are very rare.
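The two descriptions of AUC, area under the ROC curve and probability of ranking a positive above a negative, can be checked against each other numerically. A minimal sketch with hypothetical scores and labels; ties between a positive and a negative score are counted as one half in the ranking version, which is the usual convention.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold through each score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

auc_area = trapezoid_area(roc_points(scores, labels))
auc_rank = auc_by_ranking(scores, labels)
print(auc_area, auc_rank)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so both computations give 8/9.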

Supplementary Notes

Scores are not decisions

A classifier usually outputs a score or probability first; the actual trading or risk decision comes only after choosing a threshold tied to costs and constraints.
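One way to tie the threshold to costs is to scan candidate thresholds on held-out data and keep the one with the lowest average misclassification cost. This is a sketch under stated assumptions: the per-error costs `cost_fp` and `cost_fn`, the function names, and the scores are hypothetical, and real decisions may involve further constraints such as capacity or risk limits.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Average cost per observation when predicting positive for score >= threshold."""
    cost = 0.0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp  # false positive
        elif predicted == 0 and y == 1:
            cost += cost_fn  # false negative
    return cost / len(scores)

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Try every observed score (plus 'never predict positive') as a threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))

# Hypothetical validation scores and true labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

# Missing a positive is 10x worse than a false alarm (e.g. default screening).
t_recall_heavy = best_threshold(scores, labels, cost_fp=1, cost_fn=10)

# A false alarm is 10x worse than a miss (e.g. costly false trading signals).
t_precision_heavy = best_threshold(scores, labels, cost_fp=10, cost_fn=1)

print(t_recall_heavy, t_precision_heavy)
```

The same scores lead to a low threshold when false negatives are expensive and a high one when false positives are expensive, which is the sense in which the score is not yet a decision.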

Calibration matters

For position sizing, risk limits, and alert systems, calibrated probabilities can matter more than raw accuracy because downstream actions use the probability levels directly.
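Calibration can be checked by comparing predicted probabilities with observed frequencies. The sketch below uses two common diagnostics, the Brier score and a binned reliability table, on two hypothetical models that make identical class predictions at a 0.5 threshold (so identical accuracy) but differ in how trustworthy their probability levels are. All names and data are assumptions for illustration.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def reliability_table(probs, labels, n_bins=5):
    """Group predictions into equal-width bins and return, for each non-empty bin,
    (mean predicted probability, observed positive frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows

# Hypothetical outcomes and two models with the same predictions at threshold 0.5.
labels = [0, 0, 0, 1, 1, 1, 1, 0]
calibrated = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
overconfident = [0.01, 0.01, 0.01, 0.01, 0.99, 0.99, 0.99, 0.99]

print(brier_score(calibrated, labels), brier_score(overconfident, labels))
print(reliability_table(calibrated, labels, n_bins=4))
print(reliability_table(overconfident, labels, n_bins=4))
```

For the first model, events predicted at 25% happen 25% of the time and events predicted at 75% happen 75% of the time; the second model claims near-certainty while being wrong a quarter of the time, which would distort any position size or limit computed from its probabilities.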

Class imbalance changes metrics

Accuracy can be misleading when positives are rare. Precision, recall, ROC, PR curves, and expected utility should all be tied to the cost of each error type.