Perfect Samples

Topic: Statistics
Difficulty: L3
Source: OpenQuant

Problem

Assume that you are taking samples of size $s$ from a population. For each sample, let $p$ represent the probability that every point in that sample is an inlier and let $e$ represent the proportion of points in the sample that are outliers. Derive an equation in terms of $s$ , $p$ and $e$ to calculate the number of samples ( $N$ ) needed in order to be 99% confident that there is at least one sample drawn that contains no outliers.

Explanation

The problem states that the proportion of points in a sample that are outliers is $e$ , and therefore the proportion of points in the sample that are not outliers can be represented as $1-e$ .

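Since we want to be 99% confident that at least one sample is completely free of outliers, we need $P(\textrm{Every Sample Contains an Outlier}) \leq 0.01$ . Treating each of the $s$ points as an inlier independently with probability $1-e$ , treating the $N$ samples as independent of one another, and writing the target confidence as $p = 0.99$ :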

\begin{equation*}
\begin{aligned}
P(\textrm{Sample Contains 0 Outliers}) &= (1-e)^s \\
P(\textrm{Sample Contains at Least 1 Outlier}) &= 1 - (1-e)^s \\
P(\textrm{All Samples Contain at Least 1 Outlier}) &= (1 - (1-e)^s)^N \\
P(\textrm{All Samples Contain at Least 1 Outlier}) &\leq 1-p \\
(1 - (1-e)^s)^N &\leq 1-p
\end{aligned}
\end{equation*}
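The closed form $(1 - (1-e)^s)^N$ can be sanity-checked with a small simulation. The sketch below is illustrative only: it assumes every point is an outlier independently with probability $e$, and the function name, the parameter values, and the fixed seed are all hypothetical choices rather than part of the original problem.

```python
import random


def all_samples_contaminated_rate(s, e, n_samples, trials=20000, seed=0):
    """Estimate P(every one of n_samples samples has >= 1 outlier).

    Assumption: each of the s points in a sample is an outlier
    independently with probability e.
    """
    rng = random.Random(seed)
    contaminated_runs = 0
    for _ in range(trials):
        # A sample is clean when none of its s points is an outlier.
        any_clean = any(
            all(rng.random() >= e for _ in range(s))
            for _ in range(n_samples)
        )
        if not any_clean:
            contaminated_runs += 1
    return contaminated_runs / trials


s, e, n_samples = 2, 0.5, 5
closed_form = (1 - (1 - e) ** s) ** n_samples
estimate = all_samples_contaminated_rate(s, e, n_samples)
print(f"closed form: {closed_form:.4f}, simulated: {estimate:.4f}")
```

With enough trials the simulated rate should land close to the closed-form value; a persistent gap would suggest the independence assumption does not match how the samples are actually drawn.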

This inequality caps the probability that every sample contains an outlier at $1-p$ , which is 1% for $p = 0.99$ . Taking logarithms of both sides and dividing by $\log(1 - (1-e)^s)$ , which is negative and therefore flips the inequality, gives the number of samples we would need:

\begin{equation*}
\begin{aligned}
(1 - (1-e)^s)^N &\leq 1-p \\
N &\geq \frac{\log(1-p)}{\log(1 - (1-e)^s)} \\
&\boxed{N \geq \frac{\log(1-0.99)}{\log(1 - (1-e)^s)}}
\end{aligned}
\end{equation*}
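To turn the bound into a concrete count, a minimal sketch follows. It takes the smallest integer $N$ with $(1 - (1-e)^s)^N \leq 1-p$ ; the function name `required_samples` and the example inputs are hypothetical, and `math.log1p` is used only so that a very small $(1-e)^s$ does not round the denominator to zero.

```python
import math


def required_samples(s, e, p=0.99):
    """Smallest integer N with (1 - (1 - e)**s)**N <= 1 - p.

    s: points per sample, e: outlier proportion, p: target confidence.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    if not 0 <= e < 1:
        raise ValueError("e must satisfy 0 <= e < 1")
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")

    clean = (1 - e) ** s  # probability that one sample has no outliers
    if clean == 1.0:
        return 1  # no outliers at all: the first sample is already clean

    # log1p(-clean) equals log(1 - clean) but stays accurate for tiny clean.
    n = math.log(1 - p) / math.log1p(-clean)
    return max(1, math.ceil(n))


print(required_samples(s=2, e=0.5))
print(required_samples(s=4, e=0.5))
```

For $s = 2$ and $e = 0.5$ this gives 17, since $0.75^{16} \approx 0.01002$ is still above 1% while $0.75^{17} \approx 0.0075$ is below it; for $s = 4$ and $e = 0.5$ it gives 72. The count grows quickly with $s$ because $(1-e)^s$ shrinks geometrically.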