Perfect Samples

Topic
Statistics
Difficulty
L3

Problem Details

Assume that you are taking samples of size $s$ from a population. For each sample, let $p$ represent the probability that every point in that sample is an inlier and let $e$ represent the proportion of points in the sample that are outliers. Derive an equation in terms of $s$, $p$, and $e$ to calculate the number of samples ($N$) needed in order to be 99% confident that there is at least one sample drawn that contains no outliers.

Explanation

The problem states that the proportion of points in a sample that are outliers is $e$, so the proportion of points that are not outliers is $1-e$. Assuming the $s$ points in a sample are drawn independently, each point is an inlier with probability $1-e$.

Since we want to be 99% confident of getting at least one sample that is completely free of outliers, we need $P(\textrm{Every Sample Contains an Outlier}) \leq 0.01$. In the derivation that follows, $p$ denotes this required confidence level, so $1 - p = 0.01$.

$$
\begin{aligned}
P(\textrm{Sample Contains 0 Outliers}) &= (1-e)^s \\
P(\textrm{Sample Contains at Least 1 Outlier}) &= 1 - (1-e)^s \\
P(\textrm{All Samples Contain at Least 1 Outlier}) &= (1 - (1-e)^s)^N \\
P(\textrm{All Samples Contain at Least 1 Outlier}) &\leq 1-p \\
(1 - (1-e)^s)^N &\leq 1-p
\end{aligned}
$$
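As a quick numerical check, the probabilities above can be evaluated directly. This is a minimal sketch: the function names `p_clean_sample` and `p_all_contaminated` and the values `e = 0.5`, `s = 4` are illustrative assumptions, not part of the original problem, and the samples are assumed to be independent.

```python
def p_clean_sample(e: float, s: int) -> float:
    """P(a single sample of s points contains no outliers) = (1 - e)^s."""
    return (1.0 - e) ** s


def p_all_contaminated(e: float, s: int, n: int) -> float:
    """P(all n independent samples contain at least one outlier)."""
    return (1.0 - p_clean_sample(e, s)) ** n


# Illustrative values (assumed, not from the problem statement).
e, s = 0.5, 4
for n in (1, 10, 50, 72):
    # The printed probability shrinks as more samples are drawn.
    print(n, p_all_contaminated(e, s, n))
```

The failure probability decreases geometrically in the number of samples, which is why a logarithm appears once the constraint is solved for $N$.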

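We now have an inequality that restricts the probability that all $N$ samples contain an outlier to at most 1%. Taking logarithms of both sides and dividing by $\log(1 - (1-e)^s)$, which is negative and therefore reverses the inequality, we can solve for $N$ to find the number of samples we would need: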

$$
\begin{aligned}
(1 - (1-e)^s)^N &\leq 1-p \\
N \log(1 - (1-e)^s) &\leq \log(1-p) \\
N &\geq \frac{\log(1-p)}{\log(1 - (1-e)^s)}
\end{aligned}
$$

$$
\boxed{N \geq \frac{\log(1-0.99)}{\log(1 - (1-e)^s)}}
$$
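The final bound can be turned into a small helper that returns the smallest whole number of samples satisfying it. This is a sketch under the same independence assumption; the function name `required_samples`, its argument checks, and the example values are hypothetical and not part of the original solution.

```python
import math


def required_samples(e: float, s: int, p: float = 0.99) -> int:
    """Smallest integer N with (1 - (1 - e)**s)**N <= 1 - p.

    e: outlier proportion, s: sample size, p: required confidence.
    """
    if not 0.0 <= e < 1.0:
        raise ValueError("e must satisfy 0 <= e < 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    if s < 1:
        raise ValueError("s must be at least 1")

    clean = (1.0 - e) ** s  # P(one sample contains no outliers)
    if clean == 1.0:
        return 1  # no outliers at all: the first sample is already clean

    # log1p(-clean) computes log(1 - clean) accurately when clean is small.
    n = math.log(1.0 - p) / math.log1p(-clean)
    return max(1, math.ceil(n))


# Illustrative inputs (assumed): half the points are outliers, samples of 4.
print(required_samples(e=0.5, s=4))
```

Rounding up with `math.ceil` reflects that $N$ must be a whole number of samples; for `e = 0.5` and `s = 4` the ratio is about 71.4, so 72 samples are needed.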