Original Explanation
The problem states that the proportion of points in a sample that are outliers is $e$, so the proportion of points that are not outliers is $1-e$.
Since we want to be 99% confident that we get at least one sample that is completely free of outliers, we need $P(\text{every sample contains an outlier}) \le 0.01$.
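Let $s$ be the number of points in each sample, $N$ the number of samples, and $p = 0.99$ the desired confidence. Treating the points of a sample, and the samples themselves, as independent draws:

$$
\begin{aligned}
P(\text{sample contains 0 outliers}) &= (1-e)^s \\
P(\text{sample contains at least 1 outlier}) &= 1-(1-e)^s \\
P(\text{all samples contain at least 1 outlier}) &= \left(1-(1-e)^s\right)^N \\
P(\text{all samples contain at least 1 outlier}) &\le 1-p \\
\left(1-(1-e)^s\right)^N &\le 1-p
\end{aligned}
$$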
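We now have an inequality that limits the probability that every sample contains an outlier to at most 1%. Taking logarithms of both sides and solving for $N$ gives the number of samples we need. Note that $\log\left(1-(1-e)^s\right)$ is negative, so dividing by it reverses the direction of the inequality: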
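$$
\begin{aligned}
\left(1-(1-e)^s\right)^N &\le 1-p \\
N \log\left(1-(1-e)^s\right) &\le \log(1-p) \\
N &\ge \frac{\log(1-p)}{\log\left(1-(1-e)^s\right)} \\
N &\ge \frac{\log(1-0.99)}{\log\left(1-(1-e)^s\right)}
\end{aligned}
$$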
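As a quick numerical check of the bound, here is a minimal Python sketch. The function names `required_samples` and `prob_all_samples_contaminated` are hypothetical, chosen only for illustration, and the code assumes the same independence between draws as the derivation. Since $N$ must be a whole number, the bound is rounded up.

```python
import math


def prob_all_samples_contaminated(e: float, s: int, n: int) -> float:
    """P(all n samples contain at least one outlier) = (1 - (1 - e)^s)^n."""
    return (1.0 - (1.0 - e) ** s) ** n


def required_samples(e: float, s: int, p: float = 0.99) -> int:
    """Smallest integer N satisfying (1 - (1 - e)^s)^N <= 1 - p.

    e: outlier fraction, s: points per sample, p: desired confidence.
    """
    if not 0.0 <= e < 1.0:
        raise ValueError("e must satisfy 0 <= e < 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    if s < 1:
        raise ValueError("s must be at least 1")

    p_clean = (1.0 - e) ** s  # P(sample contains 0 outliers)
    if p_clean == 1.0:
        return 1  # no outliers at all: a single sample is already clean
    if p_clean == 0.0:
        raise ValueError("(1 - e)^s underflowed to 0; N is not representable")

    # log1p(-x) computes log(1 - x) accurately when x is small.
    bound = math.log(1.0 - p) / math.log1p(-p_clean)
    return max(1, math.ceil(bound))


if __name__ == "__main__":
    n = required_samples(e=0.5, s=4)  # half the points are outliers, 4 points per sample
    print(n)
    print(prob_all_samples_contaminated(0.5, 4, n))      # at most 0.01
    print(prob_all_samples_contaminated(0.5, 4, n - 1))  # slightly above 0.01
```

For example, with $e = 0.5$ and $s = 4$ the bound evaluates to about 71.4, so 72 samples are needed; with 71 samples the probability that every sample is contaminated is still just over 1%.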