Distribution Metrics¶
Metrics that directly quantify how different two probability distributions are.
Wasserstein Distance¶
Feature Name: wasserstein_distance
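For 1D distributions: $$ W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx $$
where F_P and F_Q are the cumulative distribution functions of P and Q.
For two samples of equal size n: $$ W_1 = \frac{1}{n} \sum_{i=1}^{n} |x_{(i)} - y_{(i)}| $$
where x₍ᵢ₎ and y₍ᵢ₎ are the i-th smallest values of each sample (the samples sorted in ascending order).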
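The sorted-sample formula can be sketched in a few lines. This is a minimal illustration, not the feature's actual implementation; the function name `wasserstein_1d` and the toy data are made up for the example, and the sketch assumes both samples have the same size.

```python
def wasserstein_1d(x, y):
    """W_1 between two equal-size samples: mean absolute gap after sorting."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(x), sorted(y)
    # Pair the i-th smallest value of x with the i-th smallest value of y.
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


# Hypothetical data: `post` is `pre` shifted by 0.5 and shuffled.
pre = [1.0, 2.0, 3.0, 4.0]
post = [3.5, 1.5, 4.5, 2.5]
w = wasserstein_1d(pre, post)
```

For samples of unequal size, SciPy's `scipy.stats.wasserstein_distance` computes the same quantity from the empirical CDFs.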
Also Known As¶
- Earth Mover's Distance (EMD)
- Kantorovich-Rubinstein metric
Properties¶
- True metric (satisfies triangle inequality)
- Sensitive to location shifts
- Well-defined even when distributions don't overlap
- Units match data units
Interpretation¶
- W = 0: Identical distributions
- For pure mean shift: W = |μ₁ - μ₂|
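The mean-shift result follows from the equivalent quantile form of W₁. As a short derivation, assume Q is P translated by δ = μ₂ − μ₁ with no change in shape, so every quantile moves by the same amount:

$$ F_Q^{-1}(p) = F_P^{-1}(p) + \delta $$

$$ W_1(P, Q) = \int_0^1 |F_P^{-1}(p) - F_Q^{-1}(p)| \, dp = \int_0^1 |\delta| \, dp = |\mu_1 - \mu_2| $$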
Why Useful
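Wasserstein captures the "total amount of change" between distributions. Unlike the Kolmogorov-Smirnov (KS) statistic, which uses only the maximum difference between the two CDFs, it integrates the difference over the whole range, so it reflects the entire shape and how far the mass has moved.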
Used by: meta_stacking_7models
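To illustrate the contrast with KS, the sketch below compares two shifts that are both large enough to separate the samples completely. The KS statistic saturates at 1 in both cases, while the Wasserstein distance keeps growing with the size of the shift. All function names and data here are hypothetical and chosen only for the example.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of t."""
    s = sorted(sample)
    return lambda t: bisect_right(s, t) / len(s)


def ks_statistic(x, y):
    """Maximum absolute ECDF difference, checked at every observed value."""
    fx, fy = ecdf(x), ecdf(y)
    return max(abs(fx(t) - fy(t)) for t in list(x) + list(y))


def wasserstein_1d(x, y):
    """W_1 for equal-size samples via sorted pairing."""
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


pre = [0.0, 1.0, 2.0, 3.0]
near = [v + 10.0 for v in pre]    # non-overlapping, small shift
far = [v + 100.0 for v in pre]    # non-overlapping, large shift

ks_near, ks_far = ks_statistic(pre, near), ks_statistic(pre, far)
w_near, w_far = wasserstein_1d(pre, near), wasserstein_1d(pre, far)
```

Here `ks_near` and `ks_far` are equal, whereas `w_far` is ten times `w_near`, which is the sense in which Wasserstein stays informative when the distributions don't overlap.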
ECDF Distance¶
Feature Name: ecdf_distance
The distance compares the empirical CDFs of the two samples at points tᵢ, where tᵢ are k evenly spaced evaluation points.
Properties¶
- Computationally simpler than exact Wasserstein
- Related to the L1 distance between CDFs, which in 1D is the Wasserstein distance W₁
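A possible implementation is sketched below. The exact formula for `ecdf_distance` is not shown here, so the form used (mean absolute ECDF difference over a grid spanning the pooled range) is an assumption, as are the function name, the default `k`, and the choice of grid endpoints.

```python
from bisect import bisect_right


def ecdf_distance(x, y, k=100):
    """ASSUMED form: mean of |F_x(t_i) - F_y(t_i)| over k evenly spaced t_i."""
    if k < 2:
        raise ValueError("k must be at least 2")
    xs, ys = sorted(x), sorted(y)
    lo = min(xs[0], ys[0])      # grid spans the pooled range (assumption)
    hi = max(xs[-1], ys[-1])
    if hi == lo:
        return 0.0              # all values identical, ECDFs coincide
    step = (hi - lo) / (k - 1)
    total = 0.0
    for i in range(k):
        t = lo + i * step
        fx = bisect_right(xs, t) / len(xs)   # fraction of x at or below t
        fy = bisect_right(ys, t) / len(ys)   # fraction of y at or below t
        total += abs(fx - fy)
    return total / k


d_same = ecdf_distance([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
d_apart = ecdf_distance([0.0, 1.0], [2.0, 3.0], k=4)
```

Under this assumed form the value lies in [0, 1]; multiplying it by the grid range would give a Riemann-sum approximation of the L1 distance between the CDFs.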
Overlap Coefficient¶
Feature Name: overlap_coeff
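For histograms: $$ \text{OVL} = \sum_i \min(h_{1i}, h_{2i}) \times \Delta x $$
where h₁ᵢ and h₂ᵢ are the density-normalized heights of bin i in the two histograms, computed on shared bins of width Δx.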
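The histogram formula can be sketched as follows. This is an illustrative version only: the function name, the default of 20 bins, and the choice to bin over the pooled range are assumptions rather than details of `overlap_coeff`.

```python
def overlap_coefficient(x, y, bins=20):
    """OVL from two density histograms built on shared, equal-width bins."""
    lo = min(min(x), min(y))
    hi = max(max(x), max(y))
    if hi == lo:
        return 1.0                      # every value identical
    width = (hi - lo) / bins

    def density(sample):
        counts = [0] * bins
        for v in sample:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Normalize so that sum(height * width) is 1.
        return [c / (len(sample) * width) for c in counts]

    h1, h2 = density(x), density(y)
    return sum(min(a, b) for a, b in zip(h1, h2)) * width


ovl_same = overlap_coefficient([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
ovl_apart = overlap_coefficient([0.0, 1.0, 2.0], [10.0, 11.0, 12.0], bins=12)
```

The estimate depends on the bin count: too few bins overstate the overlap, and too many bins with small samples understate it.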
Interpretation¶
| Value | Meaning |
|---|---|
| OVL = 1 | Identical distributions |
| OVL = 0 | No overlap (completely separated) |
| OVL = 0.5 | Half of the probability mass is shared |
Why Useful
Directly related to classification difficulty. If OVL = 0.8, about 80% of the probability mass lies in the region the two distributions share, where an observation can't be reliably assigned to pre or post based on value alone.
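One way to make the link to classification difficulty precise, assuming pre and post are equally likely and the classifier picks whichever density, p (pre) or q (post), is higher at the observed value:

$$ P(\text{error}) = \frac{1}{2} \int \min(p(x), q(x)) \, dx = \frac{\text{OVL}}{2}, \qquad \text{accuracy} = 1 - \frac{\text{OVL}}{2} $$

Under these assumptions OVL = 0.8 gives a best achievable accuracy of 0.6, OVL = 0 gives 1.0, and OVL = 1 gives 0.5 (no better than chance).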
Quantile Shift¶
Feature Name: quantile_shift
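The shift is computed from differences between corresponding quantiles F⁻¹(p) of the two distributions, where F⁻¹ is the quantile function (inverse CDF).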
Interpretation¶
- QS = 0: No shift in any quantiles
- Large QS: Substantial distributional shift
- Related to Wasserstein distance, which can be written in quantile form as $$ W_1 = \int_0^1 |F_P^{-1}(p) - F_Q^{-1}(p)| \, dp $$
Used by: meta_stacking_7models
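A quantile-based shift can be sketched as below. The exact formula for `quantile_shift` is not shown here, so this version (mean absolute difference between matching quantiles) is an assumption, as are the function names, the probability grid, and the linear-interpolation quantile rule.

```python
def quantile(sorted_sample, p):
    """Quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_sample) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_sample) - 1)
    frac = pos - lo
    return sorted_sample[lo] * (1 - frac) + sorted_sample[hi] * frac


def quantile_shift(x, y, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """ASSUMED form: mean |F_y^-1(p) - F_x^-1(p)| over the chosen probs."""
    xs, ys = sorted(x), sorted(y)
    diffs = [abs(quantile(ys, p) - quantile(xs, p)) for p in probs]
    return sum(diffs) / len(diffs)


# Hypothetical data: `post` is `pre` shifted up by 2.
pre = [float(v) for v in range(11)]
post = [v + 2.0 for v in pre]
qs = quantile_shift(pre, post)
qs_same = quantile_shift(pre, pre)
```

With a dense, evenly spaced probability grid, this assumed form approaches the quantile-form integral for W₁, which is one way to read the "related to Wasserstein distance" note.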