Hypothesis Testing Pure

Ensemble of statistical tests with no training required.

Performance

Metric      Value    Rank
ROC AUC     0.5394   20th
F1 Score    0.4167   11th
Accuracy    0.4455   25th
Recall      0.6667   1st
Train Time  0s       Instant

Highest Recall

This model has the highest recall (0.67), catching more breaks than any other model — but at the cost of many false positives.

Architecture

Five statistical tests combined with weighted voting:

Time Series → [t-test, KS, CUSUM, LR, Bayes] → Weighted Score → Probability

Component Tests

1. Enhanced t-test (25% weight)

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
score = -log₁₀(p) × min(n₁, n₂) / max(n₁, n₂)
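
A minimal sketch of this component, assuming a SciPy-backed implementation of Welch's t-test; the pre/post split, function names, and the 1e-300 p-value floor are illustrative rather than taken from the repository:

import numpy as np
from scipy import stats

def t_test_score(pre, post):
    # Welch's t-test (unequal variances), as in the formula above
    n1, n2 = len(pre), len(post)
    _, p = stats.ttest_ind(pre, post, equal_var=False)
    p = max(p, 1e-300)                      # avoid log10(0)
    # Down-weight unbalanced splits via min(n1, n2) / max(n1, n2)
    return -np.log10(p) * min(n1, n2) / max(n1, n2)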

2. Kolmogorov-Smirnov Test (20% weight)

\[ D = \sup_x |F_1(x) - F_2(x)| \]
score = D × (-log₁₀(p))
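
A matching sketch with scipy.stats.ks_2samp; the p-value floor is again an illustrative guard:

import numpy as np
from scipy import stats

def ks_score(pre, post):
    # D is the largest CDF gap; weight it by the evidence -log10(p)
    D, p = stats.ks_2samp(pre, post)
    return D * -np.log10(max(p, 1e-300))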

3. CUSUM Test (15% weight)

\[ S_t = \sum_{i=1}^{t} \frac{x_i - \hat{\mu}}{\hat{\sigma}} \]
score = max|Sₜ| over the boundary region
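
A sketch of the standardized cumulative sum; the window parameter that defines the boundary region is an assumption for illustration:

import numpy as np

def cusum_score(x, boundary, window=10):
    # x: 1-D NumPy array; standardize against global mean/std estimates
    z = (x - x.mean()) / (x.std(ddof=1) + 1e-12)
    S = np.cumsum(z)                        # S_t from the formula above
    lo, hi = max(0, boundary - window), min(len(x), boundary + window)
    return np.abs(S[lo:hi]).max()           # max |S_t| near the boundary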

4. Likelihood Ratio Test (20% weight)

\[ \Lambda = -2 \log\left(\frac{L_{\text{single}}}{L_{\text{two-segment}}}\right) \]

Under the null hypothesis of no break, Λ ~ χ²(df).
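
A sketch under Gaussian segment models; the df = 2 choice (one extra mean plus one extra variance) and the conversion of the p-value into a score are assumptions:

import numpy as np
from scipy import stats

def gauss_loglik(x):
    # Maximized Gaussian log-likelihood with the plug-in MLE variance
    var = x.var() + 1e-12
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1.0)

def lr_score(pre, post):
    x = np.concatenate([pre, post])
    # Lambda = -2 log(L_single / L_two-segment)
    lam = -2.0 * (gauss_loglik(x) - gauss_loglik(pre) - gauss_loglik(post))
    p = stats.chi2.sf(lam, df=2)            # null: Lambda ~ chi2(df)
    return -np.log10(max(p, 1e-300))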

5. Bayesian Model Comparison (20% weight)

\[ \log BF = -\frac{1}{2}(BIC_{\text{two}} - BIC_{\text{single}}) \]
\[ P(\text{break}|\text{data}) = \frac{1}{1 + \exp(-\log BF - \log(\text{prior odds}))} \]
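
A sketch using BIC approximations under the same Gaussian segment models; the parameter counts and the default prior odds of 1 are assumptions:

import numpy as np

def gauss_loglik(x):
    # Same helper as in the likelihood-ratio sketch above
    return -0.5 * len(x) * (np.log(2 * np.pi * (x.var() + 1e-12)) + 1.0)

def bayes_score(pre, post, prior_odds=1.0):
    x = np.concatenate([pre, post])
    n = len(x)
    # BIC = k log(n) - 2 log L; single model: one mean + one variance (k = 2)
    bic_single = 2 * np.log(n) - 2.0 * gauss_loglik(x)
    # Two-segment model: a mean and variance per segment (k = 4)
    bic_two = 4 * np.log(n) - 2.0 * (gauss_loglik(pre) + gauss_loglik(post))
    log_bf = -0.5 * (bic_two - bic_single)  # log Bayes factor for a break
    # Logistic map from the posterior formula above
    return 1.0 / (1.0 + np.exp(-log_bf - np.log(prior_odds)))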

Final Score Calculation

score = (0.25 × t_score +
         0.20 × ks_score +
         0.15 × cusum_score +
         0.20 × lr_score +
         0.20 × bayes_score)
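
Illustrative glue code combining the five sketches above. The component scores live on different scales (the Bayes score is already a probability), so a real implementation presumably rescales each component before weighting; this sketch omits that step:

WEIGHTS = {"t": 0.25, "ks": 0.20, "cusum": 0.15, "lr": 0.20, "bayes": 0.20}

def combined_score(x, boundary):
    # Split at the candidate break index and apply the weighted vote
    pre, post = x[:boundary], x[boundary:]
    components = {
        "t": t_test_score(pre, post),
        "ks": ks_score(pre, post),
        "cusum": cusum_score(x, boundary),
        "lr": lr_score(pre, post),
        "bayes": bayes_score(pre, post),
    }
    return sum(w * components[name] for name, w in WEIGHTS.items())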

Advantages

  • No training required — Deploy immediately
  • Interpretable — Each component has statistical meaning
  • Fast — Instant predictions
  • Theoretically grounded — Based on established statistical theory

Limitations

  • Lower accuracy — 0.5394 AUC vs. 0.7930 for the best ML model
  • Many false positives — High recall (0.67) but low precision
  • Assumes specific distributions — May not capture complex patterns

Usage

cd hypothesis_testing_pure
python main.py --mode infer --data-dir /path/to/data

No --mode train needed — this model doesn't require training.

When to Use

Good For

  • Baseline comparison
  • When interpretability is critical
  • No training data available
  • Understanding statistical evidence

Avoid If

  • Need high accuracy (use ML models)
  • False positives are costly
  • Complex non-standard patterns