Structural Break Detection

Disclaimer

For academic and educational use only; not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.

Note on Results

These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO. All methods are evaluated on two independent datasets to measure generalization.

  • Best Robust Score: 0.715 (xgb_tuned_regularization)
  • Most Stable: 99.7% stability (xgb_selective_spectral)
  • Top Performer: xgb_tuned_regularization (best in local validation)
  • Fastest Stable: xgb_core_7features (40s training, 98% stable)

Key Finding: Single-Dataset Benchmarks Are Misleading

Overfitting Alert

Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.

  • gradient_boost_comprehensive: 0.7930 → 0.6533 (-17.6%)
  • meta_stacking_7models: 0.7662 → 0.6422 (-16.2%)

Always validate on multiple datasets.

Cross-Dataset Metrics

We use Stability Score and Robust Score to measure generalization:

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
Robust Score = min(AUC_A, AUC_B) × Stability
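
Both metrics are simple to compute. A minimal sketch in Python (the function names are illustrative, not part of this repository):

def stability(auc_a, auc_b):
    """Stability Score: 1 minus the relative gap between the two AUCs."""
    return 1 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a, auc_b):
    """Robust Score: worst-case AUC discounted by instability."""
    return min(auc_a, auc_b) * stability(auc_a, auc_b)

# Example, using xgb_tuned_regularization's row from the table below:
# stability(0.7423, 0.7705)    -> 0.963
# robust_score(0.7423, 0.7705) -> 0.715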

Top 5 by Robust Score

Rank | Model                      | Dataset A | Dataset B | Stability | Robust Score
1    | xgb_tuned_regularization   | 0.7423    | 0.7705    | 96.3%     | 0.715
2    | weighted_dynamic_ensemble  | 0.6742    | 0.6849    | 98.4%     | 0.664
3    | quad_model_ensemble        | 0.6756    | 0.6622    | 98.0%     | 0.649
4    | mlp_ensemble_deep_features | 0.7122    | 0.6787    | 95.3%     | 0.647
5    | xgb_selective_spectral     | 0.6451    | 0.6471    | 99.7%     | 0.643

Why xgb_tuned_regularization is Top Performer

In local validation, xgb_tuned_regularization comes out on top once cross-dataset performance is taken into account:

Metric       | xgb_tuned | gradient_boost (former #1) | meta_stacking (former #2)
Dataset A    | 0.7423    | 0.7930                     | 0.7662
Dataset B    | 0.7705    | 0.6533                     | 0.6422
Stability    | 96.3%     | 82.4%                      | 83.8%
Robust Score | 0.715     | 0.538                      | 0.538

Key Reasons

  1. Best Robust Score — Highest combined performance and stability in local validation
  2. Improves on Dataset B — 0.7423 (A) → 0.7705 (B)
  3. Strong regularization — aggressive L1/L2 penalties prevent overfitting (see the sketch after this list)
  4. Fast training — 60-185s vs hours for complex ensembles
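
To illustrate point 3, here is a hedged sketch of an aggressively regularized XGBoost setup; the hyperparameter values are hypothetical, not the tuned settings of xgb_tuned_regularization:

from xgboost import XGBClassifier

# Illustrative values only; not the exact configuration of
# xgb_tuned_regularization.
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,            # shallow trees limit variance
    learning_rate=0.05,
    reg_alpha=1.0,          # L1 penalty on leaf weights
    reg_lambda=5.0,         # L2 penalty on leaf weights
    subsample=0.8,          # row subsampling per tree
    colsample_bytree=0.8,   # feature subsampling per tree
    eval_metric="auc",
)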

Key Insights

What Works Best

  1. Regularization is Critical: Models with aggressive L1/L2 regularization generalize better.

  2. Simpler Models Generalize Better: xgb_core_7features (7 features) has 98% stability vs meta_stacking_7models (339 features) at 83.8%.

  3. Statistical Features Work: KS statistic, Cohen's d, t-test as features achieved higher robust scores than complex architectures.
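
A minimal sketch of such statistical features, assuming each series comes with a candidate break point (the function and feature names are illustrative, not from this repository):

import numpy as np
from scipy import stats

def break_features(values: np.ndarray, boundary: int) -> dict:
    """Compare the segments before and after a candidate break point."""
    left, right = values[:boundary], values[boundary:]
    ks_stat, _ = stats.ks_2samp(left, right)              # distribution shift
    t_stat, t_pvalue = stats.ttest_ind(left, right, equal_var=False)  # mean shift
    pooled_std = np.sqrt((left.var(ddof=1) + right.var(ddof=1)) / 2)
    cohens_d = (right.mean() - left.mean()) / pooled_std  # simple pooled-SD effect size
    return {"ks_stat": ks_stat, "t_stat": t_stat,
            "t_pvalue": t_pvalue, "cohens_d": cohens_d}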

What Doesn't Work

  1. Complex ensembles overfit: 7-model stacking dropped from #2 to #10.

  2. Deep learning fails on this univariate task: Transformer and LSTM models remain near-random on both datasets.

  3. Pure statistical tests are unstable: hypothesis_testing_pure has only 76.3% stability.
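
For contrast with the feature-based approach above, a sketch of what a pure-statistical scorer might look like; the test output is used directly as the break score, with no learned model on top (illustrative, not hypothesis_testing_pure's exact logic):

import numpy as np
from scipy import stats

def pure_test_score(values: np.ndarray, boundary: int) -> float:
    """Score a series with a single KS test across the boundary;
    a low p-value maps to a high break score."""
    result = stats.ks_2samp(values[:boundary], values[boundary:])
    return 1.0 - result.pvalue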

Quick Start

# Install dependencies
pip install -r requirements.txt

# Train the top performer
cd xgb_tuned_regularization
python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

# Run inference
python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib

Top Performers by Category

Category          | Model                    | Notes
Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable
Fastest Training  | xgb_core_7features       | 7 features, 40s training, 98% stable
Highest Stability | xgb_selective_spectral   | 99.7% stability
No ML Required    | segment_statistics_only  | Statistical only, 95.4% stable

Model Categories

  • XGBoost variants (8 models)
  • Gradient Boosting ensembles
  • Random Forest combinations
  • MLP ensembles
  • Wavelet LSTM
  • Hierarchical Transformer
  • Meta Stacking (7 models)
  • Quad Model Ensemble
  • Voting ensembles
  • Q-Learning variants
  • DQN Model Selector
  • Hybrid RL-Bayesian
  • Hypothesis Testing
  • Bayesian BOCPD
  • Pure statistical baselines

License

This project is licensed under the MIT License.