# Structural Break Detection
**Disclaimer**

For academic and educational use only; not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.
**Note on Results**
These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.
A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO. All methods are evaluated on two independent datasets (A and B) to measure generalization.
- **Best Robust Score**: 0.715, achieved by `xgb_tuned_regularization`
- **Most Stable**: 99.7% stability, `xgb_selective_spectral`
- **Top Performer**: `xgb_tuned_regularization`, best in local validation
- **Fastest Stable**: `xgb_core_7features`, 40 s training, 98% stable
## Key Finding: Single-Dataset Benchmarks Are Misleading
**Overfitting Alert**
Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.
- `gradient_boost_comprehensive`: 0.7930 → 0.6533 (-17.6%)
- `meta_stacking_7models`: 0.7662 → 0.6422 (-16.2%)
Always validate on multiple datasets.
## Cross-Dataset Metrics
We use two metrics, Stability Score and Robust Score, to measure how well each model generalizes from Dataset A to Dataset B.
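The tabulated values are consistent with Stability = min(A, B) / max(A, B) and Robust Score = min(A, B) × Stability; the following is a minimal sketch under that assumption (the function names are illustrative, not taken from the repository):

```python
def stability(score_a: float, score_b: float) -> float:
    """Ratio of the worse score to the better score; 1.0 means identical performance on both datasets."""
    return min(score_a, score_b) / max(score_a, score_b)


def robust_score(score_a: float, score_b: float) -> float:
    """Worst-case score discounted by cross-dataset instability."""
    return min(score_a, score_b) * stability(score_a, score_b)


# Reproduces the reported values for xgb_tuned_regularization:
print(round(stability(0.7423, 0.7705), 3))     # 0.963
print(round(robust_score(0.7423, 0.7705), 3))  # 0.715
```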
### Top 5 by Robust Score
| Rank | Model | Dataset A | Dataset B | Stability | Robust Score |
|---|---|---|---|---|---|
| 1 | `xgb_tuned_regularization` | 0.7423 | 0.7705 | 96.3% | 0.715 |
| 2 | `weighted_dynamic_ensemble` | 0.6742 | 0.6849 | 98.4% | 0.664 |
| 3 | `quad_model_ensemble` | 0.6756 | 0.6622 | 98.0% | 0.649 |
| 4 | `mlp_ensemble_deep_features` | 0.7122 | 0.6787 | 95.3% | 0.647 |
| 5 | `xgb_selective_spectral` | 0.6451 | 0.6471 | 99.7% | 0.643 |
## Why xgb_tuned_regularization Is the Top Performer
xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:
| Metric | xgb_tuned | gradient_boost (former #1) | meta_stacking (former #2) |
|---|---|---|---|
| Dataset A | 0.7423 | 0.7930 | 0.7662 |
| Dataset B | 0.7705 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
**Why It's the Top Performer**
- **Best Robust Score**: highest combined performance and stability in local validation
- **Consistent on Dataset B**: 0.7423 (A) → 0.7705 (B), no out-of-sample drop
- **Strong regularization**: aggressive L1/L2 penalties curb overfitting (see the sketch after this list)
- **Fast training**: 60-185 s, vs. hours for the complex ensembles
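In XGBoost, L1 and L2 penalties on leaf weights are controlled by the `reg_alpha` and `reg_lambda` parameters. The configuration below is only an illustrative sketch of that style of regularization; the repository's actual tuned hyperparameters may differ:

```python
from xgboost import XGBClassifier

# Illustrative values, not the repository's tuned configuration.
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,           # shallow trees reduce variance
    learning_rate=0.05,
    reg_alpha=1.0,         # L1 penalty on leaf weights
    reg_lambda=5.0,        # L2 penalty on leaf weights
    subsample=0.8,         # row subsampling adds further regularization
    colsample_bytree=0.8,  # feature subsampling per tree
)
# model.fit(X_train, y_train) on the per-series feature matrix
```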
## Key Insights
**What Works Best**

- **Regularization is critical**: models with aggressive L1/L2 regularization generalize better.
- **Simpler models generalize better**: `xgb_core_7features` (7 features) reaches 98% stability, vs. `meta_stacking_7models` (339 features) at 83.8%.
- **Statistical features work**: the KS statistic, Cohen's d, and t-test used as features achieved higher robust scores than complex architectures (see the sketch below).
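As a rough illustration of such features (not the repository's exact feature pipeline), the segments before and after a candidate break point can be compared with `scipy.stats`; `extract_break_features` is a hypothetical helper name:

```python
import numpy as np
from scipy import stats

def extract_break_features(values: np.ndarray, boundary: int) -> dict:
    """Compare the segments before and after the candidate break point."""
    left, right = values[:boundary], values[boundary:]

    ks_stat, ks_p = stats.ks_2samp(left, right)                  # distributional shift
    t_stat, t_p = stats.ttest_ind(left, right, equal_var=False)  # shift in means

    # Cohen's d with a pooled standard deviation
    pooled_std = np.sqrt((left.var(ddof=1) + right.var(ddof=1)) / 2)
    cohens_d = (right.mean() - left.mean()) / pooled_std if pooled_std > 0 else 0.0

    return {
        "ks_stat": ks_stat,
        "ks_pvalue": ks_p,
        "t_stat": t_stat,
        "t_pvalue": t_p,
        "cohens_d": cohens_d,
    }
```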
**What Doesn't Work**

- **Complex ensembles overfit**: the 7-model stacking ensemble dropped from #2 to #10.
- **Deep learning fails on univariate data**: Transformer and LSTM remain near-random on both datasets.
- **Pure statistical tests are unstable**: `hypothesis_testing_pure` has only 76.3% stability (a minimal sketch follows this list).
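For contrast, a purely statistical detector reduces to a single test statistic. This hypothetical baseline (not the actual `hypothesis_testing_pure` implementation) scores a known boundary directly from a two-sample KS test:

```python
import numpy as np
from scipy import stats

def break_probability(values: np.ndarray, boundary: int) -> float:
    """Score a candidate break as 1 minus the p-value of a two-sample KS test."""
    left, right = values[:boundary], values[boundary:]
    _, p_value = stats.ks_2samp(left, right)
    return 1.0 - p_value  # higher score = stronger evidence of a break
```

Because such a score hinges on a single statistic, it is more sensitive to segment length and noise than a learned combination of features, which is consistent with the low stability reported above.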
## Quick Start
    # Install dependencies
    pip install -r requirements.txt

    # Train the top performer
    cd xgb_tuned_regularization
    python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

    # Run inference
    python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib
## Top Performers by Category
| Category | Model | Notes |
|---|---|---|
| Best Robust Score | `xgb_tuned_regularization` | 0.715 robust, 96.3% stable |
| Fastest Training | `xgb_core_7features` | 7 features, 40 s training, 98% stable |
| Highest Stability | `xgb_selective_spectral` | 99.7% stability |
| No ML Required | `segment_statistics_only` | Statistical only, 95.4% stable |
## Model Categories
- XGBoost variants (8 models)
- Gradient Boosting ensembles
- Random Forest combinations
- MLP ensembles
- Wavelet LSTM
- Hierarchical Transformer
- Meta Stacking (7 models)
- Quad Model Ensemble
- Voting ensembles
- Q-Learning variants
- DQN Model Selector
- Hybrid RL-Bayesian
- Hypothesis Testing
- Bayesian BOCPD
- Pure statistical baselines
## License
This project is licensed under the MIT License.