# Structural Break Detection
**Disclaimer**

For academic and educational use only; not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.
**Note on Results**
These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.
A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO. All methods are evaluated on two independent datasets (A and B) to measure generalization.
- **Best Robust Score**: 0.715, achieved by `xgb_tuned_regularization`
- **Most Stable**: 99.7% stability, `xgb_selective_spectral`
- **Top Performer**: `xgb_tuned_regularization`, best in local validation
- **Fastest Stable**: `xgb_core_7features`, 40 s training, 98% stable
## Key Finding: Single-Dataset Benchmarks Are Misleading
**Overfitting Alert**
Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.
- `gradient_boost_comprehensive`: 0.7930 → 0.6533 (-17.6%)
- `meta_stacking_7models`: 0.7662 → 0.6422 (-16.2%)
Always validate on multiple datasets.
## Cross-Dataset Metrics
We use two metrics, Stability Score and Robust Score, to measure how well each model generalizes from Dataset A to Dataset B.
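The tabulated values are consistent with Stability = min(A, B) / max(A, B) and Robust Score = min(A, B) × Stability; the following is a minimal sketch under that assumption (the function names are illustrative, not taken from the repository):

```python
def stability(score_a: float, score_b: float) -> float:
    """Ratio of the worse score to the better score; 1.0 means identical performance on both datasets."""
    return min(score_a, score_b) / max(score_a, score_b)


def robust_score(score_a: float, score_b: float) -> float:
    """Worst-case score discounted by cross-dataset instability."""
    return min(score_a, score_b) * stability(score_a, score_b)


# Reproduces the reported values for xgb_tuned_regularization:
print(round(stability(0.7423, 0.7705), 3))     # 0.963
print(round(robust_score(0.7423, 0.7705), 3))  # 0.715
```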
### Top 5 by Robust Score
| Rank | Model | Dataset A | Dataset B | Stability | Robust Score |
|---|---|---|---|---|---|
| 1 | `xgb_tuned_regularization` | 0.7423 | 0.7705 | 96.3% | 0.715 |
| 2 | `weighted_dynamic_ensemble` | 0.6742 | 0.6849 | 98.4% | 0.664 |
| 3 | `quad_model_ensemble` | 0.6756 | 0.6622 | 98.0% | 0.649 |
| 4 | `mlp_ensemble_deep_features` | 0.7122 | 0.6787 | 95.3% | 0.647 |
| 5 | `xgb_selective_spectral` | 0.6451 | 0.6471 | 99.7% | 0.643 |
## Why xgb_tuned_regularization Is the Top Performer
xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:
| Metric | xgb_tuned | gradient_boost (former #1) | meta_stacking (former #2) |
|---|---|---|---|
| Dataset A | 0.7423 | 0.7930 | 0.7662 |
| Dataset B | 0.7705 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
**Why It's the Top Performer**
- **Best Robust Score**: highest combined performance and stability in local validation
- **Consistent on Dataset B**: 0.7423 (A) → 0.7705 (B), no out-of-sample drop
- **Strong regularization**: aggressive L1/L2 penalties curb overfitting (see the sketch after this list)
- **Fast training**: 60-185 s, vs. hours for the complex ensembles
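In XGBoost, L1 and L2 penalties on leaf weights are controlled by the `reg_alpha` and `reg_lambda` parameters. The configuration below is only an illustrative sketch of that style of regularization; the repository's actual tuned hyperparameters may differ:

```python
from xgboost import XGBClassifier

# Illustrative values, not the repository's tuned configuration.
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,           # shallow trees reduce variance
    learning_rate=0.05,
    reg_alpha=1.0,         # L1 penalty on leaf weights
    reg_lambda=5.0,        # L2 penalty on leaf weights
    subsample=0.8,         # row subsampling adds further regularization
    colsample_bytree=0.8,  # feature subsampling per tree
)
# model.fit(X_train, y_train) on the per-series feature matrix
```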
## Key Insights
**What Works Best**

- **Regularization is critical**: models with aggressive L1/L2 regularization generalize better.
- **Simpler models generalize better**: `xgb_core_7features` (7 features) reaches 98% stability, vs. `meta_stacking_7models` (339 features) at 83.8%.
- **Statistical features work**: the KS statistic, Cohen's d, and t-test used as features achieved higher robust scores than complex architectures (see the sketch below).
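As a rough illustration of such features (not the repository's exact feature pipeline), the segments before and after a candidate break point can be compared with `scipy.stats`; `extract_break_features` is a hypothetical helper name:

```python
import numpy as np
from scipy import stats

def extract_break_features(values: np.ndarray, boundary: int) -> dict:
    """Compare the segments before and after the candidate break point."""
    left, right = values[:boundary], values[boundary:]

    ks_stat, ks_p = stats.ks_2samp(left, right)                  # distributional shift
    t_stat, t_p = stats.ttest_ind(left, right, equal_var=False)  # shift in means

    # Cohen's d with a pooled standard deviation
    pooled_std = np.sqrt((left.var(ddof=1) + right.var(ddof=1)) / 2)
    cohens_d = (right.mean() - left.mean()) / pooled_std if pooled_std > 0 else 0.0

    return {
        "ks_stat": ks_stat,
        "ks_pvalue": ks_p,
        "t_stat": t_stat,
        "t_pvalue": t_p,
        "cohens_d": cohens_d,
    }
```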
**What Doesn't Work**

- **Complex ensembles overfit**: the 7-model stacking ensemble dropped from #2 to #10.
- **Deep learning fails on univariate data**: Transformer and LSTM remain near-random on both datasets.
- **Pure statistical tests are unstable**: `hypothesis_testing_pure` has only 76.3% stability (a minimal sketch follows this list).
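For contrast, a purely statistical detector reduces to a single test statistic. This hypothetical baseline (not the actual `hypothesis_testing_pure` implementation) scores a known boundary directly from a two-sample KS test:

```python
import numpy as np
from scipy import stats

def break_probability(values: np.ndarray, boundary: int) -> float:
    """Score a candidate break as 1 minus the p-value of a two-sample KS test."""
    left, right = values[:boundary], values[boundary:]
    _, p_value = stats.ks_2samp(left, right)
    return 1.0 - p_value  # higher score = stronger evidence of a break
```

Because such a score hinges on a single statistic, it is more sensitive to segment length and noise than a learned combination of features, which is consistent with the low stability reported above.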
## Quick Start
    # Install dependencies
    pip install -r requirements.txt

    # Train the top performer
    cd xgb_tuned_regularization
    python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

    # Run inference
    python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib
## Top Performers by Category
| Category | Model | Notes |
|---|---|---|
| Best Robust Score | `xgb_tuned_regularization` | 0.715 robust, 96.3% stable |
| Fastest Training | `xgb_core_7features` | 7 features, 40 s training, 98% stable |
| Highest Stability | `xgb_selective_spectral` | 99.7% stability |
| No ML Required | `segment_statistics_only` | Statistical only, 95.4% stable |
## Model Categories
- XGBoost variants (8 models)
- Gradient Boosting ensembles
- Random Forest combinations
- MLP ensembles
- Wavelet LSTM
- Hierarchical Transformer
- Meta Stacking (7 models)
- Quad Model Ensemble
- Voting ensembles
- Q-Learning variants
- DQN Model Selector
- Hybrid RL-Bayesian
- Hypothesis Testing
- Bayesian BOCPD
- Pure statistical baselines
## License
This project is licensed under the MIT License.