# Performance Summary
**Note on Results:** These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.
Complete results from running all 25 structural break detectors on two independent local validation datasets.
## Dataset A vs Dataset B
## Key Finding

**Single-Dataset Benchmarks Are Misleading.** The models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B. Always evaluate cross-dataset generalization using the Stability Score.
## Cross-Dataset Metrics
### Overall Statistics

- **Best Robust Score:** 0.715 — `xgb_tuned_regularization`
- **Most Stable:** 99.7% — `xgb_selective_spectral`
- **Top Performer:** `xgb_tuned_regularization` — best in local validation
- **Biggest Overfitter:** `gradient_boost_comprehensive` — AUC dropped 17.6%
### Full Rankings (by Robust Score)
| Rank | Detector | Dataset A AUC | Dataset B AUC | Min AUC | Stability | Robust Score |
|---|---|---|---|---|---|---|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
| 11 | segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 |
| 12 | meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 |
| 13 | gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 |
| 14 | bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 |
| 15 | wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 |
| 16 | qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 |
| 17 | dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 |
| 18 | kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 |
| 19 | qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 |
| 20 | hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 |
| 21 | qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 |
| 22 | knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 |
| 23 | knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 |
| 24 | welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 |
| 25 | hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 |
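The Min AUC, Stability, and Robust Score columns can be reproduced from the two per-dataset AUCs. A minimal sketch, assuming the definitions implied by the numbers above (stability as the min/max AUC ratio, robust score as min AUC times stability; the function name is mine):

```python
def cross_dataset_metrics(auc_a: float, auc_b: float) -> dict:
    """Cross-dataset summary metrics from two per-dataset ROC AUCs."""
    min_auc = min(auc_a, auc_b)
    # Stability: 1.0 means identical performance on both datasets.
    stability = min_auc / max(auc_a, auc_b)
    # Robust score rewards a high worst-case AUC and penalizes instability.
    robust = min_auc * stability
    return {"min_auc": min_auc, "stability": stability, "robust": robust}

# Reproduces row 1 (xgb_tuned_regularization): 0.7423, 96.3%, 0.715
m = cross_dataset_metrics(0.7423, 0.7705)
print(f"{m['min_auc']:.4f} | {m['stability']:.1%} | {m['robust']:.3f}")
```

Under these assumed definitions, a model is penalized twice for inconsistency: once through the lower Min AUC and again through the stability factor.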
## Stability Analysis

### Most Stable Models
| Model | Stability | Interpretation |
|---|---|---|
| xgb_selective_spectral | 99.7% | Near-identical performance |
| weighted_dynamic_ensemble | 98.4% | Excellent generalization |
| quad_model_ensemble | 98.0% | Excellent generalization |
| xgb_core_7features | 98.0% | Excellent generalization |
| bayesian_bocpd_fused_lasso | 97.6% | Strong generalization |
| xgb_70_statistical | 97.1% | Strong generalization |
| xgb_tuned_regularization | 96.3% | Strong generalization |
### Overfitting Alert (Stability < 85%)
| Model | Dataset A Rank | Dataset B Rank | AUC Drop | Stability |
|---|---|---|---|---|
| gradient_boost_comprehensive | #1 | #6 | -17.6% | 82.4% |
| meta_stacking_7models | #2 | #10 | -16.2% | 83.8% |
| knn_spectral_fft | #15 | #23 | -17.0% | 83.0% |
| knn_wavelet | #14 | #20 | -15.7% | 84.3% |
| hypothesis_testing_pure | #20 | #25 | -23.7% | 76.3% |
| welch_ttest | #25 | #9 | +39.1% | 71.9% |
**welch_ttest Anomaly.** `welch_ttest` showed high variance between datasets (AUC 0.46 → 0.64). Its low stability score (71.9%) indicates unpredictable generalization.
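The overfitting flag follows directly from the same stability ratio; a short sketch using per-dataset AUCs copied from the full rankings table and the 85% threshold from the heading above (the min/max formulation of stability is an assumption consistent with the reported values):

```python
# (dataset_a_auc, dataset_b_auc) pairs from the full rankings table
aucs = {
    "gradient_boost_comprehensive": (0.7930, 0.6533),
    "meta_stacking_7models": (0.7662, 0.6422),
    "welch_ttest": (0.4634, 0.6444),
    "xgb_selective_spectral": (0.6451, 0.6471),  # stable control case
}

for name, (a, b) in aucs.items():
    stability = min(a, b) / max(a, b)
    if stability < 0.85:  # overfitting alert threshold
        print(f"{name}: stability {stability:.1%} (alert)")
```

Note the ratio is symmetric, so `welch_ttest` is flagged even though its AUC *improved* on Dataset B: the metric penalizes variance in either direction.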
## Dataset-Specific Results

### Dataset A Rankings (Top 5, by ROC AUC)
| Rank | Detector | ROC AUC | F1 | Recall |
|---|---|---|---|---|
| 1 | gradient_boost_comprehensive | 0.7930 | 0.4186 | 0.30 |
| 2 | meta_stacking_7models | 0.7662 | 0.5417 | 0.43 |
| 3 | xgb_tuned_regularization | 0.7423 | 0.5172 | 0.50 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.2105 | 0.13 |
| 5 | quad_model_ensemble | 0.6756 | 0.3500 | 0.23 |
### Dataset B Rankings (Top 5, by ROC AUC)
| Rank | Detector | ROC AUC | F1 | Recall |
|---|---|---|---|---|
| 1 | xgb_tuned_regularization | 0.7705 | 0.5424 | 0.48 |
| 2 | weighted_dynamic_ensemble | 0.6849 | 0.3333 | 0.21 |
| 3 | mlp_ensemble_deep_features | 0.6787 | 0.3256 | 0.21 |
| 4 | quad_model_ensemble | 0.6622 | 0.3721 | 0.24 |
| 5 | xgb_30f_fast_inference | 0.6622 | 0.3556 | 0.24 |
## Key Observations
**Top Performers (Stable)**

- **Regularized XGBoost models dominate** — 6 of the top 10 by robust score are XGBoost variants
- **Simplicity wins** — `xgb_core_7features` (7 features) outranks `meta_stacking_7models` (339 features)
- **Strong regularization supports generalization** — `xgb_tuned_regularization` maintained consistent AUC across datasets
**Underperformers**

- **Complex ensembles overfit** — `gradient_boost_comprehensive` and `meta_stacking_7models` dropped dramatically between datasets
- **Deep learning struggles** — the Transformer (0.49-0.54 AUC) and LSTM (0.50-0.52 AUC) scored near random on both datasets
- **RL approaches fail** — Q-learning models scored around 0.46-0.55 AUC
- **Pure statistical tests are unstable** — `hypothesis_testing_pure` has only 76.3% stability