Performance Summary

Note on Results

These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

Complete results from running all 25 structural break detectors on two independent local validation datasets.

Dataset A vs Dataset B

Key Finding

Single-Dataset Benchmarks Are Misleading

Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.

Always evaluate cross-dataset generalization using Stability Score.

Cross-Dataset Metrics

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
Robust Score = min(AUC_A, AUC_B) × Stability
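The two formulas above translate directly into code. A minimal Python sketch (function names are illustrative), checked against the table's top entry, xgb_tuned_regularization (AUC 0.7423 on Dataset A, 0.7705 on Dataset B):

```python
def stability(auc_a: float, auc_b: float) -> float:
    """Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)."""
    return 1 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a: float, auc_b: float) -> float:
    """Robust Score = min(AUC_A, AUC_B) x Stability."""
    return min(auc_a, auc_b) * stability(auc_a, auc_b)

# xgb_tuned_regularization from the rankings table:
s = stability(0.7423, 0.7705)     # ~0.963 (96.3%)
r = robust_score(0.7423, 0.7705)  # ~0.715
```

Multiplying the worse of the two AUCs by stability rewards models that are both good and consistent; a model that is excellent on one dataset but mediocre on the other scores low on both factors.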

Overall Statistics

  • Best Robust Score: 0.715 — xgb_tuned_regularization
  • Most Stable: 99.7% — xgb_selective_spectral
  • Top Performer: xgb_tuned_regularization — best overall in local validation
  • Biggest Overfitter: gradient_boost_comprehensive — AUC dropped 17.6% from Dataset A to Dataset B

Full Rankings (by Robust Score)

Rank Detector AUC (A) AUC (B) Min AUC Stability Robust Score
1 xgb_tuned_regularization 0.7423 0.7705 0.7423 96.3% 0.715
2 weighted_dynamic_ensemble 0.6742 0.6849 0.6742 98.4% 0.664
3 quad_model_ensemble 0.6756 0.6622 0.6622 98.0% 0.649
4 mlp_ensemble_deep_features 0.7122 0.6787 0.6787 95.3% 0.647
5 xgb_selective_spectral 0.6451 0.6471 0.6451 99.7% 0.643
6 xgb_70_statistical 0.6685 0.6493 0.6493 97.1% 0.631
7 mlp_xgb_simple_blend 0.6746 0.6399 0.6399 94.9% 0.607
8 xgb_core_7features 0.6188 0.6315 0.6188 98.0% 0.606
9 xgb_30f_fast_inference 0.6282 0.6622 0.6282 94.9% 0.596
10 xgb_importance_top15 0.6723 0.6266 0.6266 93.2% 0.584
11 segment_statistics_only 0.6249 0.5963 0.5963 95.4% 0.569
12 meta_stacking_7models 0.7662 0.6422 0.6422 83.8% 0.538
13 gradient_boost_comprehensive 0.7930 0.6533 0.6533 82.4% 0.538
14 bayesian_bocpd_fused_lasso 0.5005 0.4884 0.4884 97.6% 0.477
15 wavelet_lstm 0.5249 0.5000 0.5000 95.3% 0.476
16 qlearning_rolling_stats 0.5488 0.5078 0.5078 92.5% 0.470
17 dqn_base_model_selector 0.5474 0.5067 0.5067 92.6% 0.469
18 kolmogorov_smirnov_xgb 0.4939 0.5205 0.4939 94.9% 0.469
19 qlearning_bayesian_cpd 0.5540 0.5067 0.5067 91.5% 0.463
20 hierarchical_transformer 0.5439 0.4862 0.4862 89.4% 0.435
21 qlearning_memory_tabular 0.4986 0.4559 0.4559 91.4% 0.417
22 knn_wavelet 0.5812 0.4898 0.4898 84.3% 0.413
23 knn_spectral_fft 0.5793 0.4808 0.4808 83.0% 0.399
24 welch_ttest 0.4634 0.6444 0.4634 71.9% 0.333
25 hypothesis_testing_pure 0.5394 0.4118 0.4118 76.3% 0.314

Stability Analysis

Most Stable Models

Model Stability Interpretation
xgb_selective_spectral 99.7% Near-identical performance
weighted_dynamic_ensemble 98.4% Excellent generalization
quad_model_ensemble 98.0% Excellent generalization
xgb_core_7features 98.0% Excellent generalization
bayesian_bocpd_fused_lasso 97.6% Strong generalization
xgb_70_statistical 97.1% Strong generalization
xgb_tuned_regularization 96.3% Strong generalization

Overfitting Alert (Stability < 85%)

Model Dataset A Rank Dataset B Rank AUC Drop Stability
gradient_boost_comprehensive #1 #6 -17.6% 82.4%
meta_stacking_7models #2 #10 -16.2% 83.8%
knn_spectral_fft #15 #23 -17.0% 83.0%
knn_wavelet #14 #20 -15.7% 84.3%
hypothesis_testing_pure #20 #25 -23.7% 76.3%
welch_ttest #25 #9 +39.1% 71.9%
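The overfitting alert above can be reproduced programmatically. A minimal sketch (the helper name and dict layout are illustrative; the AUC pairs copy three rows from the rankings table):

```python
# (Dataset A AUC, Dataset B AUC) for three models flagged above.
aucs = {
    "gradient_boost_comprehensive": (0.7930, 0.6533),
    "meta_stacking_7models": (0.7662, 0.6422),
    "welch_ttest": (0.4634, 0.6444),
}

def flag_overfitters(aucs, threshold=0.85):
    """Return models whose stability falls below the threshold,
    with the relative AUC change from Dataset A to Dataset B."""
    flagged = {}
    for name, (a, b) in aucs.items():
        stab = 1 - abs(a - b) / max(a, b)
        if stab < threshold:
            flagged[name] = {
                "stability": round(stab, 3),
                "auc_change_pct": round((b - a) / a * 100, 1),
            }
    return flagged
```

Note that stability uses the absolute gap, so welch_ttest is flagged even though its AUC rose on Dataset B: large swings in either direction count against a model.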

welch_ttest Anomaly

welch_ttest swung sharply between datasets (0.4634 on Dataset A → 0.6444 on Dataset B), the only flagged model whose AUC rose rather than fell. Its low stability score (71.9%) still penalizes it, because stability measures consistency in either direction, making its generalization unpredictable.

Dataset-Specific Results

Dataset A Rankings (by ROC AUC)

Rank Detector ROC AUC F1 Recall
1 gradient_boost_comprehensive 0.7930 0.4186 0.30
2 meta_stacking_7models 0.7662 0.5417 0.43
3 xgb_tuned_regularization 0.7423 0.5172 0.50
4 mlp_ensemble_deep_features 0.7122 0.2105 0.13
5 quad_model_ensemble 0.6756 0.3500 0.23

Dataset B Rankings (by ROC AUC)

Rank Detector ROC AUC F1 Recall
1 xgb_tuned_regularization 0.7705 0.5424 0.48
2 weighted_dynamic_ensemble 0.6849 0.3333 0.21
3 mlp_ensemble_deep_features 0.6787 0.3256 0.21
4 quad_model_ensemble 0.6622 0.3721 0.24
5 xgb_30f_fast_inference 0.6622 0.3556 0.24

Key Observations

Top Performers (Stable)

  • Regularized XGBoost models dominate — 6 of top 10 by robust score are XGBoost variants
  • Simplicity wins — xgb_core_7features (7 features) outranks meta_stacking_7models (339 features)
  • Strong regularization supports generalization — xgb_tuned_regularization maintained consistent AUC across datasets

Underperformers

  • Complex ensembles overfit — gradient_boost and meta_stacking dropped dramatically
  • Deep learning struggles — the Transformer (AUC 0.49–0.54) and LSTM (0.50–0.52) are near random on both datasets
  • RL approaches fail — Q-learning models around 0.46-0.55 AUC
  • Pure statistical tests are unstable — hypothesis_testing_pure has 76.3% stability

Detailed Analysis