# Model Comparison
> **Note on Results:** These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.
Side-by-side comparison of all models based on cross-dataset evaluation.
> **Important: Single-Dataset Rankings Are Unreliable**
>
> Ranked by Dataset A alone, gradient_boost_comprehensive placed #1 and meta_stacking_7models #2, but they fell to #6 and #10 on Dataset B due to overfitting. Always evaluate cross-dataset metrics (Stability Score, Robust Score) instead.
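Neither metric's formula is restated on this page, but the values in the tables below are consistent with Stability = min(A, B) / max(A, B) and Robust Score = min(A, B) × Stability. A minimal sketch under that assumed definition:

```python
def stability_score(score_a: float, score_b: float) -> float:
    """Ratio of the weaker to the stronger dataset score (1.0 = perfectly stable).

    Assumed definition: it reproduces the Stability column in the tables below.
    """
    return min(score_a, score_b) / max(score_a, score_b)


def robust_score(score_a: float, score_b: float) -> float:
    """Weaker dataset score penalized by instability (assumed definition)."""
    return min(score_a, score_b) * stability_score(score_a, score_b)


# Example: xgb_tuned_regularization (Dataset A = 0.7423, Dataset B = 0.7705)
print(round(stability_score(0.7423, 0.7705), 3))  # 0.963
print(round(robust_score(0.7423, 0.7705), 3))     # 0.715
```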
## Top Performers
| Category | Model | Robust Score | Stability |
|---|---|---|---|
| Best Robust Score | xgb_tuned_regularization | 0.715 | 96.3% |
| Highest Stability | xgb_selective_spectral | 0.643 | 99.7% |
| Fastest Training | xgb_core_7features | 0.606 | 98.0% |
| No ML Required | segment_statistics_only | 0.569 | 95.4% |
## Comparison by Category

### High Stability Models (>95%)
| Model | Stability | Robust Score | Dataset A | Dataset B |
|---|---|---|---|---|
| xgb_selective_spectral | 99.7% | 0.643 | 0.6451 | 0.6471 |
| weighted_dynamic_ensemble | 98.4% | 0.664 | 0.6742 | 0.6849 |
| quad_model_ensemble | 98.0% | 0.649 | 0.6756 | 0.6622 |
| xgb_core_7features | 98.0% | 0.606 | 0.6188 | 0.6315 |
| xgb_70_statistical | 97.1% | 0.631 | 0.6685 | 0.6493 |
| xgb_tuned_regularization | 96.3% | 0.715 | 0.7423 | 0.7705 |
### Overfitting Models (High Dataset A, Low Stability)

> **Overfitting Observed:** These models achieved high Dataset A performance but failed to generalize.
| Model | Dataset A | Dataset B | Drop | Stability |
|---|---|---|---|---|
| gradient_boost_comprehensive | 0.7930 | 0.6533 | -17.6% | 82.4% |
| meta_stacking_7models | 0.7662 | 0.6422 | -16.2% | 83.8% |
| knn_spectral_fft | 0.5793 | 0.4808 | -17.0% | 83.0% |
| knn_wavelet | 0.5812 | 0.4898 | -15.7% | 84.3% |
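The Drop column is the relative change from Dataset A to Dataset B; for the rows in this table, Stability equals 100% plus that (negative) drop. A small helper to compute the drop and flag overfitting candidates; the 10% threshold is a hypothetical choice for illustration, not a rule stated on this page:

```python
def score_drop(score_a: float, score_b: float) -> float:
    """Relative change from Dataset A to Dataset B (negative = degradation)."""
    return (score_b - score_a) / score_a


def is_overfit(score_a: float, score_b: float, max_drop: float = 0.10) -> bool:
    """Flag a model whose Dataset B score falls more than max_drop below Dataset A.

    The 10% default is a hypothetical threshold, not taken from this page.
    """
    return score_drop(score_a, score_b) < -max_drop


# Example: gradient_boost_comprehensive
print(f"{score_drop(0.7930, 0.6533):+.1%}")  # -17.6%, matching the table
print(is_overfit(0.7930, 0.6533))            # True
```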
### Fast Training Models
| Model | Train Time | Features | Robust Score | Stability |
|---|---|---|---|---|
| xgb_core_7features | 40s | 7 | 0.606 | 98.0% |
| segment_statistics_only | 43s | ~20 | 0.569 | 95.4% |
| xgb_30f_fast_inference | 53-142s | 30 | 0.596 | 94.9% |
### Interpretable Models
| Model | Robust Score | Stability | Interpretability |
|---|---|---|---|
| segment_statistics_only | 0.569 | 95.4% | High - simple statistics |
| xgb_core_7features | 0.606 | 98.0% | Medium - SHAP available |
| hypothesis_testing_pure | 0.314 | 76.3% | High - but unstable |
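For the "SHAP available" entry above, here is a minimal, self-contained sketch of reading per-feature attributions off an XGBoost model. The data and model are synthetic stand-ins, not artifacts from this project; only the shap/xgboost calls are real API.

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic stand-in: 7 features, mirroring xgb_core_7features' input width.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# TreeExplainer computes exact attributions for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Mean |SHAP| per feature gives a global importance ranking.
print(np.abs(shap_values).mean(axis=0))
```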
## Model Family Comparison

### Tree-Based Models
| Model | Robust Score | Stability | Notes |
|---|---|---|---|
| xgb_tuned_regularization | 0.715 | 96.3% | Best overall |
| xgb_selective_spectral | 0.643 | 99.7% | Most stable |
| xgb_core_7features | 0.606 | 98.0% | Fastest |
| gradient_boost_comprehensive | 0.538 | 82.4% | Overfits |
### Neural Networks
| Model | Robust Score | Stability | Notes |
|---|---|---|---|
| mlp_ensemble_deep_features | 0.647 | 95.3% | Best neural network |
| wavelet_lstm | 0.476 | 95.3% | Near random AUC |
| hierarchical_transformer | 0.435 | 89.4% | Near random AUC |
> **Neural Networks Underperform:** Pure deep learning (LSTM, Transformer) achieved near-random performance (~0.50 AUC) on univariate data.

### Ensembles
| Model | Robust Score | Stability | Notes |
|---|---|---|---|
| weighted_dynamic_ensemble | 0.664 | 98.4% | High stability |
| quad_model_ensemble | 0.649 | 98.0% | Stable |
| meta_stacking_7models | 0.538 | 83.8% | Overfits |
### Reinforcement Learning
| Model | Robust Score | Stability | Notes |
|---|---|---|---|
| qlearning_rolling_stats | 0.470 | 92.5% | Near random |
| dqn_base_model_selector | 0.469 | 92.6% | Near random |
| qlearning_bayesian_cpd | 0.463 | 91.5% | Near random |
> **RL Models Near Random:** All RL models achieved robust scores < 0.48 with ~0.50 AUC, indicating they failed to learn meaningful patterns from univariate features.

## Feature Count vs. Stability
| Features | Model | Robust Score | Stability |
|---|---|---|---|
| 7 | xgb_core_7features | 0.606 | 98.0% |
| 30 | xgb_30f_fast_inference | 0.596 | 94.9% |
| 50 | xgb_70_statistical | 0.631 | 97.1% |
| 70+ | xgb_tuned_regularization | 0.715 | 96.3% |
| 100+ | gradient_boost_comprehensive | 0.538 | 82.4% |
| 339 | meta_stacking_7models | 0.538 | 83.8% |
> **More Features ≠ Better Generalization:** Performance peaked around 70 features with strong regularization. Models with 100+ features showed the lowest stability scores in this comparison.
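The tuned hyperparameters of xgb_tuned_regularization are not listed on this page; the sketch below only illustrates the kind of regularization knobs that trade raw Dataset A fit for cross-dataset stability. All values are hypothetical.

```python
import xgboost as xgb

# Hypothetical regularization-heavy configuration (illustrative values only;
# the real xgb_tuned_regularization settings are not documented here).
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,           # shallow trees resist memorizing Dataset A quirks
    learning_rate=0.05,
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # feature subsampling per tree
    reg_alpha=1.0,         # L1 penalty on leaf weights
    reg_lambda=5.0,        # L2 penalty on leaf weights
    min_child_weight=5,    # minimum hessian sum required per leaf
)
```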
## Summary Tables

### Top 5 by Robust Score
| Rank | Model | Robust Score | Stability |
|---|---|---|---|
| 1 | xgb_tuned_regularization | 0.715 | 96.3% |
| 2 | weighted_dynamic_ensemble | 0.664 | 98.4% |
| 3 | quad_model_ensemble | 0.649 | 98.0% |
| 4 | mlp_ensemble_deep_features | 0.647 | 95.3% |
| 5 | xgb_selective_spectral | 0.643 | 99.7% |
### Lowest Performers
| Model | Robust Score | Stability | Issue |
|---|---|---|---|
| hypothesis_testing_pure | 0.314 | 76.3% | Low stability |
| welch_ttest | 0.333 | 71.9% | Lowest stability |
| knn_spectral_fft | 0.399 | 83.0% | Overfitting |
| hierarchical_transformer | 0.435 | 89.4% | Near random AUC |
| wavelet_lstm | 0.476 | 95.3% | Near random AUC |