
# Model Comparison

> **Note on results:** These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

Side-by-side comparison of all models based on cross-dataset evaluation.

> **Important: single-dataset rankings are unreliable.** Ranked by Dataset A alone, gradient_boost_comprehensive placed #1 and meta_stacking_7models #2, but they fell to #6 and #10 on Dataset B due to overfitting. Always evaluate the cross-dataset metrics (Stability Score, Robust Score) instead.
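The page never states how Stability and Robust Score are computed, but a simple ratio-based definition reproduces every row in the tables below: stability is the worse dataset score divided by the better one, and the robust score is the worst-case score penalized by that ratio. The sketch below uses these *inferred* definitions; treat them as an assumption, not the official formulas.

```python
def stability(score_a: float, score_b: float) -> float:
    """Cross-dataset stability: ratio of the worse score to the better one.

    Assumed definition -- it matches the stability percentages in the
    tables here (e.g. 0.6533 / 0.7930 = 82.4%).
    """
    return min(score_a, score_b) / max(score_a, score_b)


def robust_score(score_a: float, score_b: float) -> float:
    """Worst-case score penalized by instability (assumed definition)."""
    return min(score_a, score_b) * stability(score_a, score_b)


# xgb_tuned_regularization: Dataset A = 0.7423, Dataset B = 0.7705
print(round(stability(0.7423, 0.7705), 3))     # → 0.963
print(round(robust_score(0.7423, 0.7705), 3))  # → 0.715
```

Because robust score here is min²/max, a model only scores well if it is strong on *both* datasets, which is why the single-dataset leaders rank so poorly by it.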

## Top Performers

| Category | Model | Robust Score | Stability |
| --- | --- | --- | --- |
| Best Robust Score | xgb_tuned_regularization | 0.715 | 96.3% |
| Highest Stability | xgb_selective_spectral | 0.643 | 99.7% |
| Fastest Training | xgb_core_7features | 0.606 | 98.0% |
| No ML Required | segment_statistics_only | 0.569 | 95.4% |

## Comparison by Category

### High Stability Models (>95%)

| Model | Stability | Robust Score | Dataset A | Dataset B |
| --- | --- | --- | --- | --- |
| xgb_selective_spectral | 99.7% | 0.643 | 0.6451 | 0.6471 |
| weighted_dynamic_ensemble | 98.4% | 0.664 | 0.6742 | 0.6849 |
| quad_model_ensemble | 98.0% | 0.649 | 0.6756 | 0.6622 |
| xgb_core_7features | 98.0% | 0.606 | 0.6188 | 0.6315 |
| xgb_70_statistical | 97.1% | 0.631 | 0.6685 | 0.6493 |
| xgb_tuned_regularization | 96.3% | 0.715 | 0.7423 | 0.7705 |

### Overfitting Models (High Dataset A, Low Stability)

> **Overfitting observed:** These models achieved high Dataset A performance but failed to generalize.

| Model | Dataset A | Dataset B | Drop | Stability |
| --- | --- | --- | --- | --- |
| gradient_boost_comprehensive | 0.7930 | 0.6533 | -17.6% | 82.4% |
| meta_stacking_7models | 0.7662 | 0.6422 | -16.2% | 83.8% |
| knn_spectral_fft | 0.5793 | 0.4808 | -17.0% | 83.0% |
| knn_wavelet | 0.5812 | 0.4898 | -15.7% | 84.3% |
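The Drop column is the relative change from Dataset A to Dataset B. A minimal sketch of that check, using the table's own numbers; the 10% cutoff for flagging overfitting is illustrative, not from the original analysis:

```python
# (Dataset A, Dataset B) scores from the overfitting table above.
models = {
    "gradient_boost_comprehensive": (0.7930, 0.6533),
    "meta_stacking_7models": (0.7662, 0.6422),
    "knn_spectral_fft": (0.5793, 0.4808),
    "knn_wavelet": (0.5812, 0.4898),
}

for name, (a, b) in models.items():
    drop = (b - a) / a  # negative = performance lost on Dataset B
    if drop < -0.10:    # illustrative threshold, not the official one
        print(f"{name}: {drop:+.1%} -- likely overfit to Dataset A")
```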

### Fast Training Models

| Model | Train Time | Features | Robust Score | Stability |
| --- | --- | --- | --- | --- |
| xgb_core_7features | 40s | 7 | 0.606 | 98.0% |
| segment_statistics_only | 43s | ~20 | 0.569 | 95.4% |
| xgb_30f_fast_inference | 53-142s | 30 | 0.596 | 94.9% |

### Interpretable Models

| Model | Robust Score | Stability | Interpretability |
| --- | --- | --- | --- |
| segment_statistics_only | 0.569 | 95.4% | High (simple statistics) |
| xgb_core_7features | 0.606 | 98.0% | Medium (SHAP available) |
| hypothesis_testing_pure | 0.314 | 76.3% | High, but unstable |

## Model Family Comparison

### Tree-Based Models

| Model | Robust Score | Stability | Notes |
| --- | --- | --- | --- |
| xgb_tuned_regularization | 0.715 | 96.3% | Best overall |
| xgb_selective_spectral | 0.643 | 99.7% | Most stable |
| xgb_core_7features | 0.606 | 98.0% | Fastest |
| gradient_boost_comprehensive | 0.538 | 82.4% | Overfits |

### Neural Networks

| Model | Robust Score | Stability | Notes |
| --- | --- | --- | --- |
| mlp_ensemble_deep_features | 0.647 | 95.3% | Best neural network |
| wavelet_lstm | 0.476 | 95.3% | Near-random AUC |
| hierarchical_transformer | 0.435 | 89.4% | Near-random AUC |

> **Neural networks underperform:** Pure deep learning models (LSTM, Transformer) achieved near-random performance (~0.50 AUC) on the univariate data.

### Ensembles

| Model | Robust Score | Stability | Notes |
| --- | --- | --- | --- |
| weighted_dynamic_ensemble | 0.664 | 98.4% | High stability |
| quad_model_ensemble | 0.649 | 98.0% | Stable |
| meta_stacking_7models | 0.538 | 83.8% | Overfits |

### Reinforcement Learning

| Model | Robust Score | Stability | Notes |
| --- | --- | --- | --- |
| qlearning_rolling_stats | 0.470 | 92.5% | Near random |
| dqn_base_model_selector | 0.469 | 92.6% | Near random |
| qlearning_bayesian_cpd | 0.463 | 91.5% | Near random |

> **RL models near random:** All RL models achieved robust scores below 0.48 with ~0.50 AUC, indicating they failed to learn meaningful patterns from the univariate features.

## Feature Count vs. Stability

| Features | Model | Robust Score | Stability |
| --- | --- | --- | --- |
| 7 | xgb_core_7features | 0.606 | 98.0% |
| 30 | xgb_30f_fast_inference | 0.596 | 94.9% |
| 50 | xgb_70_statistical | 0.631 | 97.1% |
| 70+ | xgb_tuned_regularization | 0.715 | 96.3% |
| 100+ | gradient_boost_comprehensive | 0.538 | 82.4% |
| 339 | meta_stacking_7models | 0.538 | 83.8% |

> **More features ≠ better generalization:** Performance peaked around 70 features with strong regularization; the models with 100+ features showed the lowest stability scores in this comparison.

## Summary Tables

### Top 5 by Robust Score

| Rank | Model | Robust Score | Stability |
| --- | --- | --- | --- |
| 1 | xgb_tuned_regularization | 0.715 | 96.3% |
| 2 | weighted_dynamic_ensemble | 0.664 | 98.4% |
| 3 | quad_model_ensemble | 0.649 | 98.0% |
| 4 | mlp_ensemble_deep_features | 0.647 | 95.3% |
| 5 | xgb_selective_spectral | 0.643 | 99.7% |

### Lowest Performers

| Model | Robust Score | Stability | Issue |
| --- | --- | --- | --- |
| hypothesis_testing_pure | 0.314 | 76.3% | Low stability |
| welch_ttest | 0.333 | 71.9% | Lowest stability |
| knn_spectral_fft | 0.399 | 83.0% | Overfitting |
| hierarchical_transformer | 0.435 | 89.4% | Near-random AUC |
| wavelet_lstm | 0.476 | 95.3% | Near-random AUC |