# Detailed Analysis

> **Note on Results:** These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

In-depth analysis of cross-dataset experiment results, overfitting patterns, and model behavior.
## Performance vs Stability Quadrant

The sweet spot is the upper-right quadrant: high robust score AND high stability. Models below 85% stability (the red line in the quadrant chart) are considered unstable.
## Complete Results Table
| Detector | AUC A | AUC B | AUC Min | AUC Stab | AUC Robust | F1 A | F1 B | F1 Min | F1 Stab | F1 Robust |
|---|---|---|---|---|---|---|---|---|---|---|
| xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 | 0.5172 | 0.5424 | 0.5172 | 95.4% | 0.493 |
| weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 | 0.3000 | 0.3333 | 0.3000 | 90.0% | 0.270 |
| quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 | 0.3500 | 0.3721 | 0.3500 | 94.1% | 0.329 |
| mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 | 0.2105 | 0.3256 | 0.2105 | 64.6% | 0.136 |
| xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 | 0.4444 | 0.4211 | 0.4211 | 94.8% | 0.399 |
| xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 | 0.4615 | 0.4800 | 0.4615 | 96.1% | 0.444 |
| mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 | 0.3636 | 0.4091 | 0.3636 | 88.9% | 0.323 |
| xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 | 0.4675 | 0.4571 | 0.4571 | 97.8% | 0.447 |
| xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 | 0.3500 | 0.3556 | 0.3500 | 98.4% | 0.344 |
| xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 | 0.5231 | 0.3571 | 0.3571 | 68.3% | 0.244 |
| segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 | 0.4688 | 0.4194 | 0.4194 | 89.5% | 0.375 |
| meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 | 0.5417 | 0.3111 | 0.3111 | 57.4% | 0.179 |
| gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 | 0.4186 | 0.3721 | 0.3721 | 88.9% | 0.331 |
| bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 | 0.0625 | 0.0571 | 0.0571 | 91.4% | 0.052 |
| wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 | 0.0645 | 0.0556 | 0.0556 | 86.2% | 0.048 |
| dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 | 0.4211 | 0.3951 | 0.3951 | 93.8% | 0.371 |
| kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 | 0.3250 | 0.3636 | 0.3250 | 89.4% | 0.290 |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 | 0.3175 | 0.3125 | 0.3125 | 98.4% | 0.308 |
| knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 | 0.1778 | 0.2128 | 0.1778 | 83.6% | 0.149 |
| knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 | 0.2051 | 0.2051 | 0.2051 | 100.0% | 0.205 |
| welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 | 0.4167 | 0.3371 | 0.3371 | 80.9% | 0.273 |
Legend:
- AUC A/B: ROC-AUC on Dataset A/B
- AUC Min: Minimum AUC across datasets (worst-case)
- AUC Stab: AUC Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
- AUC Robust: AUC Robust Score = Min(AUC_A, AUC_B) × Stability
- F1 A/B: F1 Score on Dataset A/B
- F1 Min/Stab/Robust: Same calculations for F1 Score
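
To make the legend concrete, here is a minimal sketch of the stability and robust score calculations, using the AUC values of xgb_tuned_regularization from the table above as an example. The function names are illustrative, not part of the competition tooling.

```python
def stability_score(metric_a: float, metric_b: float) -> float:
    """Stability = 1 - |A - B| / max(A, B); 1.0 means identical scores on both datasets."""
    return 1.0 - abs(metric_a - metric_b) / max(metric_a, metric_b)


def robust_score(metric_a: float, metric_b: float) -> float:
    """Robust score = worst-case metric scaled by its stability."""
    return min(metric_a, metric_b) * stability_score(metric_a, metric_b)


# Example: AUC A/B for xgb_tuned_regularization (see the table above)
auc_a, auc_b = 0.7423, 0.7705
print(f"stability = {stability_score(auc_a, auc_b):.1%}")  # ~96.3%
print(f"robust    = {robust_score(auc_a, auc_b):.3f}")     # ~0.715
```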
## Visual Analysis

### Stability Score Ranking
## The Core Finding: Overfitting Is Rampant
Our most important discovery is that single-dataset performance is unreliable. Models that appeared to be top performers on Dataset A failed dramatically on Dataset B.
### Why Some Models Overfit
| Model | Likely Cause | Evidence |
|---|---|---|
| gradient_boost_comprehensive | Too many features, insufficient regularization | 100+ features, complex ensemble |
| meta_stacking_7models | Model complexity, 7 base models | 339 features, 32,000s training |
| knn_spectral_fft | High-dimensional spectral features | 152 FFT features |
| hypothesis_testing_pure | Threshold sensitivity | Hand-tuned weights |
## Stability Score Analysis

The Stability Score measures how consistently a model performs across datasets: Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B), so 100% means identical performance on both datasets.
### Stability Distribution

```text
Stability Score Distribution:

 99-100% │ ██             (2 models)
  95-99% │ ██████████████ (12 models)
  90-95% │ ██████████     (6 models)
  85-90% │ ██             (1 model)
  80-85% │ ████           (2 models)
   < 80% │ ████           (2 models)
          └────────────────────────────
```
## Why Regularization Works

### The Top Performer: xgb_tuned_regularization

This model uses aggressive regularization:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=5,          # Shallow trees (vs 8-12 in others)
    reg_alpha=0.5,        # Strong L1 regularization
    reg_lambda=2.0,       # Strong L2 regularization
    min_child_weight=10,  # Larger minimum leaf weight
)
```
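
A hedged sketch of the cross-dataset evaluation behind the comparison below: train once, then score on both held-out datasets. The variable names (`X_train`, `y_train`, `X_a`, `y_a`, `X_b`, `y_b`) are assumptions standing in for the engineered feature matrices and break labels; the actual competition pipeline is not reproduced here.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical splits: (X_train, y_train) is the training data, while
# (X_a, y_a) and (X_b, y_b) are the two held-out validation datasets.
model.fit(X_train, y_train)

auc_a = roc_auc_score(y_a, model.predict_proba(X_a)[:, 1])
auc_b = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
stability = 1.0 - abs(auc_a - auc_b) / max(auc_a, auc_b)

print(f"AUC A={auc_a:.4f}  AUC B={auc_b:.4f}  stability={stability:.1%}")
```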
### Performance Comparison
| Metric | xgb_tuned (top performer) | gradient_boost (overfit) | Difference |
|---|---|---|---|
| Dataset A AUC | 0.7423 | 0.7930 | -6.4% |
| Dataset B AUC | 0.7705 | 0.6533 | +17.9% |
| Stability | 96.3% | 82.4% | +16.9% |
| Robust Score | 0.715 | 0.538 | +32.9% |
Compared with the overfit gradient_boost_comprehensive, the heavily regularized model loses 6.4% AUC on the in-distribution Dataset A but gains 17.9% on the out-of-distribution Dataset B, for a 32.9% higher robust score.
## Class Imbalance Problem

Structural breaks are rare events, so the classes are heavily imbalanced and models are biased toward the majority "no break" class.
### The All-Zeros Problem

Several models achieved ~70% accuracy by predicting "no break" for everything:
| Model | Dataset A Recall | Dataset B Recall | Behavior |
|---|---|---|---|
| hierarchical_transformer | 0.00 | 0.00 | All zeros |
| wavelet_lstm | 0.00 | 0.00 | All zeros |
| welch_ttest | 0.00 | 0.00 | All zeros |
| qlearning_bayesian_cpd | 0.00 | 0.00 | All zeros |
> **High Accuracy Can Be Misleading:** A model predicting all zeros achieves ~70% accuracy because ~70% of samples have no break. Always check recall and F1 score.
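
The effect is easy to reproduce with scikit-learn metrics on a hypothetical label vector with a 70/30 class split; the numbers below are illustrative and not drawn from the competition data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 70% "no break" (0), 30% "break" (1)
y_true = np.array([0] * 70 + [1] * 30)
y_pred = np.zeros_like(y_true)  # a degenerate "all zeros" detector

print(accuracy_score(y_true, y_pred))  # 0.70 -- looks respectable
print(recall_score(y_true, y_pred))    # 0.00 -- misses every break
print(f1_score(y_true, y_pred))        # 0.00 -- exposes the failure
```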
## Cost Asymmetry

Not all errors are equal in structural break detection:
| Error Type | Description | Cost Level |
|---|---|---|
| False Negative (FN) | Missing a real break | Moderate |
| False Positive (FP) | Predicting break when none exists | Severe |
### Why False Positives Are Costly
In trading applications, a false positive triggers unnecessary position changes:
- Guaranteed transaction fees
- Slippage costs
- Potential losses from wrong positioning
A false negative (missed break) represents a missed opportunity, but doesn't incur direct losses.
> **Metric Implications:** This asymmetry means the F1 score (which penalizes both FP and FN) is critical for model selection.
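
One way to make the asymmetry explicit during model selection is a cost-weighted error count. The sketch below is a hypothetical scoring helper; the 5:1 cost ratio is purely illustrative and was not part of the competition's metric.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def asymmetric_cost(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    """Total error cost with false positives weighted more heavily than false negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * fp_cost + fn * fn_cost


y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1])      # one FP, one FN
print(asymmetric_cost(y_true, y_pred))     # 6.0 -> the single FP dominates the cost
```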
## Why Deep Learning Models Failed

### The Core Problem: Univariate Features Are Insufficient

Deep learning architectures (Transformers, LSTMs, RL agents) are designed to learn relationships between multiple input variables. With only:
- A single univariate time series
- Statistical features derived from that same series
...these models lack the rich, multi-dimensional input space needed to learn meaningful patterns.
### Cross-Dataset Confirmation

The failure is consistent across BOTH datasets, which confirms it is fundamental rather than a dataset artifact:
| Model | Dataset A | Dataset B | Interpretation |
|---|---|---|---|
| hierarchical_transformer | 0.5439 | 0.4862 | Near random |
| wavelet_lstm | 0.5249 | 0.5000 | Near random |
| dqn_base_model_selector | 0.5474 | 0.5067 | Near random |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | Near random |
### LSTM Limitations
- Long-term dependency issues — Fail to memorize long sequential information
- Static forget gates — Cannot revise storage decisions dynamically
- Regime shift blindness — Cannot anticipate breaks without external event data
### Transformer Limitations
- Data hunger — Require large datasets for meaningful attention patterns
- Isolation penalty — Underperform when used alone for numerical prediction
- Class imbalance sensitivity — Cross-entropy loss pushes toward majority class
### RL Agent Limitations
- Weak state space — Univariate features lack discriminative power
- Noisy rewards — Without external context, reward signals are unreliable
- Policy collapse — Q-values converge to uninformative policies
### What Would Help

To leverage these architectures effectively, add exogenous variables (a sketch follows the table below):
| Variable Type | Examples |
|---|---|
| Correlated series | Related assets, sector indices |
| Macroeconomic | Interest rates, VIX, GDP |
| Sentiment | News sentiment, social media |
| Technical | Volume, bid-ask spread |
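
For illustration only, a minimal sketch of what such a multivariate input frame could look like, using synthetic placeholder series (`target`, `sector_index`, `vix`, `volume`) rather than real market data:

```python
import numpy as np
import pandas as pd

# Hypothetical exogenous series; in practice these would be real market data.
idx = pd.date_range("2020-01-01", periods=500, freq="D")
rng = np.random.default_rng(0)
target = pd.Series(rng.standard_normal(500).cumsum(), index=idx)
sector_index = pd.Series(rng.standard_normal(500).cumsum(), index=idx)
vix = pd.Series(15 + rng.standard_normal(500), index=idx)
volume = pd.Series(rng.integers(1_000, 10_000, 500), index=idx)

# A multivariate frame a sequence model could consume as rolling windows,
# instead of the single univariate series used in the experiments above.
features = pd.DataFrame({
    "target_diff": target.diff(),              # the original series
    "sector_diff": sector_index.diff(),        # correlated series
    "vix": vix,                                # macroeconomic / risk proxy
    "volume": volume,                          # technical signal
}).dropna()
print(features.head())
```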
## Why Tree-Based Models Won

Tree-based ensembles (XGBoost, Gradient Boosting, Random Forest) excel here because:
- Handcrafted features work well — Statistical features explicitly encode distributional differences (see the feature-extraction sketch below)
- No feature learning needed — Trees directly use provided features
- Robust to noise — Regularization prevents overfitting
- Ensemble diversity — Different tree configurations capture different patterns
> **Key Insight:** Deep learning needs raw, multi-dimensional input to learn representations. Tree-based models excel at using handcrafted statistical features directly.
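
As an illustration of the kind of handcrafted statistical features these models consume, the sketch below derives simple pre/post-boundary distributional features from a univariate series. It is a simplified stand-in (echoing detectors such as segment_statistics_only, welch_ttest, and kolmogorov_smirnov_xgb), not the exact feature set used in the experiments.

```python
import numpy as np
from scipy import stats


def break_features(series: np.ndarray, boundary: int) -> dict:
    """Simple distributional-difference features around a candidate break point."""
    left, right = series[:boundary], series[boundary:]
    t_res = stats.ttest_ind(left, right, equal_var=False)  # Welch's t-test
    ks_res = stats.ks_2samp(left, right)                   # Kolmogorov-Smirnov
    return {
        "mean_diff": float(right.mean() - left.mean()),
        "std_ratio": float(right.std() / (left.std() + 1e-12)),
        "welch_t": float(t_res.statistic),
        "ks_stat": float(ks_res.statistic),
    }


# Toy example: a mean shift halfway through a synthetic series
rng = np.random.default_rng(42)
series = np.concatenate([rng.normal(0, 1, 200), rng.normal(0.8, 1, 200)])
print(break_features(series, boundary=200))
```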
## Training Time vs. Stability Trade-off
| Model | Train Time | Stability | Robust Score | Verdict |
|---|---|---|---|---|
| xgb_core_7features | 40s | 98.0% | 0.606 | Fast & stable |
| xgb_tuned_regularization | 60-185s | 96.3% | 0.715 | Best overall |
| gradient_boost_comprehensive | 179-451s | 82.4% | 0.538 | Overfits |
| meta_stacking_7models | 332-32,000s | 83.8% | 0.538 | Overfits |
| qlearning_bayesian_cpd | 22,662-80,900s | 91.5% | 0.463 | Slow & weak |
**Conclusion:** More training time does NOT correlate with better generalization. In fact, the longest-training models (meta_stacking_7models, qlearning_bayesian_cpd) have among the weakest robust scores.
## Top Performers by Category
| Category | Model | Key Metrics |
|---|---|---|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable, 60-185s training |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |
## Lowest Performers
| Category | Models | Findings |
|---|---|---|
| Low Stability | hypothesis_testing_pure, welch_ttest | 76.3% and 71.9% stability |
| Near-Random AUC | Deep learning models (LSTM, Transformer) | ~0.50 AUC on both datasets |
| Overfitting | gradient_boost_comprehensive, meta_stacking_7models | High AUC on Dataset A, sharp drop on Dataset B |