Detailed Analysis

Note on Results

These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

In-depth analysis of cross-dataset experiment results, overfitting patterns, and model behavior.

Performance vs Stability Quadrant

The sweet spot is the upper-right: high robust score AND high stability. Models below the red line (85% stability) are considered unstable.
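
The exact chart is not reproduced here, but a minimal matplotlib sketch shows how such a quadrant view can be built from the results table below. The axis layout (robust score on x, stability on y, with a horizontal red line at 85%) is an assumption based on the description above, and only a few representative detectors are plotted:

# Illustrative sketch of the performance-vs-stability quadrant, not the
# original plotting code. Points are (robust score, stability) pairs taken
# from the results table below; the axis layout is assumed from the caption.
import matplotlib.pyplot as plt

detectors = {
    "xgb_tuned_regularization": (0.715, 96.3),
    "xgb_selective_spectral": (0.643, 99.7),
    "gradient_boost_comprehensive": (0.538, 82.4),
    "welch_ttest": (0.333, 71.9),
}

fig, ax = plt.subplots()
for name, (robust, stability) in detectors.items():
    ax.scatter(robust, stability)
    ax.annotate(name, (robust, stability), fontsize=8)

ax.axhline(85, color="red", linestyle="--", label="85% stability threshold")
ax.set_xlabel("Robust score (min AUC x stability)")
ax.set_ylabel("Stability (%)")
ax.legend()
plt.show()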

Complete Results Table

| Detector | AUC A | AUC B | AUC Min | AUC Stab | AUC Robust | F1 A | F1 B | F1 Min | F1 Stab | F1 Robust |
|---|---|---|---|---|---|---|---|---|---|---|
| xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 | 0.5172 | 0.5424 | 0.5172 | 95.4% | 0.493 |
| weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 | 0.3000 | 0.3333 | 0.3000 | 90.0% | 0.270 |
| quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 | 0.3500 | 0.3721 | 0.3500 | 94.1% | 0.329 |
| mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 | 0.2105 | 0.3256 | 0.2105 | 64.6% | 0.136 |
| xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 | 0.4444 | 0.4211 | 0.4211 | 94.8% | 0.399 |
| xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 | 0.4615 | 0.4800 | 0.4615 | 96.1% | 0.444 |
| mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 | 0.3636 | 0.4091 | 0.3636 | 88.9% | 0.323 |
| xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 | 0.4675 | 0.4571 | 0.4571 | 97.8% | 0.447 |
| xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 | 0.3500 | 0.3556 | 0.3500 | 98.4% | 0.344 |
| xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 | 0.5231 | 0.3571 | 0.3571 | 68.3% | 0.244 |
| segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 | 0.4688 | 0.4194 | 0.4194 | 89.5% | 0.375 |
| meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 | 0.5417 | 0.3111 | 0.3111 | 57.4% | 0.179 |
| gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 | 0.4186 | 0.3721 | 0.3721 | 88.9% | 0.331 |
| bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 | 0.0625 | 0.0571 | 0.0571 | 91.4% | 0.052 |
| wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 | 0.0645 | 0.0556 | 0.0556 | 86.2% | 0.048 |
| dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 | 0.4211 | 0.3951 | 0.3951 | 93.8% | 0.371 |
| kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 | 0.3250 | 0.3636 | 0.3250 | 89.4% | 0.290 |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 | 0.3175 | 0.3125 | 0.3125 | 98.4% | 0.308 |
| knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 | 0.1778 | 0.2128 | 0.1778 | 83.6% | 0.149 |
| knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 | 0.2051 | 0.2051 | 0.2051 | 100.0% | 0.205 |
| welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 | 0.4167 | 0.3371 | 0.3371 | 80.9% | 0.273 |

Legend:

  • AUC A/B: ROC-AUC on Dataset A/B
  • AUC Min: Minimum AUC across datasets (worst-case)
  • AUC Stab: AUC Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
  • AUC Robust: AUC Robust Score = Min(AUC_A, AUC_B) × Stability
  • F1 A/B: F1 Score on Dataset A/B
  • F1 Min/Stab/Robust: Same calculations for F1 Score

Visual Analysis

Stability Score Ranking

The Core Finding: Overfitting Is Rampant

Our most important discovery is that single-dataset performance is unreliable. Models that appeared to be top performers on Dataset A failed dramatically on Dataset B.

Why Some Models Overfit

| Model | Likely Cause | Evidence |
|---|---|---|
| gradient_boost_comprehensive | Too many features, insufficient regularization | 100+ features, complex ensemble |
| meta_stacking_7models | Model complexity, 7 base models | 339 features, 32,000s training |
| knn_spectral_fft | High-dimensional spectral features | 152 FFT features |
| hypothesis_testing_pure | Threshold sensitivity | Hand-tuned weights |

Stability Score Analysis

The Stability Score measures how consistently a model performs across datasets:

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
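
As a concrete check, a small helper (not the official scoring code) can reproduce the stability and robust scores defined in the legend for any pair of per-dataset scores:

# Sketch of the stability and robust score formulas defined above;
# an illustrative helper, not the official scoring code.

def stability(score_a: float, score_b: float) -> float:
    # Stability = 1 - |A - B| / max(A, B)
    return 1.0 - abs(score_a - score_b) / max(score_a, score_b)

def robust(score_a: float, score_b: float) -> float:
    # Robust = min(A, B) * Stability
    return min(score_a, score_b) * stability(score_a, score_b)

# Example: the AUC pair of xgb_tuned_regularization from the results table
auc_a, auc_b = 0.7423, 0.7705
print(f"stability: {stability(auc_a, auc_b):.3f}")  # ~0.963
print(f"robust:    {robust(auc_a, auc_b):.3f}")     # ~0.715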

Stability Distribution

Stability Score Distribution:

99-100% │ ██                          (2 models)
95-99%  │ ██████████████              (12 models)
90-95%  │ ██████████                  (6 models)
85-90%  │ ██                          (1 model)
80-85%  │ ████                        (2 models)
< 80%   │ ████                        (2 models)
        └────────────────────────────

Why Regularization Works

The Top Performer: xgb_tuned_regularization

This model uses aggressive regularization:

from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=5,           # Shallow trees (vs 8-12 in others)
    reg_alpha=0.5,         # Strong L1 regularization
    reg_lambda=2.0,        # Strong L2 regularization
    min_child_weight=10,   # Larger minimum leaf weight
)

Performance Comparison

| Metric | xgb_tuned (top performer) | gradient_boost (overfit) | Relative Difference |
|---|---|---|---|
| Dataset A AUC | 0.7423 | 0.7930 | -6.4% |
| Dataset B AUC | 0.7705 | 0.6533 | +17.9% |
| Stability | 96.3% | 82.4% | +16.9% |
| Robust Score | 0.715 | 0.538 | +32.9% |

Relative to gradient_boost_comprehensive, the heavily regularized model gave up 6.4% AUC on the training-like Dataset A but gained 17.9% on the out-of-distribution Dataset B.

Class Imbalance Problem

Structural breaks are rare events, so the labels are heavily skewed toward the "no break" class, and models are biased toward predicting that majority class.

The All-Zeros Problem

Several models achieved ~70% accuracy by predicting "no break" for everything:

| Model | Dataset A Recall | Dataset B Recall | Behavior |
|---|---|---|---|
| hierarchical_transformer | 0.00 | 0.00 | All zeros |
| wavelet_lstm | 0.00 | 0.00 | All zeros |
| welch_ttest | 0.00 | 0.00 | All zeros |
| qlearning_bayesian_cpd | 0.00 | 0.00 | All zeros |

High Accuracy Can Be Misleading

A model predicting all zeros achieves 70% accuracy because ~70% of samples have no break. Always check recall and F1 score.
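
A short sketch with made-up labels (70% no-break, 30% break) illustrates the trap, using standard scikit-learn metrics:

# Sketch: why accuracy is misleading under class imbalance.
# The labels below are synthetic, with ~70% "no break" (0) and ~30% "break" (1).
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([0] * 70 + [1] * 30)
y_all_zeros = np.zeros_like(y_true)  # a model that never predicts a break

print(accuracy_score(y_true, y_all_zeros))                 # 0.70 -- looks fine
print(recall_score(y_true, y_all_zeros, zero_division=0))  # 0.0
print(f1_score(y_true, y_all_zeros, zero_division=0))      # 0.0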

Cost Asymmetry

Not all errors are equal in structural break detection:

| Error Type | Description | Cost Level |
|---|---|---|
| False Negative (FN) | Missing a real break | Moderate |
| False Positive (FP) | Predicting a break when none exists | Severe |

Why False Positives Are Costly

In trading applications, a false positive triggers unnecessary position changes:

  • Guaranteed transaction fees
  • Slippage costs
  • Potential losses from wrong positioning

A false negative (missed break) represents a missed opportunity, but doesn't incur direct losses.

Metric Implications

This asymmetry means F1 score (which penalizes both FP and FN) is critical for model selection.
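
One way to make the asymmetry explicit during model comparison is to weight the two error types directly. The sketch below uses hypothetical unit costs (3:1 against false positives) purely for illustration; it is not the evaluation used in the tables above:

# Sketch: scoring a detector's errors with asymmetric costs.
# The unit costs are hypothetical placeholders that encode the
# "FP severe, FN moderate" ordering described above.
from sklearn.metrics import confusion_matrix

FP_COST = 3.0  # hypothetical: fees, slippage, wrong positioning
FN_COST = 1.0  # hypothetical: missed opportunity

def expected_error_cost(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return FP_COST * fp + FN_COST * fn

y_true   = [0, 0, 0, 0, 1, 1]
fp_heavy = [1, 1, 0, 0, 1, 1]   # 2 false positives, 0 false negatives
fn_heavy = [0, 0, 0, 0, 0, 0]   # 0 false positives, 2 false negatives
print(expected_error_cost(y_true, fp_heavy))  # 6.0 -- costly
print(expected_error_cost(y_true, fn_heavy))  # 2.0 -- cheaper, but misses every break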

Why Deep Learning Models Failed

The Core Problem: Univariate Features Are Insufficient

Deep learning architectures (Transformers, LSTMs, RL agents) are designed to learn relationships between multiple input variables. With only:

  • A single univariate time series
  • Statistical features derived from that same series

...these models lack the rich, multi-dimensional input space needed to learn meaningful patterns.

Cross-Dataset Confirmation

The failure is consistent across BOTH datasets, confirming it's fundamental:

| Model | Dataset A AUC | Dataset B AUC | Interpretation |
|---|---|---|---|
| hierarchical_transformer | 0.5439 | 0.4862 | Near random |
| wavelet_lstm | 0.5249 | 0.5000 | Near random |
| dqn_base_model_selector | 0.5474 | 0.5067 | Near random |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | Near random |

LSTM Limitations

  1. Long-term dependency issues — Fail to memorize long sequential information
  2. Static forget gates — Cannot revise storage decisions dynamically
  3. Regime shift blindness — Cannot anticipate breaks without external event data

Transformer Limitations

  1. Data hunger — Require large datasets for meaningful attention patterns
  2. Isolation penalty — Underperform when used alone for numerical prediction
  3. Class imbalance sensitivity — Cross-entropy loss pushes toward majority class

RL Agent Limitations

  1. Weak state space — Univariate features lack discriminative power
  2. Noisy rewards — Without external context, reward signals are unreliable
  3. Policy collapse — Q-values converge to uninformative policies

What Would Help

To leverage these architectures effectively, add exogenous variables:

| Variable Type | Examples |
|---|---|
| Correlated series | Related assets, sector indices |
| Macroeconomic | Interest rates, VIX, GDP |
| Sentiment | News sentiment, social media |
| Technical | Volume, bid-ask spread |

Why Tree-Based Models Won

Tree-based ensembles (XGBoost, Gradient Boosting, Random Forest) excel because:

  1. Handcrafted features work well — Statistical features explicitly encode distributional differences
  2. No feature learning needed — Trees directly use provided features
  3. Robust to noise — Regularization prevents overfitting
  4. Ensemble diversity — Different tree configurations capture different patterns

Key Insight

Deep learning needs raw, multi-dimensional input to learn representations. Tree-based models excel at using handcrafted statistical features directly.
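
To make the handcrafted-feature point concrete, the sketch below derives a few segment statistics that explicitly compare the distribution before and after a candidate boundary. The specific features and the single fixed boundary are illustrative assumptions, not the exact feature set used by the detectors above:

# Sketch: handcrafted statistical features for a univariate series.
# Feature choices and the fixed boundary are illustrative assumptions.
import numpy as np
from scipy import stats

def segment_features(series: np.ndarray, boundary: int) -> dict:
    left, right = series[:boundary], series[boundary:]
    t_stat, _ = stats.ttest_ind(left, right, equal_var=False)  # Welch's t-test
    ks_stat, _ = stats.ks_2samp(left, right)                   # Kolmogorov-Smirnov
    return {
        "mean_diff": float(np.mean(right) - np.mean(left)),
        "std_ratio": float(np.std(right) / (np.std(left) + 1e-12)),
        "welch_t": float(t_stat),
        "ks_stat": float(ks_stat),
    }

# Each series becomes one feature row; a gradient-boosted tree classifier
# is then trained on these rows against the break/no-break labels.

Judging by their names, detectors such as segment_statistics_only and kolmogorov_smirnov_xgb appear to rely on exactly this kind of feature.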

Training Time vs. Stability Trade-off

| Model | Train Time | Stability | Robust Score | Verdict |
|---|---|---|---|---|
| xgb_core_7features | 40s | 98.0% | 0.606 | Fast & stable |
| xgb_tuned_regularization | 60-185s | 96.3% | 0.715 | Best overall |
| gradient_boost_comprehensive | 179-451s | 82.4% | 0.538 | Overfits |
| meta_stacking_7models | 332-32,000s | 83.8% | 0.538 | Overfits |
| qlearning_bayesian_cpd | 22,662-80,900s | 91.5% | 0.463 | Slow & weak |

Conclusion: More training time does NOT correlate with better generalization. In fact, the longest-training models (meta_stacking, qlearning_bayesian) have the worst robust scores.

Top Performers by Category

| Category | Model | Key Metrics |
|---|---|---|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable, 60-185s training |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |

Lowest Performers

| Category | Models | Findings |
|---|---|---|
| Low Stability | hypothesis_testing_pure, welch_ttest | 76.3% and 71.9% stability |
| Near-Random AUC | Deep learning models (LSTM, Transformer) | ~0.50 AUC on both datasets |
| Overfitting | gradient_boost_comprehensive, meta_stacking | High AUC on Dataset A, sharp drop on Dataset B |