# Detailed Analysis

> **Note on Results:** These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

In-depth analysis of cross-dataset experiment results, overfitting patterns, and model behavior.
## Performance vs Stability Quadrant

The sweet spot is the upper-right quadrant: high robust score AND high stability. Models below 85% stability (the red line in the quadrant chart) are considered unstable.
## Complete Results Table
| Detector | AUC A | AUC B | AUC Min | AUC Stab | AUC Robust | F1 A | F1 B | F1 Min | F1 Stab | F1 Robust |
|---|---|---|---|---|---|---|---|---|---|---|
| xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 | 0.5172 | 0.5424 | 0.5172 | 95.4% | 0.493 |
| weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 | 0.3000 | 0.3333 | 0.3000 | 90.0% | 0.270 |
| quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 | 0.3500 | 0.3721 | 0.3500 | 94.1% | 0.329 |
| mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 | 0.2105 | 0.3256 | 0.2105 | 64.6% | 0.136 |
| xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 | 0.4444 | 0.4211 | 0.4211 | 94.8% | 0.399 |
| xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 | 0.4615 | 0.4800 | 0.4615 | 96.1% | 0.444 |
| mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 | 0.3636 | 0.4091 | 0.3636 | 88.9% | 0.323 |
| xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 | 0.4675 | 0.4571 | 0.4571 | 97.8% | 0.447 |
| xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 | 0.3500 | 0.3556 | 0.3500 | 98.4% | 0.344 |
| xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 | 0.5231 | 0.3571 | 0.3571 | 68.3% | 0.244 |
| segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 | 0.4688 | 0.4194 | 0.4194 | 89.5% | 0.375 |
| meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 | 0.5417 | 0.3111 | 0.3111 | 57.4% | 0.179 |
| gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 | 0.4186 | 0.3721 | 0.3721 | 88.9% | 0.331 |
| bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 | 0.0625 | 0.0571 | 0.0571 | 91.4% | 0.052 |
| wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 | 0.0645 | 0.0556 | 0.0556 | 86.2% | 0.048 |
| dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 | 0.4211 | 0.3951 | 0.3951 | 93.8% | 0.371 |
| kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 | 0.3250 | 0.3636 | 0.3250 | 89.4% | 0.290 |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 | 0.3175 | 0.3125 | 0.3125 | 98.4% | 0.308 |
| knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 | 0.1778 | 0.2128 | 0.1778 | 83.6% | 0.149 |
| knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 | 0.2051 | 0.2051 | 0.2051 | 100.0% | 0.205 |
| welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 | 0.0000 | 0.0000 | 0.0000 | N/A | 0.000 |
| hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 | 0.4167 | 0.3371 | 0.3371 | 80.9% | 0.273 |
Legend:
- AUC A/B: ROC-AUC on Dataset A/B
- AUC Min: Minimum AUC across datasets (worst-case)
- AUC Stab: AUC Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
- AUC Robust: AUC Robust Score = Min(AUC_A, AUC_B) × Stability
- F1 A/B: F1 Score on Dataset A/B
- F1 Min/Stab/Robust: Same calculations for F1 Score
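
To make the legend concrete, here is a minimal sketch of the stability and robust score calculations, using the AUC values of xgb_tuned_regularization from the table above as an example. The function names are illustrative, not part of the competition tooling.

```python
def stability_score(metric_a: float, metric_b: float) -> float:
    """Stability = 1 - |A - B| / max(A, B); 1.0 means identical scores on both datasets."""
    return 1.0 - abs(metric_a - metric_b) / max(metric_a, metric_b)


def robust_score(metric_a: float, metric_b: float) -> float:
    """Robust score = worst-case metric scaled by its stability."""
    return min(metric_a, metric_b) * stability_score(metric_a, metric_b)


# Example: AUC A/B for xgb_tuned_regularization (see the table above)
auc_a, auc_b = 0.7423, 0.7705
print(f"stability = {stability_score(auc_a, auc_b):.1%}")  # ~96.3%
print(f"robust    = {robust_score(auc_a, auc_b):.3f}")     # ~0.715
```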
## Visual Analysis

### Stability Score Ranking
## The Core Finding: Overfitting Is Rampant
Our most important discovery is that single-dataset performance is unreliable. Models that appeared to be top performers on Dataset A failed dramatically on Dataset B.
### Why Some Models Overfit
| Model | Likely Cause | Evidence |
|---|---|---|
| gradient_boost_comprehensive | Too many features, insufficient regularization | 100+ features, complex ensemble |
| meta_stacking_7models | Model complexity, 7 base models | 339 features, 32,000s training |
| knn_spectral_fft | High-dimensional spectral features | 152 FFT features |
| hypothesis_testing_pure | Threshold sensitivity | Hand-tuned weights |
## Stability Score Analysis

The Stability Score measures how consistently a model performs across datasets: Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B), so 100% means identical performance on both datasets.
### Stability Distribution

```text
Stability Score Distribution:

 99-100% │ ██             (2 models)
  95-99% │ ██████████████ (12 models)
  90-95% │ ██████████     (6 models)
  85-90% │ ██             (1 model)
  80-85% │ ████           (2 models)
   < 80% │ ████           (2 models)
          └────────────────────────────
```
## Why Regularization Works

### The Top Performer: xgb_tuned_regularization

This model uses aggressive regularization:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=5,          # Shallow trees (vs 8-12 in others)
    reg_alpha=0.5,        # Strong L1 regularization
    reg_lambda=2.0,       # Strong L2 regularization
    min_child_weight=10,  # Larger minimum leaf weight
)
```
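
A hedged sketch of the cross-dataset evaluation behind the comparison below: train once, then score on both held-out datasets. The variable names (`X_train`, `y_train`, `X_a`, `y_a`, `X_b`, `y_b`) are assumptions standing in for the engineered feature matrices and break labels; the actual competition pipeline is not reproduced here.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical splits: (X_train, y_train) is the training data, while
# (X_a, y_a) and (X_b, y_b) are the two held-out validation datasets.
model.fit(X_train, y_train)

auc_a = roc_auc_score(y_a, model.predict_proba(X_a)[:, 1])
auc_b = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
stability = 1.0 - abs(auc_a - auc_b) / max(auc_a, auc_b)

print(f"AUC A={auc_a:.4f}  AUC B={auc_b:.4f}  stability={stability:.1%}")
```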
### Performance Comparison
| Metric | xgb_tuned (top performer) | gradient_boost (overfit) | Difference |
|---|---|---|---|
| Dataset A AUC | 0.7423 | 0.7930 | -6.4% |
| Dataset B AUC | 0.7705 | 0.6533 | +17.9% |
| Stability | 96.3% | 82.4% | +16.9% |
| Robust Score | 0.715 | 0.538 | +32.9% |
Compared with the overfit gradient_boost_comprehensive, the heavily regularized model loses 6.4% AUC on the in-distribution Dataset A but gains 17.9% on the out-of-distribution Dataset B, for a 32.9% higher robust score.
## Class Imbalance Problem

Structural breaks are rare events, so the classes are heavily imbalanced and models are biased toward the majority "no break" class.
### The All-Zeros Problem

Several models achieved ~70% accuracy by predicting "no break" for everything:
| Model | Dataset A Recall | Dataset B Recall | Behavior |
|---|---|---|---|
| hierarchical_transformer | 0.00 | 0.00 | All zeros |
| wavelet_lstm | 0.00 | 0.00 | All zeros |
| welch_ttest | 0.00 | 0.00 | All zeros |
| qlearning_bayesian_cpd | 0.00 | 0.00 | All zeros |
> **High Accuracy Can Be Misleading:** A model predicting all zeros achieves ~70% accuracy because ~70% of samples have no break. Always check recall and F1 score.
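
The effect is easy to reproduce with scikit-learn metrics on a hypothetical label vector with a 70/30 class split; the numbers below are illustrative and not drawn from the competition data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical labels: 70% "no break" (0), 30% "break" (1)
y_true = np.array([0] * 70 + [1] * 30)
y_pred = np.zeros_like(y_true)  # a degenerate "all zeros" detector

print(accuracy_score(y_true, y_pred))  # 0.70 -- looks respectable
print(recall_score(y_true, y_pred))    # 0.00 -- misses every break
print(f1_score(y_true, y_pred))        # 0.00 -- exposes the failure
```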
## Cost Asymmetry

Not all errors are equal in structural break detection:
| Error Type | Description | Cost Level |
|---|---|---|
| False Negative (FN) | Missing a real break | Moderate |
| False Positive (FP) | Predicting break when none exists | Severe |
### Why False Positives Are Costly
In trading applications, a false positive triggers unnecessary position changes:
- Guaranteed transaction fees
- Slippage costs
- Potential losses from wrong positioning
A false negative (missed break) represents a missed opportunity, but doesn't incur direct losses.
> **Metric Implications:** This asymmetry means the F1 score (which penalizes both FP and FN) is critical for model selection.
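
One way to make the asymmetry explicit during model selection is a cost-weighted error count. The sketch below is a hypothetical scoring helper; the 5:1 cost ratio is purely illustrative and was not part of the competition's metric.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def asymmetric_cost(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    """Total error cost with false positives weighted more heavily than false negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * fp_cost + fn * fn_cost


y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1])      # one FP, one FN
print(asymmetric_cost(y_true, y_pred))     # 6.0 -> the single FP dominates the cost
```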
## Why Deep Learning Models Failed

### The Core Problem: Univariate Features Are Insufficient

Deep learning architectures (Transformers, LSTMs, RL agents) are designed to learn relationships between multiple input variables. With only:
- A single univariate time series
- Statistical features derived from that same series
...these models lack the rich, multi-dimensional input space needed to learn meaningful patterns.
### Cross-Dataset Confirmation

The failure is consistent across BOTH datasets, which confirms it is fundamental rather than a dataset artifact:
| Model | Dataset A | Dataset B | Interpretation |
|---|---|---|---|
| hierarchical_transformer | 0.5439 | 0.4862 | Near random |
| wavelet_lstm | 0.5249 | 0.5000 | Near random |
| dqn_base_model_selector | 0.5474 | 0.5067 | Near random |
| qlearning_bayesian_cpd | 0.5540 | 0.5067 | Near random |
### LSTM Limitations
- Long-term dependency issues — Fail to memorize long sequential information
- Static forget gates — Cannot revise storage decisions dynamically
- Regime shift blindness — Cannot anticipate breaks without external event data
### Transformer Limitations
- Data hunger — Require large datasets for meaningful attention patterns
- Isolation penalty — Underperform when used alone for numerical prediction
- Class imbalance sensitivity — Cross-entropy loss pushes toward majority class
### RL Agent Limitations
- Weak state space — Univariate features lack discriminative power
- Noisy rewards — Without external context, reward signals are unreliable
- Policy collapse — Q-values converge to uninformative policies
### What Would Help

To leverage these architectures effectively, add exogenous variables (a sketch follows the table below):
| Variable Type | Examples |
|---|---|
| Correlated series | Related assets, sector indices |
| Macroeconomic | Interest rates, VIX, GDP |
| Sentiment | News sentiment, social media |
| Technical | Volume, bid-ask spread |
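
For illustration only, a minimal sketch of what such a multivariate input frame could look like, using synthetic placeholder series (`target`, `sector_index`, `vix`, `volume`) rather than real market data:

```python
import numpy as np
import pandas as pd

# Hypothetical exogenous series; in practice these would be real market data.
idx = pd.date_range("2020-01-01", periods=500, freq="D")
rng = np.random.default_rng(0)
target = pd.Series(rng.standard_normal(500).cumsum(), index=idx)
sector_index = pd.Series(rng.standard_normal(500).cumsum(), index=idx)
vix = pd.Series(15 + rng.standard_normal(500), index=idx)
volume = pd.Series(rng.integers(1_000, 10_000, 500), index=idx)

# A multivariate frame a sequence model could consume as rolling windows,
# instead of the single univariate series used in the experiments above.
features = pd.DataFrame({
    "target_diff": target.diff(),              # the original series
    "sector_diff": sector_index.diff(),        # correlated series
    "vix": vix,                                # macroeconomic / risk proxy
    "volume": volume,                          # technical signal
}).dropna()
print(features.head())
```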
## Why Tree-Based Models Won

Tree-based ensembles (XGBoost, Gradient Boosting, Random Forest) excel here because:
- Handcrafted features work well — Statistical features explicitly encode distributional differences (see the feature-extraction sketch below)
- No feature learning needed — Trees directly use provided features
- Robust to noise — Regularization prevents overfitting
- Ensemble diversity — Different tree configurations capture different patterns
> **Key Insight:** Deep learning needs raw, multi-dimensional input to learn representations. Tree-based models excel at using handcrafted statistical features directly.
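
As an illustration of the kind of handcrafted statistical features these models consume, the sketch below derives simple pre/post-boundary distributional features from a univariate series. It is a simplified stand-in (echoing detectors such as segment_statistics_only, welch_ttest, and kolmogorov_smirnov_xgb), not the exact feature set used in the experiments.

```python
import numpy as np
from scipy import stats


def break_features(series: np.ndarray, boundary: int) -> dict:
    """Simple distributional-difference features around a candidate break point."""
    left, right = series[:boundary], series[boundary:]
    t_res = stats.ttest_ind(left, right, equal_var=False)  # Welch's t-test
    ks_res = stats.ks_2samp(left, right)                   # Kolmogorov-Smirnov
    return {
        "mean_diff": float(right.mean() - left.mean()),
        "std_ratio": float(right.std() / (left.std() + 1e-12)),
        "welch_t": float(t_res.statistic),
        "ks_stat": float(ks_res.statistic),
    }


# Toy example: a mean shift halfway through a synthetic series
rng = np.random.default_rng(42)
series = np.concatenate([rng.normal(0, 1, 200), rng.normal(0.8, 1, 200)])
print(break_features(series, boundary=200))
```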
## Training Time vs. Stability Trade-off
| Model | Train Time | Stability | Robust Score | Verdict |
|---|---|---|---|---|
| xgb_core_7features | 40s | 98.0% | 0.606 | Fast & stable |
| xgb_tuned_regularization | 60-185s | 96.3% | 0.715 | Best overall |
| gradient_boost_comprehensive | 179-451s | 82.4% | 0.538 | Overfits |
| meta_stacking_7models | 332-32,000s | 83.8% | 0.538 | Overfits |
| qlearning_bayesian_cpd | 22,662-80,900s | 91.5% | 0.463 | Slow & weak |
**Conclusion:** More training time does NOT correlate with better generalization. In fact, the longest-training models (meta_stacking_7models, qlearning_bayesian_cpd) have among the weakest robust scores.
## Top Performers by Category
| Category | Model | Key Metrics |
|---|---|---|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable, 60-185s training |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |
## Lowest Performers
| Category | Models | Findings |
|---|---|---|
| Low Stability | hypothesis_testing_pure, welch_ttest | 76.3% and 71.9% stability |
| Near-Random AUC | Deep learning models (LSTM, Transformer) | ~0.50 AUC on both datasets |
| Overfitting | gradient_boost_comprehensive, meta_stacking_7models | High AUC on Dataset A, sharp drop on Dataset B |