# Benchmarking

Run comprehensive benchmarks to compare detector performance.

## Quick Benchmark (Top 5 Detectors)

The quick benchmark runs only the top-performing detectors for faster evaluation; a minimal sketch is shown below.
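The project's actual entry point is not shown here, so the sketch below is only an illustration: it times a fit/score loop over a stand-in detector registry built from scikit-learn estimators and synthetic data. Substitute the project's real detector factories and dataset split.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in registry (assumption): in the real project this would map names such
# as "xgb_tuned" or "weighted_ensemble" to the project's own detector factories.
DETECTORS = {
    "xgb_tuned": lambda: GradientBoostingClassifier(random_state=0),
    "weighted_ensemble": lambda: LogisticRegression(max_iter=1000),
}

# Stand-in data (assumption): replace with the project's structural-break dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []
for name, factory in DETECTORS.items():
    model = factory()

    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    scores = model.predict_proba(X_test)[:, 1]  # predicted break probability
    eval_time = time.perf_counter() - t0

    results.append({
        "detector": name,
        "train_time": round(train_time, 3),
        "eval_time": round(eval_time, 3),
        "roc_auc": round(roc_auc_score(y_test, scores), 4),
    })

print(results)
```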
## Full Benchmark (All 25 Detectors)

**Time Required:** The full benchmark takes several hours; `meta_stacking_7models` alone requires ~9 hours.
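For a run measured in hours it is worth writing each result as soon as a detector finishes and recording failures in the `status` column instead of aborting. The sketch below is one way to do that; it assumes the same stand-in `DETECTORS` registry and train/test split as the quick-benchmark example above.

```python
import csv
import time

from sklearn.metrics import roc_auc_score

FIELDS = ["detector", "train_time", "eval_time", "status", "roc_auc"]

with open("benchmark_results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for name, factory in DETECTORS.items():
        row = {"detector": name, "status": "success"}
        try:
            model = factory()

            t0 = time.perf_counter()
            model.fit(X_train, y_train)
            row["train_time"] = time.perf_counter() - t0

            t0 = time.perf_counter()
            scores = model.predict_proba(X_test)[:, 1]
            row["eval_time"] = time.perf_counter() - t0
            row["roc_auc"] = roc_auc_score(y_test, scores)
        except Exception:
            # Record the failure and keep going rather than losing the whole run.
            row["status"] = "error"
        writer.writerow(row)
        fh.flush()  # persist each row immediately
```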
## Output Format
Results are saved as CSV with the following columns:
| Column | Description |
|---|---|
| `detector` | Model name |
| `train_time` | Training time in seconds |
| `eval_time` | Inference time in seconds |
| `status` | `success` / `error` |
| `TP`, `FP`, `TN`, `FN` | Confusion matrix counts |
| `accuracy` | Overall accuracy |
| `recall` | True positive rate |
| `f1_score` | Harmonic mean of precision and recall |
| `roc_auc` | Area under the ROC curve |
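Assuming the results were written to a file such as `benchmark_results.csv` (the exact path depends on how you ran the benchmark), a quick way to rank the successful detectors with pandas:

```python
import pandas as pd

results = pd.read_csv("benchmark_results.csv")  # adjust to your output path

# Columns follow the documented output format above; drop any your run omitted.
ranked = (
    results[results["status"] == "success"]
    .sort_values("roc_auc", ascending=False)
    .loc[:, ["detector", "roc_auc", "f1_score", "train_time"]]
)
print(ranked.to_string(index=False))
```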
## Understanding the Metrics

### Primary Metric: ROC AUC
ROC AUC measures the model's ability to distinguish between classes across all thresholds. Higher is better.
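ROC AUC is computed from continuous scores (e.g. predicted break probabilities) rather than thresholded labels. A minimal scikit-learn example with made-up values:

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 = structural break, 0 = no break; y_score: predicted break probability
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.35, 0.80, 0.62, 0.20, 0.55]

# 1.0 here, because every break is scored above every non-break
print(roc_auc_score(y_true, y_score))
```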
### Secondary Metric: F1 Score

F1 balances precision and recall, which is important for imbalanced problems such as structural break detection, where breaks are much rarer than non-breaks.
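Unlike ROC AUC, F1 is computed on labels after a decision threshold has been applied. A minimal example with made-up predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]  # predictions after thresholding

# precision = 3/4, recall = 3/4, so F1 = 2 * P * R / (P + R) = 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```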
**Metric Selection:**

- Use ROC AUC for comparing overall discriminative power
- Use F1 Score when false positives are costly (trading applications)
- Use Recall when missing breaks is unacceptable
## Benchmark Results Summary
| Tier | Models | ROC AUC Range | Notes |
|---|---|---|---|
| Top | `xgb_tuned`, `weighted_ensemble`, `quad_ensemble` | 0.64-0.77 | Best robust scores |
| Good | `mlp_ensemble`, `xgb` variants | 0.58-0.68 | Stable performance |
| Moderate | KNN models, `segment_statistics` | 0.48-0.60 | Variable stability |
| Poor | Deep learning, RL models | 0.46-0.55 | Near-random performance |
## Custom Benchmarking

To benchmark only a subset of models, restrict the run to the detectors you need, as sketched below.
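The exact interface depends on how the project exposes its benchmark runner; assuming the stand-in `DETECTORS` registry from the sketches above, one approach is to filter the registry by name and reuse the same loop:

```python
# Hypothetical subset: replace with the detector names you want to compare.
SUBSET = ["xgb_tuned", "weighted_ensemble", "segment_statistics"]

selected = {name: DETECTORS[name] for name in SUBSET if name in DETECTORS}

# Run the same benchmarking loop as above, iterating over `selected`
# instead of the full DETECTORS registry.
```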