# Reinforcement Learning Models

Reinforcement learning (RL) models learn decision policies through trial and error.

## Overview
| Model | ROC AUC | F1 Score | Train Time |
|---|---|---|---|
| qlearning_bayesian_cpd | 0.5540 | 0.0000 | 80,900s |
| Q-Learning Models | 0.5488 | 0.0645 | 58s |
| DQN Model Selector | 0.5474 | 0.4211 | 1,787s |
| qlearning_memory_tabular | 0.4986 | 0.3175 | 53s |
**Near-Random Performance:** All RL models performed at near-random levels (ROC AUC ≈ 0.50). These approaches failed to learn meaningful patterns from the univariate features.
## Why RL Failed

### Core Problem: Weak State Space
RL agents learn policies from the rewards observed for state-action pairs. With only univariate features available as the state:
- State space lacks discriminative power — Features don't differentiate break/no-break well
- Reward signals are noisy — Without external context, feedback is inconsistent
- Q-values converge to uninformative policies — Agent learns to always predict one class
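A minimal sketch of how this framing might look for break detection, assuming a one-step episode per series, a binary break/no-break action, and a ±1 reward (this formulation is illustrative, not the project's exact setup):

```python
import numpy as np

# Hypothetical one-step episode for break detection (illustrative only).
# state  : summary features of a univariate series
# action : 0 = "no break", 1 = "break"
# reward : +1 if the action matches the true label, -1 otherwise
def run_episode(series: np.ndarray, label: int, policy) -> float:
    half = len(series) // 2
    state = np.array([
        series.mean(),                               # overall level
        series.std(),                                # overall spread
        series[half:].mean() - series[:half].mean()  # shift between halves
    ])
    action = policy(state)
    return 1.0 if action == label else -1.0

# Example: a policy that always predicts "no break" still earns positive
# reward on the majority class, which is how degenerate policies emerge.
reward = run_episode(np.random.randn(200), label=0, policy=lambda s: 0)
```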
### Q-Learning Limitations

Tabular Q-learning updates its action-value estimates with the standard rule:
\[
Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]
\]
With weakly discriminative states, the expected reward is nearly the same for every action in a given state, so the Q-values for different actions converge to similar values and the greedy policy carries little information.
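A minimal tabular sketch of the update above, assuming discretized feature bins as states and the same binary action space (these choices are illustrative, not taken from the repository):

```python
import numpy as np

N_STATES, N_ACTIONS = 16, 2      # illustrative sizes, not the repo's
ALPHA, GAMMA = 0.1, 0.0          # gamma = 0 for one-step episodes

Q = np.zeros((N_STATES, N_ACTIONS))

def q_update(state: int, action: int, reward: float,
             next_state: int, done: bool) -> None:
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    target = reward + (0.0 if done else GAMMA * Q[next_state].max())
    Q[state, action] += ALPHA * (target - Q[state, action])

# When a state bin mixes break and no-break series, the rewards it sees
# are close to random, so Q[s, 0] and Q[s, 1] drift toward the same value.
```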
### DQN Limitations
Even with function approximation via neural networks, the underlying state representation problem remains.
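For illustration, a minimal Q-network sketch in PyTorch (a generic MLP head, not the repository's DQN Model Selector architecture); whatever its capacity, it still consumes the same weak univariate state:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative DQN Q-network: maps state features to one Q-value per action."""

    def __init__(self, n_features: int = 8, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),   # Q(s, a) for "no break" / "break"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A larger network cannot recover information the state never contained.
q_values = QNetwork()(torch.randn(1, 8))
```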
## What Would Help
To make RL viable for structural break detection:
| Improvement | Why It Helps |
|---|---|
| Exogenous variables | Richer state representation |
| Multi-step rewards | Better credit assignment |
| Curriculum learning | Start with easier cases |
| Hybrid approaches | Combine with supervised learning |
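As a concrete example of the first row, a hypothetical state-construction helper that concatenates exogenous context onto the univariate summary features (the feature names here are illustrative, not from this project):

```python
import numpy as np

def build_state(univariate_feats: np.ndarray,
                exogenous_feats: np.ndarray) -> np.ndarray:
    """Concatenate series-level features with external context into one state vector."""
    return np.concatenate([univariate_feats, exogenous_feats])

# e.g. [mean shift, variance ratio, trend slope] + [regime one-hot]
state = build_state(np.array([0.12, 1.7, -0.03]),
                    np.array([0.0, 1.0]))
```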
## Key Findings

**RL Models Underperformed:** Tree-based models achieved higher robust scores than all RL approaches (0.715 vs. <0.48).
## Experimental Value
These models are included to demonstrate that:
- RL is not universally applicable
- State representation is critical for RL success
- Supervised learning achieved higher robust scores than RL for this task