Q-Learning Models¶
Tabular Q-learning agents with discretized state spaces.
Models¶
| Model | ROC AUC | Train Time | Notes |
|---|---|---|---|
| qlearning_rolling_stats | 0.5488 | 58s | Best pure RL model |
| qlearning_memory_tabular | 0.4986 | 53s | With action memory |
| qlearning_bayesian_cpd | 0.5540 | 80,900s | Hybrid; very slow to train |
Q-Learning Algorithm¶
Bellman Equation¶
\[
Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a') | s, a]
\]
Update Rule¶
\[
Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]
\]
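As a concrete illustration, here is a minimal tabular version of this update, assuming a `defaultdict`-backed Q-table keyed by `(state, action)` pairs. Names and structure are illustrative, not the repo's actual implementation:

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated action values, default 0.0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions=(0, 1),
             alpha=0.2, gamma=0.9):
    """One tabular Q-learning backup toward the TD target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```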
Epsilon-Greedy Policy¶
\[
a = \begin{cases}
\text{random action} & \text{with probability } \epsilon \\
\arg\max_a Q(s, a) & \text{with probability } 1-\epsilon
\end{cases}
\]
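A matching sketch of epsilon-greedy selection over the same illustrative `Q` table:

```python
import random

def select_action(state, actions=(0, 1), epsilon=0.15):
    """Explore with probability epsilon; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```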
State Representation¶
qlearning_rolling_stats¶
```python
# Discretize first 10 features into 5 bins each; add memory of last 3 actions.
state = (tuple(discretized_features[:10]), tuple(recent_actions[-3:]))
```
- Features discretized into 5 bins
- Memory of last 3 actions included
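A minimal sketch of building such a state key with quantile binning. `X_train`, `x`, and `recent_actions` are placeholders, and the exact binning scheme used in the repo may differ:

```python
import numpy as np

def discretize(features, bin_edges):
    """Map each continuous feature to a bin index (0..n_bins-1) via its edges."""
    return tuple(int(np.digitize(v, edges)) for v, edges in zip(features, bin_edges))

# Fit 5-bin quantile edges per feature on the training split (first 10 features).
bin_edges = [np.quantile(X_train[:, j], [0.2, 0.4, 0.6, 0.8]) for j in range(10)]

# State = discretized features plus the last 3 actions taken.
state = (discretize(x[:10], bin_edges), tuple(recent_actions[-3:]))
```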
Action Space¶
Binary: action `1` predicts a break at the current step, action `0` predicts no break.
Reward Structure¶
```python
def reward(action: int, true_label: int) -> float:
    if action == true_label:
        return 1.0   # Correct prediction (either class)
    if action == 1 and true_label == 0:
        return -0.5  # False positive
    return -1.0      # Missed break (false negative)
```
Hyperparameters¶
```python
alpha = 0.2       # Learning rate
gamma = 0.9       # Discount factor
epsilon = 0.15    # Exploration rate
n_bins = 5        # Discretization bins
memory_steps = 3  # Action history length
```
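To show how these settings fit together, here is a hedged sketch of one training pass over time-ordered samples, reusing the illustrative `discretize`, `select_action`, `reward`, and `q_update` helpers from above. `X_train` and `y_train` are placeholders; the repo's actual training loop may differ:

```python
from collections import deque

recent_actions = deque([0] * memory_steps, maxlen=memory_steps)

for t in range(len(X_train) - 1):
    state = (discretize(X_train[t, :10], bin_edges), tuple(recent_actions))
    action = select_action(state, epsilon=epsilon)
    r = reward(action, int(y_train[t]))

    # The action memory is part of the state, so append before building s'.
    recent_actions.append(action)
    next_state = (discretize(X_train[t + 1, :10], bin_edges), tuple(recent_actions))

    q_update(state, action, r, next_state, alpha=alpha, gamma=gamma)
```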
Probability Conversion¶
Convert Q-values to probabilities via softmax:
\[
P(\text{break}) = \frac{\exp(Q(s, 1))}{\exp(Q(s, 0)) + \exp(Q(s, 1)) + \epsilon}
\]
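A small sketch of this conversion, assuming the ε in the denominator is a small smoothing constant rather than the exploration rate (the exact value used in the repo is not documented here):

```python
import math

def break_probability(state, eps=1e-6):
    """Two-action softmax over Q-values with a small smoothing term."""
    e0 = math.exp(Q[(state, 0)])
    e1 = math.exp(Q[(state, 1)])
    return e1 / (e0 + e1 + eps)
```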
Why These Models Failed¶
- Discretization loses information — Binning continuous features reduces discriminative power
- Sparse state space — Many states never visited during training
- Reward noise — without richer context, the same discretized state can precede either label, so per-step rewards are inconsistent
Usage¶
```bash
cd qlearning_rolling_stats
python main.py --mode train --data-dir /path/to/data --model-path ./model.pkl
```
Near-Random Performance
These models achieved near-random AUC (~0.50) and are included for research comparison purposes only.