Q-Learning Models¶
Tabular Q-learning agents with discretized state spaces.
Models¶
| Model | ROC AUC | Train Time | Notes |
|---|---|---|---|
| qlearning_rolling_stats | 0.5488 | 58s | Best pure RL model |
| qlearning_memory_tabular | 0.4986 | 53s | With action memory |
| qlearning_bayesian_cpd | 0.5540 | 80,900s | Hybrid; very slow to train |
Q-Learning Algorithm¶
Bellman Equation¶
\[
Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a') | s, a]
\]
Update Rule¶
\[
Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]
\]
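As a concrete illustration, here is a minimal tabular version of this update, assuming a `defaultdict`-backed Q-table keyed by `(state, action)` pairs. Names and structure are illustrative, not the repo's actual implementation:

```python
from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated action values, default 0.0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions=(0, 1),
             alpha=0.2, gamma=0.9):
    """One tabular Q-learning backup toward the TD target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```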
Epsilon-Greedy Policy¶
\[
a = \begin{cases}
\text{random action} & \text{with probability } \epsilon \\
\arg\max_a Q(s, a) & \text{with probability } 1-\epsilon
\end{cases}
\]
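A matching sketch of epsilon-greedy selection over the same illustrative `Q` table:

```python
import random

def select_action(state, actions=(0, 1), epsilon=0.15):
    """Explore with probability epsilon; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```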
State Representation¶
qlearning_rolling_stats¶
```python
# Discretize first 10 features into 5 bins each; add memory of last 3 actions.
state = (tuple(discretized_features[:10]), tuple(recent_actions[-3:]))
```
- Features discretized into 5 bins
- Memory of last 3 actions included
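A minimal sketch of building such a state key with quantile binning. `X_train`, `x`, and `recent_actions` are placeholders, and the exact binning scheme used in the repo may differ:

```python
import numpy as np

def discretize(features, bin_edges):
    """Map each continuous feature to a bin index (0..n_bins-1) via its edges."""
    return tuple(int(np.digitize(v, edges)) for v, edges in zip(features, bin_edges))

# Fit 5-bin quantile edges per feature on the training split (first 10 features).
bin_edges = [np.quantile(X_train[:, j], [0.2, 0.4, 0.6, 0.8]) for j in range(10)]

# State = discretized features plus the last 3 actions taken.
state = (discretize(x[:10], bin_edges), tuple(recent_actions[-3:]))
```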
Action Space¶
Binary: action `1` predicts a break at the current step, action `0` predicts no break.
Reward Structure¶
```python
def reward(action: int, true_label: int) -> float:
    if action == true_label:
        return 1.0   # Correct prediction (either class)
    if action == 1 and true_label == 0:
        return -0.5  # False positive
    return -1.0      # Missed break (false negative)
```
Hyperparameters¶
```python
alpha = 0.2       # Learning rate
gamma = 0.9       # Discount factor
epsilon = 0.15    # Exploration rate
n_bins = 5        # Discretization bins
memory_steps = 3  # Action history length
```
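To show how these settings fit together, here is a hedged sketch of one training pass over time-ordered samples, reusing the illustrative `discretize`, `select_action`, `reward`, and `q_update` helpers from above. `X_train` and `y_train` are placeholders; the repo's actual training loop may differ:

```python
from collections import deque

recent_actions = deque([0] * memory_steps, maxlen=memory_steps)

for t in range(len(X_train) - 1):
    state = (discretize(X_train[t, :10], bin_edges), tuple(recent_actions))
    action = select_action(state, epsilon=epsilon)
    r = reward(action, int(y_train[t]))

    # The action memory is part of the state, so append before building s'.
    recent_actions.append(action)
    next_state = (discretize(X_train[t + 1, :10], bin_edges), tuple(recent_actions))

    q_update(state, action, r, next_state, alpha=alpha, gamma=gamma)
```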
Probability Conversion¶
Convert Q-values to probabilities via softmax:
\[
P(\text{break}) = \frac{\exp(Q(s, 1))}{\exp(Q(s, 0)) + \exp(Q(s, 1)) + \epsilon}
\]
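A small sketch of this conversion, assuming the ε in the denominator is a small smoothing constant rather than the exploration rate (the exact value used in the repo is not documented here):

```python
import math

def break_probability(state, eps=1e-6):
    """Two-action softmax over Q-values with a small smoothing term."""
    e0 = math.exp(Q[(state, 0)])
    e1 = math.exp(Q[(state, 1)])
    return e1 / (e0 + e1 + eps)
```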
Why These Models Failed¶
- Discretization loses information — Binning continuous features reduces discriminative power
- Sparse state space — Many states never visited during training
- Reward noise — without richer context, the same discretized state can precede either label, so per-step rewards are inconsistent
Usage¶
```bash
cd qlearning_rolling_stats
python main.py --mode train --data-dir /path/to/data --model-path ./model.pkl
```
Near-Random Performance
These models achieved near-random AUC (~0.50) and are included for research comparison purposes only.