
Q-Learning Models

Tabular Q-learning agents with discretized state spaces.

Models

Model                    | ROC AUC | Train Time | Notes
qlearning_rolling_stats  | 0.5488  | 58s        | Best RL
qlearning_memory_tabular | 0.4986  | 53s        | With action memory
qlearning_bayesian_cpd   | 0.5540  | 80,900s    | Hybrid, very slow

Q-Learning Algorithm

Bellman Equation

\[ Q^*(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s', a') | s, a] \]

Update Rule

\[ Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma \max_{a'} Q(s', a') - Q(s, a)] \]
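
As a concrete illustration, here is a minimal Python sketch of this update for a tabular agent. The defaultdict layout and the name q_update are assumptions for illustration, not necessarily how the repository implements it; alpha and gamma default to the hyperparameter values listed below.

from collections import defaultdict

# Q-table: maps (state, action) -> value; unseen entries default to 0.0
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions=(0, 1),
             alpha=0.2, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])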

Epsilon-Greedy Policy

\[ a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s, a) & \text{with probability } 1-\epsilon \end{cases} \]
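
A matching sketch of the action selection, reusing the Q table from the update sketch above (the helper name and tie-breaking behavior are assumptions):

import random

def select_action(state, actions=(0, 1), epsilon=0.15):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                  # Explore: uniform random action
    return max(actions, key=lambda a: Q[(state, a)])   # Exploit: argmax_a Q(s, a)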

State Representation

qlearning_rolling_stats

# Discretize first 10 features into 5 bins each
State = (discretized_features[0:10], recent_actions[-3:])
  • Features discretized into 5 bins
  • Memory of last 3 actions included
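
The snippet above is pseudocode. A minimal runnable sketch of the same state construction follows; the use of np.digitize and precomputed bin edges is an assumption, since the document does not specify how the bins are fitted.

import numpy as np

def make_state(features, recent_actions, bin_edges):
    """Hashable state: 10 discretized feature values plus the last 3 actions.

    bin_edges is assumed to be a (10, 4) array of interior edges splitting
    each feature into 5 bins; the repository's actual binning may differ.
    """
    bins = tuple(int(np.digitize(features[i], bin_edges[i])) for i in range(10))
    return bins + tuple(recent_actions[-3:])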

Action Space

A = {0: no_break, 1: break}

Reward Structure

R(action, true_label) = {
    +1.0  if action == true_label  # Correct
    -0.5  if action == 1 and true_label == 0  # False positive
    -1.0  if action == 0 and true_label == 1  # Missed break
}
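
The same reward as a runnable function (a direct transcription of the definition above):

def reward(action, true_label):
    """Asymmetric reward: a missed break is penalized more than a false positive."""
    if action == true_label:
        return 1.0   # Correct prediction (either class)
    if action == 1 and true_label == 0:
        return -0.5  # False positive
    return -1.0      # Missed break: action == 0, true_label == 1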

Hyperparameters

alpha = 0.2      # Learning rate
gamma = 0.9      # Discount factor
epsilon = 0.15   # Exploration rate
n_bins = 5       # Discretization bins
memory_steps = 3 # Action history length
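
To show how these hyperparameters fit together, here is a rough single-epoch training loop built from the helpers sketched in the earlier sections (make_state, select_action, reward, q_update). The row-by-row episode structure and the X, y inputs are assumptions; the repository's actual data handling is not described in this document.

def train_epoch(X, y, bin_edges, memory_steps=3):
    """One pass over labeled samples, treating each row as an environment step."""
    recent_actions = [0] * memory_steps   # Action memory starts as all no_break
    prev = None                           # (state, action, reward) from the last step
    for features, label in zip(X, y):
        state = make_state(features, recent_actions, bin_edges)
        if prev is not None:
            # Update the previous transition now that its next state is known
            q_update(*prev, next_state=state, alpha=0.2, gamma=0.9)
        action = select_action(state, epsilon=0.15)
        prev = (state, action, reward(action, label))
        recent_actions = recent_actions[1:] + [action]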

Probability Conversion

Convert Q-values to probabilities via softmax:

\[ P(\text{break}) = \frac{\exp(Q(s, 1))}{\exp(Q(s, 0)) + \exp(Q(s, 1)) + \epsilon} \]
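
A sketch of the conversion, reusing the Q table from the earlier sketches. The epsilon in the denominator is taken here to be a small stabilizing constant; the document reuses the symbol without stating its value, so this is an assumption.

import numpy as np

def break_probability(state, eps=1e-8):
    """Two-action softmax over Q-values, returning P(break)."""
    e0 = np.exp(Q[(state, 0)])   # exp(Q(s, no_break))
    e1 = np.exp(Q[(state, 1)])   # exp(Q(s, break))
    return float(e1 / (e0 + e1 + eps))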

Why These Models Failed

  1. Discretization loses information — Binning continuous features reduces discriminative power
  2. Sparse state space — With 10 features at 5 bins each (plus the 3-step action memory), the table has 5^10 × 2^3 ≈ 78 million possible states, so most states are never visited during training
  3. Reward noise — Without external context, rewards are inconsistent

Usage

cd qlearning_rolling_stats
python main.py --mode train --data-dir /path/to/data --model-path ./model.pkl

Near-Random Performance

These models achieved near-random AUC (~0.50) and are included for research comparison purposes only.