Hierarchical Transformer

Transformer-based architecture with hierarchical reasoning for structural break detection.

Performance

| Metric     | Value   | Rank      |
|------------|---------|-----------|
| ROC AUC    | 0.5439  | 19th      |
| F1 Score   | 0.0000  | Failed    |
| Accuracy   | 0.7030  | 10th      |
| Recall     | 0.0000  | Failed    |
| Train Time | 5,370 s | Very Slow |

Model Failed

This model predicted all zeros despite 90 minutes of training.

Architecture

flowchart TD
    A["📊 Input Features"] --> B["🔄 Input Projection<br/>Linear + Positional Encoding"]

    B --> C["🔁 Hierarchical Processing<br/>(2 cycles)"]

    C --> D1["⬇️ Low-Level<br/>Module"]
    C --> D2["⬆️ High-Level<br/>Module"]
    C --> D3["⬇️ Low-Level<br/>Module"]

    D2 <--> D1
    D2 <--> D3

    D2 --> E["🎯 Classifier<br/>Linear → GELU → Dropout → Linear(2)"]
    E --> F["📈 Output"]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style E fill:#fff3e0
    style F fill:#e8f5e9
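
For concreteness, the diagram can be read as the following PyTorch-style sketch. It is a minimal reconstruction assuming standard nn.TransformerEncoder stacks, a single shared width instead of separate h_dim/l_dim, and mean pooling before the classifier; class and argument names are illustrative, not the project's actual code.

# Minimal PyTorch sketch of the diagram above; names are illustrative.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding added after the input projection."""
    def __init__(self, dim, max_len=200):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))        # (1, max_len, dim)

    def forward(self, x):                                  # x: (batch, seq, dim)
        return x + self.pe[:, : x.size(1)]

class HierarchicalTransformer(nn.Module):
    """Input projection -> repeated low-level / high-level passes -> classifier."""
    def __init__(self, in_dim=1, dim=64, heads=4, l_layers=1, h_layers=2,
                 cycles=2, dropout=0.1, max_seq_len=200, num_classes=2):
        super().__init__()
        self.input_proj = nn.Linear(in_dim, dim)
        self.pos_enc = PositionalEncoding(dim, max_seq_len)
        make_stack = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dropout=dropout, batch_first=True),
            num_layers=n)
        self.low = make_stack(l_layers)      # low-level module (fine-grained pass)
        self.high = make_stack(h_layers)     # high-level module (abstract pass)
        self.cycles = cycles                 # hierarchical processing: 2 cycles
        self.classifier = nn.Sequential(     # Linear -> GELU -> Dropout -> Linear(2)
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(dropout), nn.Linear(dim, num_classes))

    def forward(self, x):                    # x: (batch, seq, in_dim)
        z = self.pos_enc(self.input_proj(x))
        for _ in range(self.cycles):
            z = self.low(z)                  # low-level refines token representations
            z = self.high(z)                 # high-level re-contextualises them
        return self.classifier(z.mean(dim=1))  # pool over time -> 2-class logits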

Self-Attention Mechanism

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
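
Read directly as code, the formula is a few lines of PyTorch (a standalone illustration, not taken from the project):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)                                  # key dimension
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # QK^T / sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V              # softmax(.) V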

Multi-Head Attention

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O \]
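
A self-contained multi-head version of the same idea, again illustrative rather than the project's implementation; the defaults match embedding_dim = 64 and num_heads = 4 from the hyperparameters below:

import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.w_q = nn.Linear(dim, dim)                 # projections producing Q, K, V
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)                 # output projection W^O

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, _ = x.shape
        split = lambda t: t.view(b, s, self.heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # per-head QK^T / sqrt(d_k)
        heads = F.softmax(scores, dim=-1) @ v                 # head_1 ... head_h
        return self.w_o(heads.transpose(1, 2).reshape(b, s, -1))  # Concat(heads) W^O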

Hyperparameters

embedding_dim = 64  # Width of the projected input embeddings
h_dim = 64          # High-level hidden dimension
l_dim = 32          # Low-level hidden dimension
num_heads = 4       # Attention heads per layer
h_layers = 2        # High-level transformer layers
l_layers = 1        # Low-level transformer layers
dropout = 0.1       # Dropout probability
max_seq_len = 200   # Maximum input sequence length
learning_rate = 1e-4
epochs = 50
batch_size = 32
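
Wired into a standard training setup, these values look roughly like the sketch below. This assumes a plain Adam + cross-entropy loop; the model class is the sketch from the Architecture section, and the data here is random stand-in data, not the competition dataset.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 256 univariate sequences of length max_seq_len = 200.
x = torch.randn(256, 200, 1)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = HierarchicalTransformer()                         # sketch from the Architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                         # unweighted; see "Class Imbalance Sensitivity"

for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()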

Why It Failed

1. Data Hunger

Transformers require large datasets to learn meaningful attention patterns. Our dataset was insufficient.

2. Univariate Input

Self-attention is most powerful when it can learn relationships between different input variables. With univariate features, there is little cross-variable structure for it to exploit.

3. Class Imbalance Sensitivity

Without proper handling, cross-entropy loss pushed all predictions toward the majority class (no break).
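
The usual mitigation (also listed under "What Would Help" below) is to weight the loss by inverse class frequency. A minimal sketch, with class counts made up to match the roughly 70/30 split implied by the all-zeros accuracy above:

import torch
import torch.nn as nn

counts = torch.tensor([703.0, 297.0])            # no-break vs. break (illustrative counts)
weights = counts.sum() / (2 * counts)            # inverse-frequency class weights
criterion = nn.CrossEntropyLoss(weight=weights)  # minority-class errors now cost more

logits = torch.randn(32, 2)                      # a batch of 2-class logits from the classifier
labels = torch.randint(0, 2, (32,))
loss = criterion(logits, labels)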

4. Isolation Penalty

Transformer-only architectures tend to underperform on numerical prediction tasks unless combined with other model families.

What Would Help

  1. Exogenous variables: Multi-dimensional input for attention to learn from
  2. Hybrid architecture: Combine with tree-based models
  3. Class-weighted loss: Address imbalance explicitly
  4. More data: Transformers need large training sets

Usage

cd hierarchical_transformer
python main.py --mode train --data-dir /path/to/data --model-path ./model.pt

Near-Random Performance

This model achieved a near-random ROC AUC (0.5439; see the table above) despite roughly 90 minutes of training time. It is included for research comparison.