Hierarchical Transformer

Transformer-based architecture with hierarchical reasoning for structural break detection.

Performance

| Metric     | Value   | Rank      |
|------------|---------|-----------|
| ROC AUC    | 0.5439  | 19th      |
| F1 Score   | 0.0000  | Failed    |
| Accuracy   | 0.7030  | 10th      |
| Recall     | 0.0000  | Failed    |
| Train Time | 5,370 s | Very Slow |

Model Failed

This model predicted all zeros despite 90 minutes of training.

Architecture

flowchart TD
    A["📊 Input Features"] --> B["🔄 Input Projection<br/>Linear + Positional Encoding"]

    B --> C["🔁 Hierarchical Processing<br/>(2 cycles)"]

    C --> D1["⬇️ Low-Level<br/>Module"]
    C --> D2["⬆️ High-Level<br/>Module"]
    C --> D3["⬇️ Low-Level<br/>Module"]

    D2 <--> D1
    D2 <--> D3

    D2 --> E["🎯 Classifier<br/>Linear → GELU → Dropout → Linear(2)"]
    E --> F["📈 Output"]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style E fill:#fff3e0
    style F fill:#e8f5e9
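
For concreteness, the diagram can be read as the following PyTorch-style sketch. It is a minimal reconstruction assuming standard nn.TransformerEncoder stacks, a single shared width instead of separate h_dim/l_dim, and mean pooling before the classifier; class and argument names are illustrative, not the project's actual code.

# Minimal PyTorch sketch of the diagram above; names are illustrative.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding added after the input projection."""
    def __init__(self, dim, max_len=200):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))        # (1, max_len, dim)

    def forward(self, x):                                  # x: (batch, seq, dim)
        return x + self.pe[:, : x.size(1)]

class HierarchicalTransformer(nn.Module):
    """Input projection -> repeated low-level / high-level passes -> classifier."""
    def __init__(self, in_dim=1, dim=64, heads=4, l_layers=1, h_layers=2,
                 cycles=2, dropout=0.1, max_seq_len=200, num_classes=2):
        super().__init__()
        self.input_proj = nn.Linear(in_dim, dim)
        self.pos_enc = PositionalEncoding(dim, max_seq_len)
        make_stack = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dropout=dropout, batch_first=True),
            num_layers=n)
        self.low = make_stack(l_layers)      # low-level module (fine-grained pass)
        self.high = make_stack(h_layers)     # high-level module (abstract pass)
        self.cycles = cycles                 # hierarchical processing: 2 cycles
        self.classifier = nn.Sequential(     # Linear -> GELU -> Dropout -> Linear(2)
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(dropout), nn.Linear(dim, num_classes))

    def forward(self, x):                    # x: (batch, seq, in_dim)
        z = self.pos_enc(self.input_proj(x))
        for _ in range(self.cycles):
            z = self.low(z)                  # low-level refines token representations
            z = self.high(z)                 # high-level re-contextualises them
        return self.classifier(z.mean(dim=1))  # pool over time -> 2-class logits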

Self-Attention Mechanism

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
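
Read directly as code, the formula is a few lines of PyTorch (a standalone illustration, not taken from the project):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)                                  # key dimension
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # QK^T / sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V              # softmax(.) V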

Multi-Head Attention

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O \]
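
A self-contained multi-head version of the same idea, again illustrative rather than the project's implementation; the defaults match embedding_dim = 64 and num_heads = 4 from the hyperparameters below:

import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.w_q = nn.Linear(dim, dim)                 # projections producing Q, K, V
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)                 # output projection W^O

    def forward(self, x):                              # x: (batch, seq, dim)
        b, s, _ = x.shape
        split = lambda t: t.view(b, s, self.heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5    # per-head QK^T / sqrt(d_k)
        heads = F.softmax(scores, dim=-1) @ v                 # head_1 ... head_h
        return self.w_o(heads.transpose(1, 2).reshape(b, s, -1))  # Concat(heads) W^O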

Hyperparameters

embedding_dim = 64  # Width of the projected input embeddings
h_dim = 64          # High-level hidden dimension
l_dim = 32          # Low-level hidden dimension
num_heads = 4       # Attention heads per layer
h_layers = 2        # High-level transformer layers
l_layers = 1        # Low-level transformer layers
dropout = 0.1       # Dropout probability
max_seq_len = 200   # Maximum input sequence length
learning_rate = 1e-4
epochs = 50
batch_size = 32
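
Wired into a standard training setup, these values look roughly like the sketch below. This assumes a plain Adam + cross-entropy loop; the model class is the sketch from the Architecture section, and the data here is random stand-in data, not the competition dataset.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 256 univariate sequences of length max_seq_len = 200.
x = torch.randn(256, 200, 1)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = HierarchicalTransformer()                         # sketch from the Architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                         # unweighted; see "Class Imbalance Sensitivity"

for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()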

Why It Failed

1. Data Hunger

Transformers require large datasets to learn meaningful attention patterns. Our dataset was insufficient.

2. Univariate Input

Self-attention is most powerful when it can learn relationships between different input variables. With univariate features, there is little cross-variable structure for it to exploit.

3. Class Imbalance Sensitivity

Without proper handling, cross-entropy loss pushed all predictions toward the majority class (no break).
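
The usual mitigation (also listed under "What Would Help" below) is to weight the loss by inverse class frequency. A minimal sketch, with class counts made up to match the roughly 70/30 split implied by the all-zeros accuracy above:

import torch
import torch.nn as nn

counts = torch.tensor([703.0, 297.0])            # no-break vs. break (illustrative counts)
weights = counts.sum() / (2 * counts)            # inverse-frequency class weights
criterion = nn.CrossEntropyLoss(weight=weights)  # minority-class errors now cost more

logits = torch.randn(32, 2)                      # a batch of 2-class logits from the classifier
labels = torch.randint(0, 2, (32,))
loss = criterion(logits, labels)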

4. Isolation Penalty

Transformer-only architectures tend to underperform on numerical prediction tasks unless combined with other model families.

What Would Help

  1. Exogenous variables: Multi-dimensional input for attention to learn from
  2. Hybrid architecture: Combine with tree-based models
  3. Class-weighted loss: Address imbalance explicitly
  4. More data: Transformers need large training sets

Usage

cd hierarchical_transformer
python main.py --mode train --data-dir /path/to/data --model-path ./model.pt

Near-Random Performance

This model achieved a near-random ROC AUC (0.5439; see the table above) despite roughly 90 minutes of training time. It is included for research comparison.