# Hierarchical Transformer

Transformer-based architecture with hierarchical reasoning for structural break detection.
## Performance

| Metric | Value | Rank / Notes |
|---|---|---|
| ROC AUC | 0.5439 | 19th |
| F1 Score | 0.0000 | Failed |
| Accuracy | 0.7030 | 10th |
| Recall | 0.0000 | Failed |
| Train Time | 5,370 s (~90 min) | Very Slow |
> **Model Failed:** This model predicted all zeros despite 90 minutes of training.
## Architecture

```mermaid
flowchart TD
    A["📊 Input Features"] --> B["🔄 Input Projection<br/>Linear + Positional Encoding"]
    B --> C["🔁 Hierarchical Processing<br/>(2 cycles)"]
    C --> D1["⬇️ Low-Level<br/>Module"]
    C --> D2["⬆️ High-Level<br/>Module"]
    C --> D3["⬇️ Low-Level<br/>Module"]
    D2 <--> D1
    D2 <--> D3
    D2 --> E["🎯 Classifier<br/>Linear → GELU → Dropout → Linear(2)"]
    E --> F["📈 Output"]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style E fill:#fff3e0
    style F fill:#e8f5e9
```
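
The sketch below is a hypothetical reconstruction of the block diagram, not the project's actual code: it uses stock `nn.TransformerEncoder` modules for the low- and high-level paths, keeps both at the shared `embedding_dim` for brevity (the real model uses separate `h_dim`/`l_dim` widths), and mean-pools the high-level state before the classifier head. The class name `HierarchicalBlock` is illustrative.

```python
import torch
import torch.nn as nn


class HierarchicalBlock(nn.Module):
    """Hypothetical reconstruction of the hierarchical processing block above."""

    def __init__(self, embedding_dim=64, num_heads=4, h_layers=2, l_layers=1,
                 dropout=0.1, num_cycles=2):
        super().__init__()
        self.num_cycles = num_cycles
        low_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, dropout=dropout, batch_first=True)
        high_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, dropout=dropout, batch_first=True)
        self.low_module = nn.TransformerEncoder(low_layer, num_layers=l_layers)
        self.high_module = nn.TransformerEncoder(high_layer, num_layers=h_layers)
        # Classifier head from the diagram: Linear -> GELU -> Dropout -> Linear(2).
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embedding_dim, 2),
        )

    def forward(self, x):                        # x: (batch, seq_len, embedding_dim)
        low_state, high_state = x, x
        for _ in range(self.num_cycles):         # alternate low- and high-level passes
            low_state = self.low_module(low_state + high_state)
            high_state = self.high_module(high_state + low_state)
        pooled = high_state.mean(dim=1)          # pool over the sequence dimension
        return self.classifier(pooled)           # logits: (batch, 2)


if __name__ == "__main__":
    x = torch.randn(32, 200, 64)                 # (batch_size, max_seq_len, embedding_dim)
    print(HierarchicalBlock()(x).shape)          # torch.Size([32, 2])
```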
## Self-Attention Mechanism

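Standard transformer encoder layers, as used in both modules, compute scaled dot-product self-attention over the projected input sequence:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$, and $V$ are linear projections of the same sequence (hence *self*-attention) and $d_k$ is the per-head key dimension.
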
### Multi-Head Attention

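With `num_heads = 4` and `embedding_dim = 64`, each head attends over a 16-dimensional subspace, and the head outputs are concatenated and linearly re-projected. A minimal sketch using PyTorch's built-in `nn.MultiheadAttention`, assuming the model relies on the standard module rather than a custom implementation:

```python
import torch
import torch.nn as nn

embedding_dim, num_heads = 64, 4                # values from the hyperparameter list below
attn = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads,
                             dropout=0.1, batch_first=True)

x = torch.randn(32, 200, embedding_dim)         # (batch_size, max_seq_len, embedding_dim)
out, weights = attn(x, x, x)                    # self-attention: query = key = value = x
print(out.shape)                                # torch.Size([32, 200, 64])
print(weights.shape)                            # torch.Size([32, 200, 200]), averaged over heads
```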
## Hyperparameters

```python
embedding_dim = 64
h_dim = 64           # High-level hidden dimension
l_dim = 32           # Low-level hidden dimension
num_heads = 4
h_layers = 2         # High-level transformer layers
l_layers = 1         # Low-level transformer layers
dropout = 0.1
max_seq_len = 200
learning_rate = 1e-4
epochs = 50
batch_size = 32
```
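
These sizes also determine the input stage from the diagram ("Linear + Positional Encoding"). The sketch below is one self-contained possibility, assuming a fixed sinusoidal positional encoding; the actual encoding scheme could be learned instead.

```python
import math

import torch
import torch.nn as nn

# Hypothetical input stage: linear projection of the univariate series plus a
# sinusoidal positional encoding. The real model may differ in both respects.
embedding_dim, max_seq_len, batch_size = 64, 200, 32

projection = nn.Linear(1, embedding_dim)        # univariate value -> 64-d embedding

position = torch.arange(max_seq_len).unsqueeze(1)                    # (200, 1)
div_term = torch.exp(torch.arange(0, embedding_dim, 2)
                     * (-math.log(10000.0) / embedding_dim))         # (32,)
pe = torch.zeros(max_seq_len, embedding_dim)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

x = torch.randn(batch_size, max_seq_len, 1)     # one univariate series per sample
embedded = projection(x) + pe.unsqueeze(0)      # (32, 200, 64), ready for the encoder
print(embedded.shape)
```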
## Why It Failed

### 1. Data Hunger

Transformers require large datasets to learn meaningful attention patterns; our training set was too small for the attention layers to find useful structure.
### 2. Univariate Input

Self-attention is most powerful when learning relationships between different input variables. With univariate features, there's limited cross-variable learning opportunity.
### 3. Class Imbalance Sensitivity

Without proper handling, cross-entropy loss pushed all predictions toward the majority class (no break).
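
One standard mitigation (not applied in this run) is class-weighted cross-entropy, so the minority "break" class contributes more to the gradient. A minimal sketch; the ~70/30 class split is inferred from the 0.7030 accuracy of the all-zeros predictor, not from documented class counts:

```python
import torch
import torch.nn as nn

class_freqs = torch.tensor([0.703, 0.297])      # [no break, break], inferred from accuracy
weights = 1.0 / class_freqs                     # up-weight the rare "break" class
weights = weights / weights.sum()               # ≈ [0.297, 0.703] after normalization

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(32, 2, requires_grad=True) # (batch_size, num_classes)
targets = torch.randint(0, 2, (32,))
loss = criterion(logits, targets)
loss.backward()
print(weights, loss.item())
```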
### 4. Isolation Penalty

Pure transformer architectures tend to underperform on numerical prediction tasks like this one when not paired with complementary model families such as tree-based ensembles.
## What Would Help

- Exogenous variables: Multi-dimensional input for attention to learn from
- Hybrid architecture: Combine with tree-based models
- Class-weighted loss: Address imbalance explicitly
- More data: Transformers need large training sets
## Usage

```bash
cd hierarchical_transformer
python main.py --mode train --data-dir /path/to/data --model-path ./model.pt
```
> **Near-Random Performance:** This model achieved a near-random ROC AUC (0.54, see the table above) despite roughly 90 minutes of training time. It is included for research comparison.