Transformer++
People have tweaked the Transformer architecture enough in 7 years that we’re apparently now calling the current recipe “Transformer++”.
The changes between Vaswani et al 2017 and Radford et al 2019 are well-known (see Figure 3 here): removing the encoder, clipping gradients, not using dropout, and changing the activation function to a smoother, input-weighted one (ReLU -> GELU). So I focus on post-GPT-2 tweaks.
I also focus on architecture changes rather than data changes (curation or generation), training changes (infrastructure and hyperparameters), post-training, or inference optimisations. (For instance, the various kinds of data and model parallelism across multiple devices arose largely after 2019, but those count as infrastructure rather than architecture.) I won’t get into multimodal architectures.
Inclusion criterion: at least three strong open 2024 architectures use the tweak (out of LLaMA 3, Gemma 2, Qwen2.5, DeepSeek-V2, Hunyuan-Large).
(The term “Transformer++” was coined in this sense by Gu and Dao 2023, who also noted several of the main tweaks.)
The Transformer++
Let the “Transformer++” be a Transformer with
- A fused attention implementation (naive scaled dot-product attention -> FlashAttention). Linear rather than quadratic memory in input sequence length, since the full attention matrix is never materialised. Practically: can double GPU utilization and so halve training time. Also enables longer contexts and speeds up inference on long-context input. (Usage sketch after this list.)
- Rotary position embedding (sinusoidal -> learned absolute position embeddings -> RoPE). (Sketch after this list.)
- Removing attention’s redundant key and value heads by sharing each across a group of query heads (vanilla MHA -> MQA -> GQA). (Sketch after this list.)
- Regularized / preconditioned optimizer (Adam -> AdamW -> SOAP)
- Normalise before each layer rather than after (post-LayerNorm -> pre-LayerNorm). (See the block sketch after this list.)
- When doing layer normalization: just rescale, don’t centre (LayerNorm -> RMSNorm)
- Divine activation function for the MLP (GeLU -> … -> SwiGLU or GeGLU)
- Tied embeddings (share the input embedding matrix with the output unembedding). An oldie but goodie.
- Fix logit drift by normalising queries and keys before the dot product (QK normalization). (Sketch after this list.)
- Fix that one softmax off-by-one (fixed in some places around 2021). (Sketch after this list.)
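A minimal sketch of the fused-attention point, assuming PyTorch ≥ 2.0 and a CUDA device: `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused FlashAttention-style kernel when the dtype, mask, and hardware allow it, so the full score matrix never hits main GPU memory. Shapes below are purely illustrative.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim); half precision on GPU is what lets the fused kernel kick in
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)

# Fused causal attention: the (seq, seq) score matrix is never materialised,
# so extra memory is linear rather than quadratic in sequence length.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```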
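A minimal sketch of RoPE in the interleaved-pairs convention (implementations differ in how they pair channels and cache the cos/sin tables, so treat this as one possible layout rather than any particular model’s code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (batch, heads, seq, head_dim) by
    position-dependent angles; applied to queries and keys only, never values."""
    _, _, seq, dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=x.device) / dim))
    angles = torch.arange(seq, dtype=torch.float32, device=x.device)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()          # each (seq, dim/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # the two halves of each rotation pair
    rotated = torch.stack((x1 * cos - x2 * sin,    # standard 2-D rotation per pair
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2).to(x.dtype)         # interleave back to (..., seq, dim)
```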
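A minimal sketch of grouped-query attention: the cache stores only a few KV heads, and each is repeated across its group of query heads at attention time (the head counts are made up for illustration):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, kv_heads, seq, head_dim) -> (batch, kv_heads * n_rep, seq, head_dim).
    # MQA is the special case kv_heads == 1; vanilla MHA is n_rep == 1.
    return kv.repeat_interleave(n_rep, dim=1)

n_q_heads, n_kv_heads = 32, 8                  # illustrative 4:1 grouping
k = torch.randn(2, n_kv_heads, 1024, 128)      # the KV cache only ever stores the 8 heads
v = torch.randn(2, n_kv_heads, 1024, 128)
k = repeat_kv(k, n_q_heads // n_kv_heads)      # now matches the 32 query heads
v = repeat_kv(v, n_q_heads // n_kv_heads)
```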
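Pre-normalisation, RMSNorm, and the SwiGLU MLP fit together in one residual block. A minimal sketch, with the attention module left abstract and sizes as placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale only: no mean-centering and no bias, unlike LayerNorm (Zhang & Sennrich 2019)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated MLP: silu(x W_gate) * (x W_up), projected back down; no biases."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-LN residual block: normalise the input to each sub-layer, not its output."""
    def __init__(self, dim: int, hidden: int, attn: nn.Module):
        super().__init__()
        self.attn_norm, self.mlp_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.mlp = attn, SwiGLU(dim, hidden)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        return x + self.mlp(self.mlp_norm(x))
```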
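A minimal sketch of the two logit fixes: QK normalisation rescales queries and keys just before the dot product (ViT-22B uses LayerNorm; the RMS variant here is my assumption, for brevity), and Miller’s softmax-plus-one adds 1 to the denominator so a head can attend to nothing:

```python
import torch

def rms_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Applied to q and k per head, right before q @ k.transpose(-2, -1),
    # so attention logits can't drift arbitrarily large during training.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def softmax_plus_one(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # exp(x_i) / (1 + sum_j exp(x_j)): the outputs no longer have to sum to 1,
    # so a head can output near-zero attention everywhere.
    m = logits.max(dim=dim, keepdim=True).values.clamp(min=0.0)  # numerical stabiliser
    e = torch.exp(logits - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```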
Less well-established tweaks
- Sparsification. I could mention the turn to sparse Mixture-of-Experts models, but that was more of a cost-saving measure, and LLaMA is still dense.
- Tokenizer (GPT-2 BPE -> tiktoken / SentencePiece). Basically the same algorithm, but about 25% better compression.
- Lower precision / quantization (FP16 -> BF16 for training; int8 post-training).
- No biases on the QKV projections or the layer norms. (I’m not sure about putting the biases back into attention, but some models, e.g. Qwen2.5, do.)
- Sliding-window attention, with e.g. a rolling buffer KV cache. (Mask sketch after this list.)
- Cross-Layer Attention, which shrinks the KV cache by sharing keys and values across layers.
- WARP
- Regularizing outputs (“soft-capping logits”). (Sketch after this list.)
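A minimal sketch of a sliding-window mask, which is just a band-limited causal mask (the boolean convention, True = may attend, matches PyTorch’s `scaled_dot_product_attention`):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # Position i may attend to positions j with i - window < j <= i.
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=4)
# Only the last `window` keys are ever needed, which is what a rolling buffer cache exploits.
```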
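And a minimal sketch of logit soft-capping in the Gemma 2 style (the cap value is illustrative, not a recommendation):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap); gradients shrink but never hit a hard clip.
    return cap * torch.tanh(logits / cap)
```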
Top open architectures by tweak
Component | Tweak | LLaMA 3 | Gemma 2 | Qwen2.5 | DeepSeek-V2 | Hunyuan-Large |
---|---|---|---|---|---|---|
Attention | Attention kernel | FlashAttention-2? | Eager attention | FlashAttention-2 | FlashAttention-2 | FlashAttention-2 |
Attention | Sliding window attention | No? | Local-Global SWA | Both | No? | No? |
Attention | Removing KV heads | GQA | GQA | GQA | MLA | GQA |
Attention | Cross-Layer Attention | No | No | No | No | CLA |
Attention | Prefill KV cache | Yes | ? | ? | No? | ? |
Attention | Low-rank KV cache compression | No | No | No | Yes | No |
Attention | Biases in QKV projection | No? | ? | QKV biases | ? | No |
Attention | QK Normalization | No? | No? | ? | No? | No |
Block sequence | Parallel layers | No | No? | No? | No? | No? |
Embedding | Position encoding | RoPE | RoPE | RoPE | Decoupled RoPE | Dynamic NTK RoPE (*6) |
Embedding | Tied embeddings | “Shared” (*1) | Tied | Tied (*4) | ? | Tied (*7) |
Optimizer | Regularized / preconditioned | AdamW | AdamW (*3) | ? | AdamW | AdamW |
Activation normalization | Post- or pre-LayerNorm | Pre | Both | ? | Pre | ? |
Activation normalization | Don’t center | RMSNorm (*2) | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
Output normalization | Soft-capped logits | No? | Yes | No? | No? | No? |
Activation function | Gated linear unit | SwiGLU | GeGLU | SwiGLU | SwiGLU | SwiGLU (*5) |
Sparsification | Sparse? | Dense | Dense? | Dense | MoE | MoE |
Weights quantization | BF16 training | Yes | No, FP32 | Yes | Yes | Yes |
Weights quantization | 8-bit post-training | In one version | No | No | No | In one version |
[1] In the 3.2 models anyway
[2] Llama 2 uses RMSNorm anyway
[3] They recommend AdamW for fine-tuning, unsure for training
[4] Only the smaller models
[5] Code says “silu”
[6] “Credits to the Reddit users /u/bloc97 and /u/emozilla”
[7] https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Pretrain/modeling_hunyuan.py#L1419
Caveats
- The above ignores the much more important changes since 2017 to data “collection” (curation and synthesis), cluster infrastructure, post-training, and scaffolding.
- The public tokenizers still use byte-pair encoding.
- Some models have absurdly high embedding-parameter counts. This is unlikely to be a performance optimisation; instead we conjecture it is a tradeoff that allows underreporting the Transformer-parameter count and so entering a lesser model class (“7B”).
- A lot of this doesn’t improve absolute performance that much, but it does make models a lot cheaper to run.
- And this covers only the public architectures. There are probably also public methods which we haven’t yet realised are improvements.
I thank Kushal Thaman for helpful comments.
See also
- https://arxiv.org/html/2410.16682v1
- https://openreview.net/forum?id=d8w0pmvXbZ
Bibliography
- Ainslie, Joshua et al. (2023). “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” https://arxiv.org/abs/2305.13245
- Bannier, P.A. (2023). “Rolling buffer cache” https://github.com/ggerganov/llama.cpp/discussions/3581
- Biderman, Stella et al. (2021). “Rotary Embeddings: A Relative Revolution” https://blog.eleuther.ai/rotary-embeddings/
- Brandon, William; Mishra, Mayank; Nrusimha, Aniruddha; Panda, Rameswar and Ragan-Kelley, Jonathan (2024). “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention” https://arxiv.org/abs/2405.12981
- Chowdhery, Aakanksha et al. (2022). “PaLM: Scaling Language Modeling with Pathways” https://arxiv.org/abs/2204.02311
- Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri and Ré, Christopher (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” https://arxiv.org/abs/2205.14135
- Dehghani, Mostafa et al. (2023). “Scaling Vision Transformers to 22 Billion Parameters” https://arxiv.org/abs/2302.05442
- Gu, Albert and Dao, Tri (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” https://arxiv.org/abs/2312.00752
- Leech, Gavin; Garfinkel, Simson; Yagudin, Misha; Briand, Alexander and Zhuralev, Aleksandr (2024). “Ten Hard Problems in Artificial Intelligence We Must Get Right” https://arxiv.org/abs/2402.04464
- Liu, Aixin et al. (2024). “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” https://arxiv.org/abs/2405.04434
- Liu, Yuxi (2024). “Mixture of Experts” https://yuxi-liu-wired.github.io/essays/posts/mixture-of-experts/
- Loshchilov, Ilya and Hutter, Frank (2019). “Decoupled Weight Decay Regularization” https://arxiv.org/abs/1711.05101v3
- Miller, Evan (2023). “Attention Is Off By One” https://www.evanmiller.org/attention-is-off-by-one.html
- Press, Ofir (2023). “Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation” https://github.com/ofirpress/attention_with_linear_biases/#faq
- Radford, Alec; Wu, Jeffrey; Amodei, Dario; Sutskever, Ilya et al. (2019). “Language Models are Unsupervised Multitask Learners” https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Roldán, J. Carlos (2024). “What is SwiGLU?” https://jcarlosroldan.com/post/348/what-is-swiglu
- Shazeer, Noam (2019). “Fast Transformer Decoding: One Write-Head is All You Need” https://arxiv.org/abs/1911.02150
- Stack Overflow (2024). “How to implement SwiGLU activation? Why does SwiGLU takes in two tensors?” https://stackoverflow.com/questions/79047727/how-to-implement-swiglu-activation-why-does-swiglu-takes-in-two-tensors
- Sun, Xingwu et al. (2024). “Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent” https://arxiv.org/abs/2411.02265
- Vaswani, Ashish et al. (2017). “Attention Is All You Need” https://arxiv.org/abs/1706.03762
- Vyas, Nikhil et al. (2024). “SOAP: Improving and Stabilizing Shampoo using Adam” https://arxiv.org/abs/2409.11321
- Xiong, Ruibin et al. (2020). “On Layer Normalization in the Transformer Architecture” https://arxiv.org/abs/2002.04745
- Zhang, Biao and Sennrich, Rico (2019). “Root Mean Square Layer Normalization” https://arxiv.org/abs/1910.07467