What Are Transformers? #

Transformers are a family of deep learning architectures introduced in the landmark paper “Attention Is All You Need” (Vaswani et al., 2017). Unlike recurrent neural networks (RNNs) and long short-term memory (LSTM) cells, which process tokens one after another, transformers process entire sequences in parallel using self-attention. This design dramatically improves training efficiency on modern accelerators (GPUs and TPUs) because matrix multiplications dominate the workload and map cleanly to hardware.

At a high level, a transformer block stacks layers of self-attention and position-wise feed-forward networks, wrapped with residual connections and layer normalization. Stacking many such blocks yields models that can represent long-range dependencies between words or image patches without explicitly maintaining a hidden state that is updated sequentially. The result is a flexible blueprint used for machine translation, text generation, image classification, speech, and multimodal tasks.

Key idea

Every position in the sequence can directly attend to every other position in a single layer, weighted by learned relevance scores. Depth adds abstraction; width and heads add capacity.

Self-Attention Explained #

Self-attention computes three learned projections of each input token: queries (Q), keys (K), and values (V). Intuitively, a query asks “which other tokens should I read?” Keys describe “what I offer as an address,” and values carry “what content I contribute if selected.” Attention weights are computed by taking scaled dot products between queries and keys, applying softmax to obtain a probability distribution over positions, and forming a weighted sum of values.

Mathematically, with key dimension d_k, one writes Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Dividing by √d_k prevents dot products from growing too large as dimensionality increases, keeping the softmax gradients stable during training. Because softmax is computed across all positions, each token’s representation becomes a mixture of information from the full context, with mixture weights learned end-to-end from data.
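The formula can be sketched in a few lines of NumPy. The projection matrices and toy dimensions below are illustrative, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)        # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, hidden dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
```

Each row of `w` sums to 1, so every output row is a convex combination of the value vectors, weighted by learned relevance.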

Multi-Head Attention #

A single attention map can focus on one type of relationship at a time (for example, syntactic agreement or coreference). Multi-head attention runs several attention mechanisms in parallel, each with its own projection matrices, then concatenates and projects the results. Different heads specialize: one might track local n-gram patterns, another long-distance dependencies, and another semantic relations between entities.

Multi-head attention increases representational capacity without requiring an impractically large single attention matrix. In practice, models use 8–128+ heads depending on size and task; very large language models balance head count with hidden width to maintain throughput during training and inference.
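The mechanics of splitting the hidden dimension across heads can be shown in a minimal NumPy sketch; the 4-head configuration and weight shapes here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d = x.shape
    d_h = d // n_heads                    # per-head dimension
    def project(W):
        # Project, then reshape to (heads, seq, d_h) so heads attend independently.
        return (x @ W).reshape(seq, n_heads, d_h).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)
    heads = softmax(scores) @ V           # (heads, seq, d_h)
    # Concatenate the heads back along the feature axis, then mix with Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq, d)
    return concat @ Wo

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 16))              # 6 tokens, hidden dimension 16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=4)
```

Note that the total parameter count is roughly the same as a single wide head; the heads simply partition the feature space into independent attention maps before the output projection recombines them.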

Parallelism

Attention over a full sequence is highly parallelizable, unlike RNN unrolling, which is inherently sequential along time.

Path length

Any two tokens can interact within a single attention layer, so the maximum path length between positions is O(1), easing gradient flow for long contexts compared to deep RNN stacks.

Inductive bias

Transformers rely less on locality than convolutions; they learn structure from data, which helps language but can require more data for some vision settings.

Positional Encoding #

Self-attention is permutation-invariant if you only feed token embeddings: permuting the input tokens simply permutes the outputs, leaving the pairwise attention pattern unchanged, so the model cannot tell word order apart. Models therefore inject positional information. The original transformer used fixed sinusoidal encodings added to embeddings; many modern systems use learned positional embeddings or rotary positional embeddings (RoPE), which rotate query and key vectors by position-dependent angles and work well in large language models.
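The sinusoidal scheme from the original paper is easy to reproduce; this minimal NumPy sketch uses toy dimensions for illustration:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even indices: sine
    pe[:, 1::2] = np.cos(angles)                 # odd indices: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
```

Each dimension oscillates at a different frequency, so every position receives a unique fingerprint, and nearby positions receive similar ones; the encoding is simply added to the token embeddings before the first attention layer.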

For very long contexts, researchers also explore relative position biases, ALiBi-style slopes, and sparse or linear attention variants to balance quality with compute. The right positional scheme can improve extrapolation to longer sequences than seen during training.

Encoder–Decoder Architecture #

The classical transformer for translation has an encoder and a decoder. The encoder is a stack of self-attention and feed-forward layers that builds contextualized representations of the source sequence. The decoder generates the target token by token; it uses masked self-attention (so positions cannot attend to future tokens) and cross-attention into the encoder outputs to condition on the source.
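The masking step in the decoder can be illustrated directly: adding negative infinity above the diagonal of the score matrix before softmax zeroes out attention to future positions. A minimal NumPy sketch, using uniform scores for clarity:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal: position t may attend only to positions <= t.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Uniform scores plus the mask; exp(-inf) = 0, so future positions drop out.
scores = np.zeros((4, 4)) + causal_mask(4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

The resulting weight matrix is lower triangular: the first token attends only to itself, the second splits attention over the first two positions, and so on, which is what lets the decoder be trained on all positions in parallel while still generating left to right.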

Encoder-only models (for example, BERT) excel at understanding and classification because bidirectional context is allowed. Decoder-only models (for example, GPT) are autoregressive generators: each step predicts the next token from all previous tokens. Encoder–decoder designs remain common in translation, summarization, and multimodal tasks where a clear separation between conditioning and generation helps.

Why Transformers Revolutionized NLP #

Before transformers, the best NLP systems often combined RNNs or CNNs with attention as an add-on. Making attention the core operator unlocked stable scaling: bigger datasets, wider layers, and longer training produced predictable quality gains. Pre-training on large corpora (masked language modeling, next-token prediction, or span corruption) followed by task-specific fine-tuning became the dominant recipe.

This shift enabled unified models for dozens of benchmarks, reduced bespoke feature engineering, and paved the way for few-shot learning when model and data scale grew further. Transfer learning with transformers turned NLP from a collection of small specialized models into a platform of large foundation models.

BERT, GPT, and Other Transformer Models #

BERT (Bidirectional Encoder Representations from Transformers) uses a deep encoder stack and masked language modeling to learn contextual embeddings. It powers sentence classification, named entity recognition, and semantic search when fine-tuned or distilled.

GPT (Generative Pre-trained Transformer) families are decoder-only models trained to predict the next token. They underpin assistants, code models, and creative writing tools. Variants add instruction tuning, reinforcement learning from human feedback, and tool use.

Other influential architectures include T5 (text-to-text transfer transformer), Vision Transformer (ViT) for images, Whisper for speech, and domain-specific models for protein folding, time series, and tabular data. Together they show that the transformer is not a single model but a reusable pattern: tokenize inputs, add positions, stack attention, and train at scale.

Study checklist

  • Trace how Q, K, V are computed from embeddings and how softmax weights form.
  • Contrast encoder-only, decoder-only, and encoder–decoder stacks.
  • Compare sinusoidal, learned, and rotary positional encodings.
  • Relate pre-training objectives (MLM vs. next-token) to downstream behavior.