Architectures encode assumptions
Neural architectures bake in inductive biases: locality and translation invariance for images, temporal ordering for speech, permutation sensitivity for sets. Picking the right family reduces sample complexity and compute compared with forcing a generic model to rediscover structure from scratch.
Feedforward neural networks #
Feedforward networks (multilayer perceptrons) connect each layer to the next without cycles. They are appropriate when inputs can be represented as fixed-length vectors and there is no inherent ordering beyond what you engineer into features—examples include tabular embeddings after preprocessing, or flattened inputs when spatial structure is not exploited.
Depth increases expressive power; width provides capacity per layer. Universal approximation theorems guarantee that sufficiently large shallow networks can represent broad function classes, yet depth often yields more parameter-efficient representations for compositional patterns common in vision and language.
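To make "layers of affine maps plus nonlinearities" concrete, here is a minimal numpy sketch of a feedforward forward pass; the function name, layer sizes, and initialization are illustrative choices, not a reference implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Feedforward pass: alternate affine maps and ReLU nonlinearities,
    with a plain linear map at the output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                   # hidden layers
    return h @ weights[-1] + biases[-1]       # linear output layer

rng = np.random.default_rng(0)
dims = [8, 32, 32, 3]                         # width 32, two hidden layers
Ws = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(dims[:-1], dims[1:])]
bs = [np.zeros(o) for o in dims[1:]]
out = mlp_forward(rng.standard_normal(8), Ws, bs)
print(out.shape)  # (3,)
```

Depth here means more entries in `dims`; width means larger hidden sizes. Both change capacity, but in different ways, as the paragraph above notes.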
Convolutional Neural Networks (CNNs) #
CNNs are designed for grid-structured data such as images or volumetric scans. Convolution layers apply learnable filters locally, sharing weights across spatial positions—implementing translation equivariance. Pooling and striding increase receptive fields while reducing resolution.
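The weight sharing and translation equivariance described above can be verified directly with a toy convolution; this is a deliberately naive sketch (real frameworks use optimized batched kernels), and `conv2d_valid` is a name chosen for this example.

```python
import numpy as np

def conv2d_valid(img, kern):
    """Valid 2D cross-correlation: the same (shared) filter weights are
    applied at every spatial position."""
    H, W = img.shape
    kh, kw = kern.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kern)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
kern = rng.standard_normal((3, 3))

# Translation equivariance: shifting the input shifts the output.
shifted = np.roll(img, shift=1, axis=1)       # shift image one pixel right
a = conv2d_valid(img, kern)
b = conv2d_valid(shifted, kern)
print(np.allclose(a[:, :-1], b[:, 1:]))       # True (away from the wrap-around column)
```

The final check is exactly what "translation equivariance" means: translate the input, and the feature map translates with it.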
Classic stacks alternate convolutions, nonlinearities, and pooling; modern designs use residual connections, depthwise separable convolutions, and attention bottlenecks. Applications span object detection (R-CNN family, YOLO), segmentation (U-Net), medical imaging, satellite analytics, and video understanding when paired with temporal modules.
Recurrent Neural Networks (RNNs) #
RNNs maintain a hidden state updated at each time step, suited to sequential data: speech frames, sensor traces, text tokens, stock series. They map variable-length inputs to outputs per step or a single summary vector.
Vanilla RNNs struggle with long-range dependencies due to vanishing gradients; practical systems historically moved to gated variants. Bidirectional RNNs consume both past and future context when the full sequence is available offline—common in NLP tagging before transformers dominated.
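The hidden-state recurrence behind both paragraphs fits in a few lines; this is a sketch of a vanilla (ungated) RNN, with illustrative names and toy dimensions.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Vanilla RNN: one shared cell applied per time step; the hidden
    state h carries everything the model remembers about the past."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                          # variable-length input, same weights each step
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)               # per-step outputs; states[-1] is a summary vector

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 16, 10
hs = rnn_forward(rng.standard_normal((T, d_in)),
                 rng.standard_normal((d_in, d_h)) * 0.1,
                 rng.standard_normal((d_h, d_h)) * 0.1,
                 np.zeros(d_h))
print(hs.shape)  # (10, 16)
```

Because gradients flow backward through the repeated `tanh(... @ W_hh ...)` product, they shrink (or blow up) over long horizons, which is the vanishing-gradient problem the gated variants address.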
Long Short-Term Memory (LSTM) #
LSTMs introduce memory cells with input, forget, and output gates that regulate information flow, enabling more stable learning over hundreds of steps. GRUs simplify gating with fewer parameters while often matching performance on medium-length sequences.
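One LSTM step, with all three gates visible, can be sketched as follows; packing the four gate pre-activations into a single matrix `W` is a common convention, but the names and sizes here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x; h] to four gate pre-activations; the
    cell state c is updated additively, which stabilizes gradient flow."""
    z = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                  # candidate cell update
    c = f * c + i * g                               # gated memory update
    h = o * np.tanh(c)                              # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.standard_normal((d_in + d_h, 4 * d_h)) * 0.1
b = np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for x in rng.standard_normal((20, d_in)):          # run over a short sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```

The additive update `c = f * c + i * g` is the key difference from the vanilla RNN: information can pass through many steps without being squashed through a nonlinearity at each one.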
LSTMs remain relevant for low-latency streaming settings, small-footprint deployment, and industrial time-series forecasting where transformer overhead is unnecessary. They also pair with CNNs for video or with CTC losses for speech alignment.
Transformers: architecture overview #
Transformers replace recurrence with self-attention: each position attends to all others to build contextualized representations. Multi-head attention runs parallel attention mechanisms; positional encodings inject order information. Encoder–decoder models powered translation; decoder-only stacks enable large language models.
Advantages include parallelization across sequence length during training and flexible long-range dependencies. Costs scale quadratically with sequence length in standard attention, motivating sparse, linear, or sliding-window variants for long documents and genomics.
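Single-head scaled dot-product attention, the core operation both paragraphs describe, can be sketched in numpy; the names and toy dimensions are illustrative, and real multi-head attention runs several of these in parallel with learned projections per head.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. The T x T score matrix is where the
    quadratic cost in sequence length T comes from."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # each position mixes all values

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (6, 8)
```

All `T` positions are processed in one matrix multiply rather than a sequential loop, which is the parallelization advantage; the `(T, T)` score matrix is the quadratic cost that sparse and linear attention variants try to avoid.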
When to use which architecture #
| Data & task | Strong starting points | Notes |
|---|---|---|
| Images / video frames | CNNs, Vision Transformers | ViTs excel with large data; CNNs remain efficient on smaller sets. |
| Short sequences, streaming | LSTM/GRU, 1D CNNs | Lower latency and memory than full attention. |
| Text, code, long-range context | Transformers | Pretrained LLMs dominate; fine-tune or prompt for downstream tasks. |
| Tabular features | MLPs, trees, boosting | Try gradient boosting first; deep nets if embeddings help. |
| Sets / graphs | Graph neural networks, set transformers | Respect relational structure explicitly. |
Practical selection tips #
Start from proven baselines in your modality, match model capacity to dataset size, and measure calibration on a holdout that reflects deployment skew.
Hybrid systems #
Combine CNN backbones with transformer heads, or use convolutional stem plus attention for efficiency—many production models blend ideas rather than using families in isolation.
No architecture is universally best; constraints—latency, memory, interpretability, and data availability—drive the final design. Understanding these families lets you navigate literature, pretrained checkpoints, and tooling with confidence as models continue to evolve.
Scaling, efficiency, and emerging directions #
As models grow, efficiency becomes a first-class design goal: knowledge distillation from teacher to student networks, quantization to int8 or lower for deployment, pruning structured or unstructured weights, and neural architecture search to tailor depth and width to hardware budgets. Attention alternatives—linear attention, state-space models, and convolutional hybrids—seek to preserve long-range modeling with subquadratic cost for long sequences.
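As a concrete instance of one of these techniques, here is a minimal sketch of symmetric per-tensor int8 post-training quantization; production toolchains add per-channel scales, zero points, and calibration, so treat the function names and the single-scale scheme as simplifying assumptions.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]
    using one scale factor derived from the largest magnitude."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8 storage, error within one step
```

The payoff is 4x smaller storage than float32 and integer arithmetic on deployment hardware, at the cost of the rounding error measured above.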
Multimodal architectures combine vision, language, and audio encoders with shared fusion layers, enabling assistants that reason over images and documents together. In robotics, diffusion policies and transformer-based planners sit atop perception stacks. The architectural menu keeps expanding, but the decision framework stays the same: match structure to data geometry, quantify trade-offs on representative workloads, and validate under real deployment constraints—not just offline accuracy.
Educators often teach these families in progression—from perceptrons to CNNs, RNNs, and transformers—because each introduces abstractions that are reused later. Hands-on exercises comparing a small MLP to a convolutional baseline on CIFAR, or an LSTM to a transformer on a text classification task, build intuition for inductive bias faster than equations alone. That experiential contrast is one of the fastest ways to internalize why architecture choice is not cosmetic but consequential.
Keep a checklist when selecting architecture: data modality, sequence length, label noise, latency budget, memory ceiling, and whether pretrained weights exist for your domain. Those six constraints narrow the search space quickly and keep experiments focused on decisions that materially affect outcomes.