
Residual Connections Explained: The Trick That Made Deep Networks Trainable

khaled · October 5, 2025 · 4 min read

In 2015, He et al. won ImageNet with a 152-layer network — a depth that would have been untrainable with standard architectures a year earlier. The enabling innovation was residual connections (skip connections): a simple architectural addition that reformulated what each layer needs to learn, and in doing so, solved the degradation problem that had made deep networks worse than shallower ones.

The Degradation Problem

A surprising empirical finding from the mid-2010s: stacking more layers onto a plain network does not keep improving accuracy, and beyond a certain depth it actively hurts. This was not simply overfitting; the training error also increased with depth. A 56-layer plain CNN performed worse on CIFAR-10 than a 20-layer plain CNN, even on the training set.

Nor was the cause vanishing gradients alone. With careful initialisation and batch normalisation, gradients could still flow back through deep plain networks. The real difficulty was optimisation: as depth grew, the loss landscape became harder to navigate and the gradient signals became noisy and inconsistent over long paths.

The Residual Formulation

The key insight of ResNets: instead of asking a layer to learn a direct mapping H(x), ask it to learn the residual F(x) = H(x) - x. The output of the block is then F(x) + x.

output = F(x) + x

where x is the input, carried forward unchanged by the skip connection, and F(x) is the residual learned by the block's layers.

Why does this help? If the optimal transformation for a block is close to the identity (a common case in deep networks where later layers often make small refinements), learning F(x) ≈ 0 is much easier than learning H(x) ≈ x from scratch. Residual connections make the identity mapping the natural baseline that layers refine, rather than a mapping that must be explicitly learned.
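
To make the formulation concrete, here is a minimal residual block sketched in PyTorch (an illustrative implementation that assumes equal input and output channels, not the exact block from the paper): two convolutions form F(x), and the input is added back before the final activation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Minimal residual block: output = ReLU(F(x) + x)
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        out = self.bn2(self.conv2(out))            # second half of F(x)
        return torch.relu(out + x)                 # add the skip connection, then activate

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)                  # torch.Size([1, 64, 32, 32])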

Gradient Flow Through Skip Connections

The backward pass benefit of skip connections: the gradient flows through two paths — through the learned layers F(x) and directly through the skip connection. The skip connection provides a gradient highway: even if the gradient through F(x) vanishes, the gradient through the direct path remains intact.

This is why residual networks do not suffer from vanishing gradients in the way plain networks do: because the derivative of F(x) + x with respect to x contains an identity term, the gradient arriving at a block's output is passed back to its input unattenuated through the skip path, regardless of what happens inside F(x).
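
A tiny autograd experiment (an illustrative sketch, not from the article) makes this concrete: when the residual branch is nearly switched off, the gradient through the block stays close to 1, while the gradient through a plain layer with the same tiny weight collapses.

import torch

x = torch.tensor(1.0, requires_grad=True)
w = 1e-6                                   # residual branch F(x) = w * x, almost switched off

y_plain = w * x                            # plain layer
y_res = x + w * x                          # residual block: output = F(x) + x

g_plain, = torch.autograd.grad(y_plain, x)
g_res, = torch.autograd.grad(y_res, x)
print(g_plain.item())                      # ~1e-06: the gradient all but vanishes
print(g_res.item())                        # ~1.000001: the identity path keeps it alive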

Residual Connections Beyond ResNets

The impact of residual connections extends far beyond image classification:

Transformers: the transformer architecture uses residual connections around both the attention sub-layer and the feed-forward sub-layer; in the original post-LN formulation, output = LayerNorm(x + Sublayer(x)). Without these skip connections, training transformers with 12+ layers is unstable.

DenseNet: extends the residual concept by connecting each layer to all subsequent layers within a dense block, rather than just the immediately preceding layer. This extreme connection density improves gradient flow and feature reuse but increases memory requirements.

U-Net: uses skip connections between the encoder and decoder paths to preserve spatial information that the pooling layers in the encoder would otherwise discard. Essential for image segmentation.

Highway Networks: a generalisation of residual connections where a learned gating mechanism controls how much of the input is passed through vs. transformed.
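
A minimal sketch of the Highway idea (a hypothetical single layer, not the original architecture): a learned sigmoid gate T(x) interpolates between the transformed signal and the untouched input, so the layer can range smoothly from a plain layer (T ≈ 1) to a pure identity (T ≈ 0).

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    # Gated skip connection: y = T(x) * H(x) + (1 - T(x)) * x
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # produces H(x)
        self.gate = nn.Linear(dim, dim)       # produces the gate T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))     # candidate transformation
        t = torch.sigmoid(self.gate(x))       # how much to transform vs. carry through
        return t * h + (1 - t) * x

x = torch.randn(8, 256)
print(HighwayLayer(256)(x).shape)             # torch.Size([8, 256])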

Pre-Activation vs. Post-Activation Residual Blocks

The original ResNet (He et al., 2015) used a post-activation structure: output = ReLU(F(x) + x). The improved ResNet-v2 (He et al., 2016) uses pre-activation: batch normalisation and ReLU are moved to the start of the residual branch, giving output = x + F(ReLU(BN(x))), with nothing applied after the addition.

Pre-activation residual blocks are easier to analyse because the identity path is completely clean — no activation function modifies the skip connection. This leads to better gradient flow and improved performance on very deep networks (1000+ layers).
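
A minimal sketch of a pre-activation block (illustrative, again assuming equal input and output channels): BN and ReLU sit at the start of the residual branch, and the skip path carries x to the addition completely untouched.

import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    # Pre-activation block (ResNet-v2 style): output = x + F(ReLU(BN(x)))
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))   # BN -> ReLU -> conv
        out = self.conv2(torch.relu(self.bn2(out))) # BN -> ReLU -> conv
        return x + out                              # no activation touches the skip path

x = torch.randn(1, 64, 32, 32)
print(PreActResidualBlock(64)(x).shape)             # torch.Size([1, 64, 32, 32])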

Practical Implications

  • Always use residual connections when building networks deeper than 10-15 layers
  • For transformers, the choice between post-LN (normalisation after the residual add) and pre-LN (normalisation before the sub-layer) matters for stability: pre-LN is more stable for very deep transformers (see the sketch after this list)
  • Residual connections add negligible compute cost but dramatically expand the practical depth of trainable networks
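
A minimal sketch of the two orderings around a generic sub-layer (illustrative; the feed-forward network here stands in for either attention or the MLP): post-LN normalises after the residual addition, while pre-LN normalises the sub-layer's input and leaves the skip path untouched.

import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Original transformer ordering: output = LayerNorm(x + Sublayer(x))
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN ordering: output = x + Sublayer(LayerNorm(x)); the skip path stays clean
    return x + sublayer(norm(x))

d_model = 512
sublayer = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
norm = nn.LayerNorm(d_model)
x = torch.randn(2, 16, d_model)
print(post_ln(x, sublayer, norm).shape, pre_ln(x, sublayer, norm).shape)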

Conclusion

Residual connections are one of the most important architectural innovations in deep learning history. By reformulating the learning problem from direct mapping to residual learning, and by providing a clean gradient highway through the network, they enabled the depth at which modern deep networks operate. Their presence in virtually every modern architecture — CNNs, transformers, U-Nets — reflects how fundamental this insight has proven to be.

Keywords: residual connections, ResNet, skip connections, deep learning, gradient flow, residual learning, transformer architecture, deep neural networks, vanishing gradient