Batch Normalization vs Layer Normalization: When to Use Which

khaled December 19, 2023 4 mins read

Normalisation layers are a near-universal ingredient in modern deep learning architectures. They stabilise training, accelerate convergence, and often improve generalisation. But Batch Normalization (BN) and Layer Normalization (LN) normalise over different dimensions and have different properties, so each is clearly better suited to certain contexts. Understanding the distinction prevents the subtle performance issues that arise from applying the wrong normalisation strategy.

How They Differ

Batch Normalization (Ioffe & Szegedy, 2015) normalises each feature across the batch dimension. For a mini-batch of N examples, each with C channels and spatial dimensions H×W, BN normalises each (channel) feature map across all N×H×W positions:

  • Mean and variance are computed over the batch for each channel
  • At inference, running statistics accumulated during training replace the batch statistics
  • Parameters: learned scale (γ) and bias (β) per channel
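
A minimal sketch of these statistics, assuming PyTorch (torch.nn.BatchNorm2d): the per-channel mean and variance computed over the batch and spatial positions reproduce the layer's training-mode output when the learned scale and bias are left at their default initial values.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 16, 16)            # (N, C, H, W)

bn = nn.BatchNorm2d(64)                    # gamma initialised to 1, beta to 0
bn.train()
out_bn = bn(x)

# Manual equivalent: statistics over the batch and spatial dims, per channel.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
out_manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out_bn, out_manual, atol=1e-5))   # True
```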

Layer Normalization (Ba et al., 2016) normalises across the feature dimension for each individual example:

  • Mean and variance are computed over all features of a single example
  • No dependence on other examples in the batch — works identically at training and inference
  • Parameters: learned scale (γ) and bias (β) per feature position

The critical difference: BN's statistics depend on the batch; LN's statistics depend only on the individual example.
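
This independence is easy to verify. The sketch below (a minimal illustration, assuming PyTorch) normalises the same example alone and inside a larger batch: LN returns identical results either way, while a BN layer in training mode does not.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 16, 16)                      # (N, C, H, W)

# LN: statistics per example over its (C, H, W) features.
ln = nn.LayerNorm([64, 16, 16], elementwise_affine=False)
print(torch.allclose(ln(x)[:1], ln(x[:1])))          # True: no batch dependence

# BN: statistics over the batch, so the same example's output changes
# depending on what else is in the batch.
bn = nn.BatchNorm2d(64).train()
print(torch.allclose(bn(x)[:1], bn(x[:1])))          # False in training mode
```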

When Batch Normalisation Works Best

CNNs on images: BN was designed for convolutional networks processing images in fixed-size mini-batches. With batch sizes of 32-512, the batch statistics are reliable estimates of the population statistics. BN provides strong regularisation (batch noise acts as a regulariser) and consistently improves convergence.

Reasonably large, fixed batch sizes: BN degrades with small batch sizes (< 8). When the batch size is 1 or 2, the batch statistics are unreliable estimates, and Group Normalisation is preferred for small-batch settings.

Simple feedforward networks: BN's different training/inference behaviour (running stats vs. batch stats) is manageable when the architecture is straightforward.

When Layer Normalisation Works Best

Transformers and NLP: text sequences have variable length; batches contain sequences of different lengths padded to the same length. Normalising across the batch would mix meaningful activations with padding. LN normalises within each token's feature dimension, making it batch-size and sequence-length independent. All major transformer architectures (BERT, GPT, T5) use LN.
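
As a hedged illustration (assuming PyTorch and a typical (batch, seq_len, d_model) activation tensor), LN over the hidden dimension gives every token zero mean and unit variance independently of batch size, sequence length, or padding:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 128, 512
x = torch.randn(batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)        # normalise over the last (feature) dim only
out = ln(x)                       # shape unchanged: (batch, seq_len, d_model)

# Every token position is normalised independently of other tokens and
# of other sequences in the batch.
print(out.mean(dim=-1).abs().max() < 1e-5)    # per-token mean is ~0
```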

Recurrent Networks (RNNs, LSTMs): similar argument — hidden states have different lengths across time steps and batch examples. LN normalises each time step independently.

Batch size 1: LN works perfectly at batch size 1; BN does not. For online inference or architectures that process one example at a time, LN is the only practical choice.
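
A small illustration of that constraint, assuming PyTorch (the exact error text may vary by version): a 1-D BatchNorm layer cannot estimate a variance from a single feature-vector example in training mode, whereas LayerNorm is unaffected.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256)            # a single example with 256 features

ln = nn.LayerNorm(256)
print(ln(x).shape)                 # torch.Size([1, 256]): works at batch size 1

bn = nn.BatchNorm1d(256).train()
try:
    bn(x)
except ValueError as err:
    print(err)                     # "Expected more than 1 value per channel when training ..."
```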

Generative models: normalising across the batch in GANs and VAEs can cause undesirable interactions between examples in the same batch. LN (or Instance Normalisation, which normalises per-channel per-example) is preferred.

Group Normalisation: The Middle Ground

Group Normalization (Wu & He, 2018) divides channels into groups and normalises within each group for each example independently. It is BN-like in structure (operates on channels) but LN-like in batch independence (no batch statistics).

GN is the preferred choice when BN would be appropriate but batch sizes are necessarily small (object detection, medical imaging, where large images constrain batch size to 2-4).
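
A minimal sketch, assuming PyTorch's nn.GroupNorm: because the statistics are computed per example within channel groups, the output for a given image is the same whether it is processed alone or inside a small batch.

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=32, num_channels=64)

x = torch.randn(2, 64, 32, 32)                     # tiny batch, large feature maps
print(torch.allclose(gn(x)[:1], gn(x[:1])))        # True: no batch statistics involved
```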

Instance Normalisation

Instance Normalization normalises each channel of each example independently (the most local possible normalisation). It is particularly effective for style transfer tasks where normalising out per-instance style statistics (and then re-injecting them via Adaptive Instance Normalisation) enables style manipulation.
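
One way to see where it sits on the spectrum (a sketch, assuming PyTorch): Instance Normalisation is the limiting case of Group Normalisation with one channel per group.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 64, 32, 32)

inorm = nn.InstanceNorm2d(64, affine=False)
gn = nn.GroupNorm(num_groups=64, num_channels=64, affine=False)   # one group per channel

print(torch.allclose(inorm(x), gn(x), atol=1e-5))   # True: same per-channel, per-example stats
```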

Pre-LN vs. Post-LN in Transformers

For transformers specifically, the placement of LN matters:

Post-LN (original BERT/Transformer): output = LN(x + Sublayer(x)). The residual path passes through LN; this can cause instability in very deep transformers.

Pre-LN (GPT-2 and later): output = x + Sublayer(LN(x)). The residual path is clean; LN is applied within the sub-layer. Pre-LN is more stable for very deep transformers (>12 layers) and is the standard in most modern large language models.
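
The two placements are easiest to compare side by side. The sketch below is schematic rather than a full transformer block (assuming PyTorch; sublayer stands in for attention or the feed-forward network):

```python
import torch
import torch.nn as nn

d_model = 512
sublayer = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                         nn.Linear(d_model, d_model))
norm = nn.LayerNorm(d_model)

def post_ln_block(x):
    # Original Transformer / BERT: normalise after the residual addition,
    # so every gradient to earlier layers passes through a LayerNorm.
    return norm(x + sublayer(x))

def pre_ln_block(x):
    # GPT-2 style: normalise only the sub-layer input; the residual path
    # stays an identity, which is more stable for very deep stacks.
    return x + sublayer(norm(x))

x = torch.randn(4, 128, d_model)
print(post_ln_block(x).shape, pre_ln_block(x).shape)   # both (4, 128, 512)
```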

Decision Guide

Architecture                        Recommended normalisation
CNN, large batch size               Batch Normalisation
CNN, small batch size (< 8)         Group Normalisation
Transformer / LLM                   Layer Normalisation (Pre-LN for depth > 12)
RNN / LSTM                          Layer Normalisation
Style transfer                      Instance Normalisation
GAN                                 Instance or Layer Normalisation

Conclusion

Batch Normalisation and Layer Normalisation solve the same instability problem by different means: BN normalises across the batch; LN normalises across features within each example. The right choice is determined by your architecture and batch size constraints — not personal preference. Using LN in a CNN forfeits the regularisation benefits of batch noise; using BN in a transformer creates batch-size dependencies and sequence-length complications. Know which context you are in and choose accordingly.

Keywords: batch normalization, layer normalization, group normalization, normalisation layers, transformer architecture, CNN training, deep learning normalisation, Pre-LN Post-LN