Weight Initialization Matters More Than You Think
Weight initialisation is the step that determines whether your neural network trains smoothly, struggles with vanishing gradients, or explodes in the first backward pass. In the early days of deep learning, before batch normalisation and residual connections, bad initialisation was the primary reason deep networks were untrainable. Even today, with those tools available, understanding why initialisation matters and choosing the right scheme for your architecture and activation functions can be the difference between smooth training curves and long debugging sessions.
The Problem With Random Initialisation (Naive Version)
The simplest initialisation: sample weights from a standard normal distribution N(0, 1). This seems reasonable — zero mean, unit variance. In practice, this causes exploding activations and exploding gradients in deep networks.
Consider a linear network with 100 layers, each applying an n × n weight matrix W with entries sampled from N(0, 1). The output of layer k is the product of k random matrices, and each multiplication scales the variance of the activations by roughly n, the layer width. The variance therefore grows exponentially with depth: after 100 layers it is on the order of n^100, which overflows floating point almost immediately.
The equally naive fix, initialising with very small weights (say, a standard deviation of 0.01), causes the opposite problem: vanishing activations. Outputs shrink towards zero through the layers, gradients vanish during backpropagation, and the network cannot learn.
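To see both failure modes concretely, here is a minimal sketch (assuming PyTorch; the width, depth, and batch size are arbitrary) that pushes a batch through a purely linear 100-layer stack and prints the activation standard deviation at a few depths:

import torch

def activation_std(weight_std, depth=100, width=512):
    x = torch.randn(1024, width)                    # batch of 1024 random inputs
    for layer in range(1, depth + 1):
        w = torch.randn(width, width) * weight_std  # naive Gaussian weights
        x = x @ w                                   # purely linear layer, no activation
        if layer in (1, 10, 50, 100):
            print(f"weight std {weight_std}: layer {layer:3d}, activation std {x.std().item():.3e}")

activation_std(1.0)    # standard deviation overflows to inf/nan within a few dozen layers
activation_std(0.01)   # standard deviation underflows to zero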
Xavier / Glorot Initialisation
Glorot and Bengio (2010) derived an initialisation scheme that keeps the variance of activations roughly constant across layers for tanh and sigmoid activations. The key insight: for a layer with n_in inputs and n_out outputs, initialise weights from:
Uniform(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
or the variance-matched normal version, Normal(0, 2/(n_in + n_out)), where the second argument is the variance.
This ensures that the variance of activations and the variance of gradients are both approximately preserved through the layer during forward and backward passes. For deep networks with tanh activations, Xavier initialisation enables training where naive initialisation fails entirely.
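As a quick sanity check, here is a sketch (assuming PyTorch) in the purely linear setting, where the variance preservation is exact in expectation; tanh, operating mostly near its linear regime, approximately inherits the property:

import math
import torch

width, depth = 512, 100
xavier_std = math.sqrt(2.0 / (width + width))       # Var(W) = 2 / (n_in + n_out)
x = torch.randn(1024, width)
for _ in range(depth):
    x = x @ (torch.randn(width, width) * xavier_std)
# Stays within the same order of magnitude as the input std of 1.0,
# in contrast to the naive runs above.
print(f"activation std after {depth} layers: {x.std().item():.3f}")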
In PyTorch: torch.nn.init.xavier_uniform_() and torch.nn.init.xavier_normal_().
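A minimal usage sketch for a tanh MLP (the network here is just an illustrative example, not a recommended architecture):

import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.Tanh(),
                      nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # The tanh gain rescales the Xavier bound for the tanh nonlinearity.
        nn.init.xavier_uniform_(module.weight, gain=nn.init.calculate_gain('tanh'))
        nn.init.zeros_(module.bias)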
He / Kaiming Initialisation
Xavier was derived for symmetric activations with zero mean (like tanh). ReLU activations are not symmetric — they zero out negative values, effectively halving the variance per layer. He et al. (2015) corrected for this:
Normal(0, 2/n_in), with 2/n_in again being the variance.
The factor of 2, compared with the 1/n_in variance that would keep activations constant through a purely linear layer, compensates for the variance ReLU removes. He initialisation makes very deep ReLU networks (tens of layers) trainable without batch normalisation.
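The halving is easy to verify numerically. The quantity the derivation actually tracks is the second moment of a zero-mean input, and ReLU keeps roughly half of it; a minimal check (assuming PyTorch):

import torch

x = torch.randn(1_000_000)                  # zero-mean, unit-variance pre-activations
print(x.pow(2).mean().item())               # ~1.0: second moment of the input
print(torch.relu(x).pow(2).mean().item())   # ~0.5: ReLU keeps half the second moment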
In PyTorch: torch.nn.init.kaiming_uniform_() and torch.nn.init.kaiming_normal_(). For Leaky ReLU, the a parameter of the kaiming functions should match the negative slope.
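A minimal sketch for the Leaky ReLU case (the negative slope and layer sizes are just example values):

import torch.nn as nn

negative_slope = 0.1                        # must match the LeakyReLU layers in the model
layer = nn.Linear(512, 512)

# Passing a=negative_slope makes the gain sqrt(2 / (1 + a^2)), slightly smaller than
# ReLU's sqrt(2) because Leaky ReLU keeps some variance on negative inputs.
nn.init.kaiming_normal_(layer.weight, a=negative_slope,
                        mode='fan_in', nonlinearity='leaky_relu')
nn.init.zeros_(layer.bias)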
Orthogonal Initialisation
For recurrent networks (RNNs, LSTMs) and very deep networks, orthogonal initialisation, which sets each weight matrix to a random orthogonal matrix, preserves norms exactly through every linear map in both the forward and backward pass; the activation functions in between can still change them. An orthogonal matrix W satisfies W^T W = I, so its singular values are all 1 and multiplying by W does not change vector norms.
Orthogonal initialisation helps with training very deep networks even in the absence of batch normalisation and is particularly important for RNNs, where weight matrices are applied repeatedly across time steps.
In PyTorch: torch.nn.init.orthogonal_().
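A minimal sketch applying it to the recurrent weights of an LSTM (the layer sizes are arbitrary):

import torch.nn as nn

rnn = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

for name, param in rnn.named_parameters():
    if 'weight_hh' in name:
        # Hidden-to-hidden matrices: applied repeatedly across time steps.
        nn.init.orthogonal_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)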
Initialisation and Batch Normalisation
Batch normalisation normalises layer activations to have zero mean and unit variance during training, partially mitigating the impact of poor initialisation. This has led some practitioners to treat initialisation as less important in modern networks.
However, even with batch normalisation, initialisation affects:
- Training speed in the first few epochs: better initialisation means fewer epochs to reach a given loss level
- Networks that do not use batch norm: transformers, which use layer normalisation placed either before (Pre-LN) or after (Post-LN) the attention blocks, and deployment-optimised small networks that omit normalisation entirely
Understanding initialisation remains important for debugging training instabilities and for architectures designed to run without normalisation.
Practical Recommendations
- ReLU / Leaky ReLU activations: He (Kaiming) initialisation, fan_in mode
- Tanh / Sigmoid activations: Xavier (Glorot) initialisation
- RNNs and very deep MLPs: Orthogonal initialisation for weight matrices
- Transformers: typically a slightly modified Xavier scheme, sometimes with extra scaling factors on specific layers (e.g., residual output projections scaled by 1/√(2×n_layers)); see the sketch after this list
- Biases: always zero-initialise them; uniform or normal bias initialisation is unnecessary because the random weights already break symmetry
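A minimal sketch of that residual-projection scaling (the sizes, depth, and layer chosen are illustrative; the 1/√(2×n_layers) factor follows the convention mentioned in the transformer bullet above):

import math
import torch
import torch.nn as nn

n_layers = 12                                   # illustrative transformer depth
out_proj = nn.Linear(768, 768)                  # e.g. the output projection of an attention block

nn.init.xavier_normal_(out_proj.weight)
with torch.no_grad():
    # Shrink the residual-branch projection so the sum over 2*n_layers residual
    # contributions keeps roughly unit variance.
    out_proj.weight.mul_(1.0 / math.sqrt(2 * n_layers))
nn.init.zeros_(out_proj.bias)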
Conclusion
Initialisation shapes the gradient signal from the first backward pass. The right scheme — He for ReLU, Xavier for tanh, orthogonal for recurrent weights — ensures that the variance of activations and gradients is preserved across depth, enabling effective learning from the start. Batch normalisation forgives many initialisation sins, but understanding initialisation is essential for debugging, custom architectures, and networks where normalisation cannot be applied.
Keywords: weight initialization, He initialization, Xavier initialization, Glorot, Kaiming, neural networks, vanishing gradients, ReLU activation, deep learning training