Interpreting Neural Network Decisions With Activation Maps
Neural networks are often described as black boxes — and for high-stakes applications, that description is unacceptable. When a model rejects a loan application, flags a medical image, or filters content, the question "why?" is not optional. Activation maps and attribution methods provide principled answers, highlighting which parts of an input drove the model's decision. This guide covers the most important techniques and their practical limitations.
Class Activation Mapping (CAM)
Class Activation Mapping (Zhou et al., 2016) identifies which spatial regions of an image a CNN uses to classify it. At explanation time, the classifier weights for the target class are projected back onto the feature maps of the final convolutional layer: each map is multiplied by its class weight and the weighted maps are summed, producing a heatmap that localises the discriminative image regions.
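To make the computation concrete, here is a minimal NumPy sketch. The function name is ours, and it assumes you have already extracted the final conv-layer activations and the classifier weights for the target class; it is an illustration of the idea, not code from the paper.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM heatmap for one class.

    feature_maps: array of shape (K, H, W), activations of the last conv layer.
    class_weights: array of shape (K,), the classifier weights connecting the
        global-average-pooled features to the target class.
    """
    # M_c(x, y) = sum_k w_k^c * f_k(x, y): weight each map, sum over channels.
    cam = np.tensordot(class_weights, feature_maps, axes=1)
    # Clip negatives and normalise to [0, 1] for display as a heatmap.
    cam = np.maximum(cam, 0.0)
    return cam / (cam.max() + 1e-8)
```

Upsampling the (H, W) result to the input resolution and overlaying it on the image gives the familiar heatmap visualisation.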
CAM requires a global-average-pooling layer immediately before the classifier, which limits it to architectures built with that constraint. It nonetheless remains the conceptual foundation of the gradient-based extensions that followed.
Grad-CAM: Generalising to Any Architecture
Grad-CAM (Gradient-weighted Class Activation Mapping) removes CAM's architectural constraint by replacing the classifier's weight vector with gradients: it computes the gradient of the class score with respect to the feature maps of any target convolutional layer, global-average-pools those gradients to obtain per-channel importance weights, combines the feature maps with those weights, and applies a ReLU to keep only regions with positive influence on the class. The result is a localisation heatmap applicable to any CNN.
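The following is a minimal PyTorch sketch of that procedure, assuming a CNN `model` and a reference to the target convolutional layer; the function name, signature, and hook-based capture are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Grad-CAM heatmap; `image` is a (1, C, H, W) tensor, `target_layer` a conv module."""
    acts, grads = {}, {}
    # Hooks capture the layer's activations on the forward pass and the
    # gradient flowing into them on the backward pass.
    fwd = target_layer.register_forward_hook(
        lambda mod, inp, out: acts.update(value=out))
    bwd = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.update(value=gout[0]))
    try:
        model.zero_grad()
        score = model(image)[0, class_idx]  # score for the class being explained
        score.backward()
    finally:
        fwd.remove()
        bwd.remove()
    # Per-channel weights: global-average-pool the gradients over space.
    weights = grads["value"].mean(dim=(2, 3), keepdim=True)
    # Weighted combination of feature maps; ReLU keeps positive evidence only.
    cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
    # Upsample to the input resolution and normalise to [0, 1].
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()
```

Overlaying the returned map on the input image gives the standard Grad-CAM visualisation.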
Grad-CAM works with VGG, ResNet, Inception, and effectively any architecture with convolutional layers. Grad-CAM++ improves localisation when multiple instances of the same class appear in an image; Score-CAM replaces gradients with activation perturbations for a gradient-free alternative.
For practical explainability in vision tasks, Grad-CAM on the last convolutional layer is the default first tool.
Integrated Gradients
Integrated Gradients (Sundararajan et al., 2017) is a theoretically grounded attribution method applicable to any differentiable model. It attributes the prediction to each input feature by integrating the gradient of the output along a straight-line path from a baseline input (for example, a black image or an all-zero vector) to the actual input; in practice the integral is approximated with a Riemann sum over a few dozen interpolation steps.
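Concretely, the attribution for feature i is (x_i − x'_i) times the average of the partial derivative of the output with respect to x_i along the path from the baseline x' to the input x. Here is a minimal PyTorch sketch of the Riemann-sum approximation; it assumes `model` accepts a batch and returns class scores, and the function name and defaults are ours.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    """Riemann-sum approximation of Integrated Gradients for a single input `x`."""
    # Interpolation path x' + alpha * (x - x') for alpha in [0, 1].
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline.unsqueeze(0) + alphas * (x - baseline).unsqueeze(0)
    path.requires_grad_(True)
    # Gradient of the target class score at every point along the path.
    score = model(path)[:, target].sum()
    grads = torch.autograd.grad(score, path)[0]
    # Average the path gradients and scale by the input-baseline difference.
    return (x - baseline) * grads.mean(dim=0)
```

Increasing `steps` tightens the approximation; a useful sanity check is that the attributions should sum (approximately) to the difference between the model's output at x and at the baseline.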
IG satisfies two important axioms: sensitivity (if changing a feature changes the output, that feature receives non-zero attribution) and implementation invariance (functionally equivalent networks receive identical attributions). These properties make IG attributions principled and comparable across models. For tabular and text inputs, IG provides feature-level attributions that are interpretable and theoretically sound.
LIME: Model-Agnostic Local Explanations
LIME (Local Interpretable Model-agnostic Explanations) explains any model's prediction by fitting a simple interpretable model (typically a weighted sparse linear model) in the neighbourhood of the input being explained. It perturbs the input, observes the model's responses, and fits a surrogate to those responses, producing a local linear approximation whose coefficients show which features drove the specific prediction.
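Here is a deliberately simplified tabular sketch of that core loop using scikit-learn. The official lime package additionally discretises features, uses binary interpretable representations, and selects a sparse subset, so treat this as an illustration of the idea rather than a drop-in replacement; all names and defaults are ours.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, x, n_samples=1000, scale=0.1, kernel_width=0.75):
    """Local linear surrogate around `x` (a 1-D feature vector); returns coefficients."""
    rng = np.random.default_rng(0)
    # Perturb the input with Gaussian noise to probe the model's local behaviour.
    samples = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    preds = predict_fn(samples)  # model's score for the class being explained
    # Weight each perturbation by proximity to x (exponential kernel on L2 distance).
    dist = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Fit the interpretable surrogate; its coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0).fit(samples, preds, sample_weight=weights)
    return surrogate.coef_
```

The kernel width controls how local the explanation is: a narrow kernel fits the model's behaviour right at x, a wide one averages over a larger region.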
LIME is model-agnostic: it works for CNNs, gradient-boosted trees, LLMs, or any model that exposes a prediction function. Its main limitation is instability: because the perturbations are sampled randomly, different runs can produce different explanations for the same input.
SHAP: Game-Theoretic Attribution
SHAP (SHapley Additive exPlanations) uses Shapley values from cooperative game theory to assign each feature a contribution score that satisfies the axioms of efficiency, symmetry, dummy, and additivity. For tree-based models, TreeSHAP computes exact Shapley values efficiently; for deep networks, DeepSHAP provides a fast approximation. SHAP is widely used in financial services, healthcare ML, and any domain requiring rigorous feature attribution.
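A minimal usage sketch with the shap package and scikit-learn follows; the toy data and model are purely illustrative stand-ins for a real feature matrix and fitted model.

```python
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for a real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = GradientBoostingRegressor().fit(X, y)

# TreeSHAP: exact Shapley values for tree ensembles, computed efficiently.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape (n_samples, n_features)

# Additivity in action: base value + per-feature contributions = prediction.
print(model.predict(X[:1])[0])
print(explainer.expected_value + shap_values[0].sum())
```

The additivity check at the end is what makes SHAP attributions "rigorous" in the sense used here: each prediction decomposes exactly into a base value plus one contribution per feature.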
Limitations
Activation maps show where the model looks — not why it looks there or whether what it sees is causally meaningful. A model can look at the right region for the wrong reason. Attribution methods explain the model's computation, not the underlying phenomenon. They are tools for building understanding and catching alignment failures, not substitutes for causal analysis or domain expert review.
Conclusion
Use Grad-CAM for spatial localisation in CNNs, Integrated Gradients for principled feature attribution in differentiable models, LIME for model-agnostic local explanation, and SHAP for rigorous additive attribution. Apply all of them with honest awareness of their limitations.
Keywords: Grad-CAM, activation maps, neural network interpretability, SHAP, LIME, integrated gradients, explainable AI, XAI, feature attribution, model explainability