Machine learning

Model Cards Are Not Optional: Building Transparent ML Systems

khaled · November 27, 2024

In 2019, Google researchers proposed model cards — structured documents that accompany ML models to describe their intended use cases, performance characteristics, and known limitations. At the time, model cards were a research proposal for responsible AI documentation. By 2024, they are required documentation for AI systems deployed in the EU under the AI Act, expected by enterprise buyers as part of procurement due diligence, and considered a minimum standard of responsible deployment by practitioners who have learned the hard way what happens when context about a model's behaviour is not communicated.

What a Model Card Contains

A complete model card addresses six areas:

1. Model details: architecture, training data description, training date, version, license, point of contact.

2. Intended use: the primary use cases the model was designed and evaluated for; explicitly listing out-of-scope uses prevents misapplication.

3. Factors: the groups, instruments, and conditions that affect model performance — demographic groups, geographic regions, data collection conditions, languages.

4. Metrics: the evaluation metrics used and why they were chosen; aggregate metrics alone are insufficient.

5. Evaluation data: description of the evaluation datasets — who collected them, when, under what conditions.

6. Disaggregated evaluation results: performance broken down by the relevant subgroups identified in the Factors section. This is the most important section.
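The six areas map naturally onto a structured record. As a minimal sketch, here is one way to express them in code; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    # 1. Model details
    name: str
    version: str
    architecture: str
    training_data: str          # description of training data sources
    training_date: str
    license: str
    contact: str
    # 2. Intended use (and explicit out-of-scope uses)
    intended_uses: list = field(default_factory=list)
    out_of_scope_uses: list = field(default_factory=list)
    # 3. Factors affecting performance (demographics, regions, languages, ...)
    factors: list = field(default_factory=list)
    # 4. Metrics: name -> aggregate value
    metrics: dict = field(default_factory=dict)
    # 5. Evaluation data description (who collected it, when, how)
    evaluation_data: str = ""
    # 6. Disaggregated results: subgroup -> {metric name -> value}
    disaggregated_results: dict = field(default_factory=dict)
```

Representing the card as data rather than free text makes it easy to validate (e.g. reject a card whose `disaggregated_results` is empty) and to render into whatever format a regulator or buyer expects.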

Why Disaggregated Evaluation Matters

Aggregate accuracy conceals disparities. A facial recognition system with 95% overall accuracy may have 70% accuracy on dark-skinned female faces. A loan default predictor with an overall AUC of 0.82 may perform significantly differently across racial groups. Reporting only aggregate metrics means deployers and affected communities have no way to know these disparities exist.

Disaggregated evaluation requires intentionally testing across relevant subgroups and being transparent about results that show gaps. This is uncomfortable when results reveal bias — which is precisely why it is valuable. A disparity discovered and disclosed by the model developer, before deployment, is an issue that can be addressed. A disparity discovered in production by a journalist is a crisis.
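Computing the breakdown itself is mechanical; the hard part is choosing the right subgroups. A minimal sketch in plain Python (the group labels are hypothetical), showing how a healthy aggregate number can coexist with a poor subgroup number:

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Accuracy per subgroup plus the aggregate, from parallel sequences.

    `groups[i]` is the subgroup label for example i (e.g. a demographic
    category or geographic region identified in the Factors section).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        total["__overall__"] += 1
        if truth == pred:
            correct[group] += 1
            correct["__overall__"] += 1
    return {g: correct[g] / total[g] for g in total}

# Toy data: the model is perfect on the majority group "A" but wrong
# half the time on the small group "B" -- the aggregate hides this.
y_true = [1] * 10
y_pred = [1] * 9 + [0]
groups = ["A"] * 8 + ["B"] * 2
print(disaggregated_accuracy(y_true, y_pred, groups))
# overall accuracy is 0.9, but group "B" accuracy is only 0.5
```

The same pattern extends to any metric that can be computed per example or per slice; for AUC or calibration, compute the metric once per subgroup rather than per example.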

Model Cards as a Communication Tool

Model cards are not just for external audiences — they serve internal teams:

  • Engineering teams integrating the model need to know its input/output format, expected latency, and failure modes
  • Product teams need to know the intended use cases and what the model should not be used for
  • Legal and compliance teams need documented evidence of due diligence for regulatory compliance
  • Downstream teams reusing or fine-tuning the model need to understand what they are starting from

The EU AI Act and Documentation Requirements

The EU AI Act (phased implementation from 2024-2027) classifies AI systems into risk categories with corresponding documentation requirements. High-risk AI systems — including credit scoring, biometric identification, hiring tools, and critical infrastructure applications — must provide technical documentation covering intended purpose, performance measures, data governance, and bias mitigation measures. Model cards are the natural format for much of this documentation.

Even for lower-risk systems, the AI Act's transparency provisions increasingly require disclosure of AI involvement and minimum documentation. Proactive model card adoption now reduces compliance overhead later.

Writing a Model Card in Practice

A minimal viable model card for an internal ML model:

  1. One-paragraph model description: what does the model do, what algorithm, trained on what data
  2. Intended use cases (3-5 bullets): be specific
  3. Out-of-scope uses (3-5 bullets): what this model should not be used for
  4. Overall evaluation metrics with values and evaluation dataset description
  5. Disaggregated evaluation broken down by the most relevant demographic, geographic, or contextual factors
  6. Known limitations: edge cases where performance degrades; known biases; out-of-distribution failure modes
  7. Training data description: sources, date range, known gaps or biases in the training data
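The seven items above can be enforced with a fill-in-the-blanks template. The sketch below renders a Markdown stub and deliberately fails if any section is left out; the section headings and placeholder names are illustrative, not a standard:

```python
# Hypothetical minimal template mirroring the seven-item checklist above.
MINIMAL_CARD = """\
# Model Card: {name} (v{version})

## Description
{description}

## Intended Uses
{intended_uses}

## Out-of-Scope Uses
{out_of_scope_uses}

## Evaluation
Dataset: {eval_dataset}
{overall_metrics}

## Disaggregated Evaluation
{disaggregated_results}

## Known Limitations
{limitations}

## Training Data
{training_data}
"""

def render_card(**fields) -> str:
    """Fill the template; a missing field raises KeyError, forcing completeness."""
    return MINIMAL_CARD.format(**fields)
```

Making omission an error is the point: a card generator that silently skips the disaggregated section reproduces exactly the gap model cards exist to close.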

Conclusion

Model cards are documentation that teams should want to write — not because regulators require it, but because the exercise of writing a model card forces explicit thinking about intended use, evaluation rigour, and failure modes. A team that cannot write a model card for their deployed model either does not know enough about their model to deploy it responsibly, or knows things about it that they prefer not to put in writing. Neither is acceptable.

Keywords: model cards, responsible AI, ML documentation, AI transparency, fairness machine learning, EU AI Act, disaggregated evaluation, AI bias documentation