The Quick Version

A model card is a standardized document that describes what an ML model does, how it was trained, what it’s good at, and where it fails. Think of it as a nutrition label for AI. Google researchers introduced the concept in a 2018 paper (“Model Cards for Model Reporting”), and it’s now a de facto standard — on the Hugging Face Hub, a model repo’s README.md doubles as its model card, and the Hub flags repos that don’t have one.

Here’s a minimal but complete model card using the Hugging Face format:

```markdown
---
license: mit
language: en
tags:
  - text-classification
  - sentiment-analysis
datasets:
  - sst2
  - amazon_reviews
metrics:
  - accuracy
  - f1
model-index:
  - name: sentiment-classifier-v2
    results:
      - task:
          type: text-classification
        dataset:
          name: SST-2
          type: sst2
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.934
          - name: F1
            type: f1
            value: 0.931
---

# Sentiment Classifier v2

## Model Description
Fine-tuned DistilBERT for binary sentiment classification (positive/negative).
Trained on SST-2 and Amazon product reviews (English only).

## Intended Use
- Classifying customer feedback as positive or negative
- Monitoring brand sentiment in social media posts

## Out-of-Scope Use
- Sarcasm detection (not trained for this)
- Non-English text (English only)
- Classifying text longer than 512 tokens

## Limitations and Biases
- Accuracy drops to 0.78 on informal text (slang, abbreviations)
- Tested for gender bias: no significant difference in error rates across gendered language
- Not tested for racial or cultural bias in product review context
```

That YAML frontmatter makes the model discoverable on Hugging Face. The markdown sections tell users exactly what they need to know.
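Because the frontmatter and the prose live in one file, it's handy to split them apart programmatically — for example when linting cards in CI. A minimal sketch (the function name is ours, not a library API):

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Split a model card file into (yaml_frontmatter, markdown_body).

    Hypothetical helper: assumes the card starts with a `---` line.
    """
    if not text.startswith("---"):
        return "", text
    # Split on the first two `---` markers: before, frontmatter, body
    _, front, body = text.split("---", 2)
    return front.strip(), body.strip()
```

Feed the first part to a YAML parser and the second to a markdown linter.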

Generating Model Cards Programmatically

For teams that train models frequently, generate model cards from training metadata instead of writing them by hand.

```python
from datetime import datetime

def generate_model_card(
    model_name: str,
    task: str,
    metrics: dict,
    training_data: list[str],
    limitations: list[str],
    intended_uses: list[str],
    out_of_scope: list[str],
    hyperparameters: dict,
    carbon_kg: float | None = None,
) -> str:
    """Generate a model card markdown from structured metadata."""

    metrics_table = "| Metric | Value |\n|--------|-------|\n"
    for name, value in metrics.items():
        metrics_table += f"| {name} | {value} |\n"

    limitations_list = "\n".join(f"- {item}" for item in limitations)
    uses_list = "\n".join(f"- {item}" for item in intended_uses)
    oos_list = "\n".join(f"- {item}" for item in out_of_scope)
    data_list = "\n".join(f"- {item}" for item in training_data)

    hp_table = "| Parameter | Value |\n|-----------|-------|\n"
    for k, v in hyperparameters.items():
        hp_table += f"| {k} | {v} |\n"

    card = f"""# {model_name}

## Model Description

**Task:** {task}
**Created:** {datetime.now().strftime('%Y-%m-%d')}
**Framework:** PyTorch + Transformers

## Training Data

{data_list}

## Evaluation Results

{metrics_table}

## Intended Use

{uses_list}

## Out-of-Scope Use

{oos_list}

## Limitations and Biases

{limitations_list}

## Training Hyperparameters

{hp_table}
"""

    # `is not None` so a legitimate 0.0 kg estimate is still reported
    if carbon_kg is not None:
        card += f"\n## Environmental Impact\n\nEstimated carbon emissions: {carbon_kg:.2f} kg CO2\n"

    return card

# Generate from your training run
card = generate_model_card(
    model_name="fraud-detector-v3",
    task="Binary classification (fraud/legitimate)",
    metrics={"Accuracy": 0.967, "Precision": 0.891, "Recall": 0.923, "F1": 0.907, "AUC-ROC": 0.984},
    training_data=["Internal transaction logs (Jan 2024 - Dec 2025)", "Synthetic fraud examples (10% of training set)"],
    limitations=[
        "Higher false positive rate on international transactions (12% vs 3% domestic)",
        "Not validated for cryptocurrency transactions",
        "Performance degrades on transactions over $50,000 (underrepresented in training data)",
        "Trained on US/EU transaction patterns only",
    ],
    intended_uses=["Real-time fraud screening for credit card transactions", "Risk scoring for manual review queues"],
    out_of_scope=["Loan approval decisions", "Insurance claim fraud", "Identity verification"],
    hyperparameters={"learning_rate": "2e-5", "batch_size": 256, "epochs": 15, "optimizer": "AdamW"},
    carbon_kg=2.3,
)
print(card)
```

Hook this into your training pipeline so every model checkpoint automatically gets a model card. Store them alongside the model weights in your model registry.
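One way to wire that in — a sketch that writes the generated card as README.md next to the weights (the function name and directory layout are assumptions, not a registry API):

```python
from pathlib import Path

def save_card_with_checkpoint(checkpoint_dir: str, card_markdown: str) -> Path:
    """Write the generated card as README.md alongside the model weights.

    Hypothetical helper — adapt the path layout to your own registry.
    """
    out_dir = Path(checkpoint_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    card_path = out_dir / "README.md"
    card_path.write_text(card_markdown, encoding="utf-8")
    return card_path
```

Calling it right after the checkpoint save keeps the card and the weights versioned together.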

Bias Testing and Reporting

A model card isn’t credible without bias testing results. Here’s how to run and document fairness evaluations:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def fairness_audit(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    sensitive_attr: np.ndarray,
    attr_name: str,
) -> dict:
    """Compute metrics across subgroups of a sensitive attribute."""
    groups = np.unique(sensitive_attr)
    results = {}

    for group in groups:
        mask = sensitive_attr == group
        n = mask.sum()
        acc = accuracy_score(y_true[mask], y_pred[mask])
        f1 = f1_score(y_true[mask], y_pred[mask], zero_division=0)
        # False positives / actual negatives, guarding against an empty denominator
        fpr = ((y_pred[mask] == 1) & (y_true[mask] == 0)).sum() / max((y_true[mask] == 0).sum(), 1)

        results[str(group)] = {
            "n_samples": int(n),
            "accuracy": round(acc, 4),
            "f1": round(f1, 4),
            "false_positive_rate": round(float(fpr), 4),
        }

    # Check for disparate impact
    fprs = [r["false_positive_rate"] for r in results.values()]
    max_disparity = max(fprs) - min(fprs) if fprs else 0

    return {
        "attribute": attr_name,
        "subgroups": results,
        "max_fpr_disparity": round(max_disparity, 4),
        "passes_80_percent_rule": min(fprs) / max(max(fprs), 1e-10) >= 0.8,
    }

# Run the audit (test_labels, predictions, and test_data come from your evaluation run)
audit = fairness_audit(
    y_true=test_labels,
    y_pred=predictions,
    sensitive_attr=test_data["age_group"].values,  # e.g. a column of a pandas DataFrame
    attr_name="age_group",
)

# Format for the model card
print(f"## Fairness Evaluation: {audit['attribute']}")
print(f"\nMax FPR disparity: {audit['max_fpr_disparity']}")
print(f"Passes 80% rule: {'Yes' if audit['passes_80_percent_rule'] else 'NO — review needed'}")
print("\n| Subgroup | Samples | Accuracy | F1 | FPR |")
print("|----------|---------|----------|-----|-----|")
for group, metrics in audit["subgroups"].items():
    print(f"| {group} | {metrics['n_samples']} | {metrics['accuracy']} | {metrics['f1']} | {metrics['false_positive_rate']} |")
```

The 80% rule (four-fifths rule) is a common threshold: the selection rate for any group should be at least 80% of the highest group’s rate. If your model fails this test, document it clearly and explain what mitigation steps you’re taking.
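As a worked example of the arithmetic (the rates below are made up for illustration):

```python
def four_fifths_ratio(rates: dict[str, float]) -> float:
    """Ratio of the lowest group rate to the highest group rate."""
    return min(rates.values()) / max(rates.values())

# Per-group rates (illustrative numbers, not from a real audit)
rates = {"18-30": 0.10, "31-50": 0.08, "51+": 0.12}
ratio = four_fifths_ratio(rates)
print(round(ratio, 3))  # 0.08 / 0.12 ≈ 0.667 — below 0.8, so the check fails
```

A ratio of 0.667 means the lowest-rate group sees the outcome only two-thirds as often as the highest-rate group, well under the four-fifths threshold.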

Pushing Model Cards to Hugging Face

```python
from huggingface_hub import EvalResult, ModelCard, ModelCardData

card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    tags=["text-classification", "sentiment"],
    datasets=["sst2"],
    metrics=["accuracy", "f1"],
    model_name="sentiment-classifier-v2",  # required whenever eval_results is set
    eval_results=[
        EvalResult(
            task_type="text-classification",
            dataset_type="sst2",
            dataset_name="SST-2",
            metric_type="accuracy",
            metric_name="Accuracy",
            metric_value=0.934,
        )
    ],
)

card = ModelCard.from_template(
    card_data,
    model_id="your-org/sentiment-classifier-v2",
    model_description="Fine-tuned DistilBERT for binary sentiment classification.",
    developers="ML Team at YourOrg",
    language="English",
)
# Requires an authenticated session (huggingface-cli login or the HF_TOKEN env var)
card.push_to_hub("your-org/sentiment-classifier-v2")
```

Note that `eval_results` takes a list of `EvalResult` objects, not plain dicts, and `ModelCardData` raises an error if you pass it without also setting `model_name`.

Common Errors and Fixes

“My model card is too long and nobody reads it”

Put the most important information first: what the model does, its biggest limitation, and who should not use it. Technical details (hyperparameters, training infrastructure) go at the bottom. Most users only read the first two sections.

“I don’t know what biases to test for”

Start with the attributes relevant to your use case. For hiring models: gender, race, age. For content moderation: language variety, cultural context. For medical models: demographic groups in your patient population. If you can’t test for a specific attribute because you don’t have the data, say so explicitly.

“Metrics look great in the card but the model fails in production”

Your evaluation set doesn’t match production data. Add a “Known Failure Cases” section with real examples from production where the model failed. Update the card regularly as you discover new failure modes.

Model card metadata doesn’t render on Hugging Face

The YAML frontmatter must be valid YAML between `---` markers at the very top of the file. Validate with `python -c "import yaml; yaml.safe_load(open('README.md').read().split('---')[1])"`.
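If you want friendlier diagnostics than the one-liner, a small structural pre-check can catch the usual culprits before YAML parsing even runs (a sketch; the function name and messages are ours):

```python
def check_front_matter(text: str) -> list[str]:
    """Return a list of structural problems with a card's YAML frontmatter."""
    errors = []
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        errors.append("file must start with '---' on the very first line")
        return errors
    # Find the closing '---' marker
    closing = next((i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"), None)
    if closing is None:
        errors.append("missing closing '---' after the frontmatter")
        return errors
    for i, line in enumerate(lines[1:closing], start=2):
        if "\t" in line:
            errors.append(f"line {i}: YAML forbids tabs for indentation; use spaces")
    return errors
```

Run it in CI and fail the build on a non-empty result.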

What Goes in a Good Model Card

The minimum useful model card has these sections: Model Description (what it does), Intended Use (who should use it and for what), Limitations (where it fails), and Evaluation Results (how you measured quality). Everything else — training data details, hyperparameters, carbon footprint, ethical considerations — adds value but isn’t strictly required.
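That minimum is easy to enforce mechanically, e.g. as a CI check on every card (the section names mirror the examples above; adjust to your own template):

```python
REQUIRED_SECTIONS = ("Model Description", "Intended Use", "Limitations", "Evaluation Results")

def missing_sections(card_markdown: str) -> list[str]:
    """Return the required section headings that the card lacks.

    Substring matching means '## Limitations and Biases' satisfies 'Limitations'.
    """
    return [s for s in REQUIRED_SECTIONS if f"## {s}" not in card_markdown]
```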

The best model cards are honest about failures. A card that says “accuracy drops to 60% on informal text” is more useful than one that only reports the 95% accuracy on the benchmark.