Dataset cards are the nutrition labels of machine learning. Without them, you’re handing someone a dataset and saying “good luck figuring out what’s in here.” The problem is writing them by hand is tedious, error-prone, and nobody wants to do it. So you automate it.

Here’s the quick version. Given a pandas DataFrame, this function generates a Markdown dataset card with schema info, summary statistics, and missing value counts:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import pandas as pd
import numpy as np
from datetime import datetime

def generate_basic_card(df: pd.DataFrame, dataset_name: str) -> str:
    """Generate a basic Markdown dataset card from a DataFrame.

    The card contains a generation timestamp, row/column counts, and a
    schema table with per-column dtype, non-null count, and null %.

    Args:
        df: Source DataFrame to describe.
        dataset_name: Human-readable name shown in the card title.

    Returns:
        The complete card as a single Markdown string.
    """

    def _escape(cell: object) -> str:
        # A literal "|" inside a cell breaks Markdown table rendering.
        return str(cell).replace("|", "\\|")

    lines = [
        f"# Dataset Card: {dataset_name}",
        "",
        f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
        f"**Rows:** {len(df):,}",
        f"**Columns:** {len(df.columns)}",
        "",
        "## Schema",
        "",
        "| Column | Type | Non-Null Count | Null % |",
        "|--------|------|----------------|--------|",
    ]
    for col in df.columns:
        dtype = str(df[col].dtype)
        non_null = df[col].notna().sum()
        # Mean of the boolean null mask is the null fraction; on a 0-row
        # frame that mean is NaN, so guard to render "0.0%" instead of "nan%".
        null_pct = df[col].isna().mean() * 100 if len(df) else 0.0
        lines.append(f"| {_escape(col)} | {dtype} | {non_null:,} | {null_pct:.1f}% |")

    return "\n".join(lines)

That gets you started. Now we’ll build something more complete.

Setting Up Sample Data

You need a realistic DataFrame to test against. We’ll create one that mimics a hiring dataset with demographic columns, numeric scores, and some missing values baked in:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd
import numpy as np

# Fixed seed so the sample data (and injected missingness) is reproducible.
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    "applicant_id": range(1, n + 1),
    # Cast to float up front: NaN is injected into this column below, and
    # assigning NaN into an int64 column via .loc warns in pandas 2.x
    # ("incompatible dtype") and raises in pandas 3.0.
    "age": np.random.randint(22, 65, size=n).astype(float),
    "gender": np.random.choice(["male", "female", "non-binary"], size=n, p=[0.45, 0.45, 0.10]),
    "ethnicity": np.random.choice(
        ["white", "black", "hispanic", "asian", "other"],
        size=n, p=[0.5, 0.2, 0.15, 0.1, 0.05]
    ),
    "years_experience": np.random.exponential(5, size=n).round(1),
    "interview_score": np.random.normal(70, 15, size=n).round(1),
    "hired": np.random.choice([0, 1], size=n, p=[0.7, 0.3]),
})

# Inject some missing values (~3% of ages, ~5% of interview scores)
mask_age = np.random.random(n) < 0.03
mask_score = np.random.random(n) < 0.05
df.loc[mask_age, "age"] = np.nan
df.loc[mask_score, "interview_score"] = np.nan

This gives you a DataFrame with mixed types, demographic columns for bias checks, and realistic missingness patterns.

Building the Markdown Template Engine

The generator needs to produce several sections: metadata, schema, statistics for numeric columns, and value distributions for categorical columns. Here’s the full template engine:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
from datetime import datetime
from typing import Optional

def compute_statistics_section(df: pd.DataFrame) -> str:
    """Render descriptive statistics for numeric columns as a Markdown table.

    Returns a placeholder sentence when the DataFrame has no numeric columns.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if not numeric_cols:
        return "_No numeric columns found._"

    # describe().T gives one row of stats per column, keyed by stat name.
    summary = df[numeric_cols].describe().T
    stat_keys = ("mean", "std", "min", "25%", "50%", "75%", "max")

    header = [
        "| Column | Mean | Std | Min | 25% | 50% | 75% | Max |",
        "|--------|------|-----|-----|-----|-----|-----|-----|",
    ]
    body = [
        "| " + col + " | "
        + " | ".join(f"{summary.loc[col][key]:.2f}" for key in stat_keys)
        + " |"
        for col in numeric_cols
    ]
    return "\n".join(header + body)


def compute_categorical_section(df: pd.DataFrame, max_unique: int = 20) -> str:
    """Render value-distribution tables for object/category columns.

    Columns with more than ``max_unique`` distinct values are summarized
    with a one-line note instead of a full table.
    """
    cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
    if not cat_cols:
        return "_No categorical columns found._"

    sections = []
    for col in cat_cols:
        distinct = df[col].nunique()
        if distinct > max_unique:
            # High-cardinality columns would produce unwieldy tables.
            sections.append(f"### {col}\n\n_{distinct} unique values (too many to display)._")
        else:
            freq = df[col].value_counts()
            denom = freq.sum()
            rows = [
                f"| {value} | {cnt:,} | {cnt / denom * 100:.1f}% |"
                for value, cnt in freq.items()
            ]
            sections.append("\n".join([
                f"### {col}",
                "",
                "| Value | Count | Percentage |",
                "|-------|-------|------------|",
                *rows,
            ]))

    return "\n\n".join(sections)


def generate_dataset_card(
    df: pd.DataFrame,
    dataset_name: str,
    description: Optional[str] = None,
    license_info: str = "Not specified",
) -> str:
    """Generate a full dataset card as Markdown.

    Sections: header/metadata, overview metrics, schema table, numeric
    statistics, and categorical distributions.

    Args:
        df: DataFrame to document.
        dataset_name: Title shown at the top of the card.
        description: Optional free-text description; a placeholder is
            substituted when omitted.
        license_info: License string shown in the header.

    Returns:
        The complete card as a Markdown string.
    """
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    total_nulls = df.isna().sum().sum()
    total_cells = df.shape[0] * df.shape[1]
    # An empty DataFrame has zero cells; guard against ZeroDivisionError.
    overall_null_pct = total_nulls / total_cells * 100 if total_cells else 0.0

    # Schema table
    schema_lines = [
        "| Column | Type | Non-Null | Null % |",
        "|--------|------|----------|--------|",
    ]
    for col in df.columns:
        dtype = str(df[col].dtype)
        non_null = df[col].notna().sum()
        # Guard the per-column mean for the same 0-row reason as above.
        null_pct = df[col].isna().mean() * 100 if len(df) else 0.0
        schema_lines.append(f"| {col} | {dtype} | {non_null:,} | {null_pct:.1f}% |")

    card = f"""# Dataset Card: {dataset_name}

**Generated:** {now}
**License:** {license_info}

{description or "No description provided."}

## Overview

| Metric | Value |
|--------|-------|
| Rows | {len(df):,} |
| Columns | {len(df.columns)} |
| Total missing values | {total_nulls:,} ({overall_null_pct:.1f}%) |
| Memory usage | {df.memory_usage(deep=True).sum() / 1024:.1f} KB |

## Schema

{chr(10).join(schema_lines)}

## Numeric Statistics

{compute_statistics_section(df)}

## Categorical Distributions

{compute_categorical_section(df)}
"""
    return card

Call it like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# Generate the card for the sample hiring DataFrame built earlier.
card_md = generate_dataset_card(
    df,
    dataset_name="Hiring Pipeline Dataset v2",
    description="Applicant records from Q3-Q4 2025 hiring pipeline. Contains demographic info, experience, interview scores, and outcomes.",
    license_info="Internal use only",
)

# Persist the Markdown so it can be committed alongside the dataset.
with open("dataset_card.md", "w") as f:
    f.write(card_md)

print(f"Card written: {len(card_md)} characters")

That writes a complete, formatted Markdown file you can drop into a repo or attach to a Hugging Face dataset.

Adding Bias and Fairness Metrics

A dataset card without bias analysis is incomplete. If your dataset has demographic columns and an outcome column, you should compute selection rates and disparate impact ratios:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
def compute_bias_metrics(
    df: pd.DataFrame,
    outcome_col: str,
    demographic_cols: list[str],
) -> str:
    """Compute selection rates and disparate impact for demographic groups.

    For each demographic column, reports every group's mean outcome
    (selection rate) and its ratio to the highest group rate. Ratios
    below 0.8 are flagged per the EEOC four-fifths rule.

    Args:
        df: DataFrame with the outcome and demographic columns.
        outcome_col: Binary (0/1) outcome column, e.g. "hired".
        demographic_cols: Demographic columns to analyze.

    Returns:
        A Markdown section (header plus one table per demographic column).
    """
    header = "## Bias and Fairness Analysis\n\n"

    # Fail soft on a missing outcome column (consistent with the class-based
    # generator) instead of letting groupby raise KeyError.
    if outcome_col not in df.columns:
        return header + f"_Outcome column '{outcome_col}' not found in DataFrame._"

    sections = []

    for demo_col in demographic_cols:
        if demo_col not in df.columns:
            sections.append(f"### {demo_col}\n\n_Column not found in DataFrame._")
            continue

        grouped = df.groupby(demo_col)[outcome_col].mean()
        max_rate = grouped.max()

        lines = [
            f"### {demo_col} vs {outcome_col}",
            "",
            "| Group | Selection Rate | Disparate Impact Ratio |",
            "|-------|---------------|----------------------|",
        ]

        for group, rate in grouped.items():
            # Ratio vs. the best-performing group; degenerate all-zero
            # outcomes map to 0 rather than dividing by zero.
            di_ratio = rate / max_rate if max_rate > 0 else 0
            flag = " **" if di_ratio < 0.8 else ""
            lines.append(f"| {group} | {rate:.3f} | {di_ratio:.3f}{flag} |")

        lines.append("")
        lines.append(
            "_Disparate impact ratio below 0.8 may indicate adverse impact "
            "(per the 4/5ths rule). Values marked with ** fall below this threshold._"
        )
        sections.append("\n".join(lines))

    return header + "\n\n".join(sections)


# Run it on the sample data: `df` is the hiring DataFrame built earlier,
# with "hired" as the binary outcome and two demographic columns.
bias_section = compute_bias_metrics(
    df,
    outcome_col="hired",
    demographic_cols=["gender", "ethnicity"],
)
print(bias_section)

The disparate impact ratio compares each group’s selection rate to the highest-performing group. A ratio below 0.8 is the standard threshold from the EEOC’s four-fifths rule. This doesn’t prove discrimination, but it tells you where to look.

To attach this to your card, append it to the output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Build the base card, then append the bias section to the same document.
full_card = generate_dataset_card(
    df,
    dataset_name="Hiring Pipeline Dataset v2",
    description="Applicant records with demographic data and hiring outcomes.",
)

full_card += "\n" + compute_bias_metrics(
    df, outcome_col="hired", demographic_cols=["gender", "ethnicity"]
)

# Overwrite the card file with the combined document.
with open("dataset_card.md", "w") as f:
    f.write(full_card)

Putting It All Together

Here’s the complete generator class that wraps everything into a single interface:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
import pandas as pd
import numpy as np
from datetime import datetime
from typing import Optional
from pathlib import Path


class DatasetCardGenerator:
    """Generate a Markdown dataset card for a pandas DataFrame.

    Sections: header/metadata, overview metrics, schema, numeric stats,
    categorical distributions, and (optionally) a bias/fairness analysis.

    Attributes:
        df: The DataFrame being documented.
        name: Dataset title shown in the card header.
        description: Free-text description (may be empty).
        license_info: License string shown in the header.
    """

    def __init__(
        self,
        df: pd.DataFrame,
        name: str,
        description: str = "",
        license_info: str = "Not specified",
    ):
        self.df = df
        self.name = name
        self.description = description
        self.license_info = license_info

    def _schema_table(self) -> str:
        """Markdown table of column name, dtype, non-null count, and null %."""
        lines = [
            "| Column | Type | Non-Null | Null % |",
            "|--------|------|----------|--------|",
        ]
        for col in self.df.columns:
            non_null = self.df[col].notna().sum()
            # Mean of the boolean null mask is the null fraction; on a
            # 0-row frame it is NaN, so guard to avoid printing "nan%".
            null_pct = self.df[col].isna().mean() * 100 if len(self.df) else 0.0
            lines.append(f"| {col} | {self.df[col].dtype} | {non_null:,} | {null_pct:.1f}% |")
        return "\n".join(lines)

    def _numeric_stats(self) -> str:
        """Markdown table of mean/std/min/median/max for numeric columns."""
        num_cols = self.df.select_dtypes(include=[np.number]).columns
        if len(num_cols) == 0:
            return "_No numeric columns._"
        desc = self.df[num_cols].describe().T
        lines = [
            "| Column | Mean | Std | Min | Median | Max |",
            "|--------|------|-----|-----|--------|-----|",
        ]
        for col in num_cols:
            r = desc.loc[col]
            lines.append(
                f"| {col} | {r['mean']:.2f} | {r['std']:.2f} | "
                f"{r['min']:.2f} | {r['50%']:.2f} | {r['max']:.2f} |"
            )
        return "\n".join(lines)

    def _categorical_distributions(self, max_unique: int = 20) -> str:
        """Distribution tables for object/category columns.

        Columns with more than ``max_unique`` distinct values get a
        one-line summary instead of a table.
        """
        cat_cols = self.df.select_dtypes(include=["object", "category"]).columns
        if len(cat_cols) == 0:
            return "_No categorical columns._"
        parts = []
        for col in cat_cols:
            # Compute once instead of twice (nunique scans the column).
            n_unique = self.df[col].nunique()
            if n_unique > max_unique:
                parts.append(f"**{col}**: {n_unique} unique values")
                continue
            counts = self.df[col].value_counts()
            total = counts.sum()
            lines = [f"**{col}**", "", "| Value | Count | % |", "|-------|-------|---|"]
            for val, cnt in counts.items():
                lines.append(f"| {val} | {cnt} | {cnt/total*100:.1f}% |")
            parts.append("\n".join(lines))
        return "\n\n".join(parts)

    def _bias_analysis(
        self, outcome_col: str, demographic_cols: list[str]
    ) -> str:
        """Selection rates and disparate-impact ratios per demographic column.

        Fails soft (returns a message) when columns are missing.
        """
        if outcome_col not in self.df.columns:
            return f"_Outcome column '{outcome_col}' not found._"
        parts = []
        for col in demographic_cols:
            if col not in self.df.columns:
                continue
            rates = self.df.groupby(col)[outcome_col].mean()
            max_rate = rates.max()
            lines = [
                f"**{col}**", "",
                "| Group | Rate | DI Ratio |",
                "|-------|------|----------|",
            ]
            for grp, rate in rates.items():
                # Ratio vs. the best-performing group; all-zero outcomes
                # map to 0.0 rather than dividing by zero.
                di = rate / max_rate if max_rate > 0 else 0.0
                lines.append(f"| {grp} | {rate:.3f} | {di:.3f} |")
            parts.append("\n".join(lines))
        return "\n\n".join(parts) if parts else "_No demographic columns found._"

    def generate(
        self,
        outcome_col: Optional[str] = None,
        demographic_cols: Optional[list[str]] = None,
    ) -> str:
        """Assemble the full card; bias section only if both args are given.

        Args:
            outcome_col: Binary outcome column for the bias analysis.
            demographic_cols: Demographic columns for the bias analysis.

        Returns:
            The complete card as a Markdown string.
        """
        now = datetime.now().strftime("%Y-%m-%d")
        total_nulls = self.df.isna().sum().sum()
        total_cells = self.df.shape[0] * self.df.shape[1]
        # An empty DataFrame has zero cells; guard the percentage below
        # against ZeroDivisionError.
        null_pct = total_nulls / total_cells * 100 if total_cells else 0.0

        card = f"""# Dataset Card: {self.name}

**Generated:** {now} | **License:** {self.license_info}

{self.description}

## Overview

- **Rows:** {len(self.df):,}
- **Columns:** {len(self.df.columns)}
- **Missing values:** {total_nulls:,} ({null_pct:.1f}% of all cells)
- **Memory:** {self.df.memory_usage(deep=True).sum() / 1024:.1f} KB

## Schema

{self._schema_table()}

## Numeric Statistics

{self._numeric_stats()}

## Categorical Distributions

{self._categorical_distributions()}
"""
        if outcome_col and demographic_cols:
            card += f"""
## Bias and Fairness

{self._bias_analysis(outcome_col, demographic_cols)}

_Disparate impact ratio below 0.8 may indicate adverse impact (4/5ths rule)._
"""
        return card

    def save(self, path: str, **kwargs) -> None:
        """Generate the card and write it to *path* as UTF-8.

        Extra keyword arguments are forwarded to :meth:`generate`.
        """
        content = self.generate(**kwargs)
        # Explicit encoding: the platform default is not UTF-8 everywhere.
        Path(path).write_text(content, encoding="utf-8")
        print(f"Saved dataset card to {path} ({len(content):,} chars)")


# Usage: build a generator for the sample df and write the card to disk,
# including the bias section for the two demographic columns.
generator = DatasetCardGenerator(
    df=df,
    name="Hiring Pipeline Dataset v2",
    description="Applicant records from 2025 hiring cycles with demographic data and outcomes.",
    license_info="CC-BY-4.0",
)

generator.save(
    "dataset_card.md",
    outcome_col="hired",
    demographic_cols=["gender", "ethnicity"],
)

Run this and you get a clean Markdown file with every section a reviewer needs: schema, stats, distributions, and bias flags. Drop it in your repo root or push it alongside a Hugging Face dataset.

Common Errors and Fixes

TypeError: Cannot use .astype() to convert to float when computing stats

This happens when a column looks numeric but contains strings like “N/A” or “-”. Fix it before passing to the generator:

1
# Coerce non-numeric placeholders (e.g. "N/A", "-") to NaN instead of raising.
df["interview_score"] = pd.to_numeric(df["interview_score"], errors="coerce")

KeyError on the outcome column during bias analysis

Double-check the column name. Pandas is case-sensitive. Use df.columns.tolist() to see exact names:

1
2
# Inspect the exact column names (pandas lookups are case-sensitive):
print(df.columns.tolist())
# ['applicant_id', 'age', 'gender', 'ethnicity', 'years_experience', 'interview_score', 'hired']

Empty categorical distributions section

This means all your string columns have more unique values than max_unique (default 20). Either raise the threshold or convert high-cardinality columns to categories before generating:

1
2
# Show distribution for columns with up to 50 unique values
# (call on the DatasetCardGenerator instance created earlier; the
# original snippet referenced an undefined name `gen`)
generator._categorical_distributions(max_unique=50)

Memory usage shows 0 KB for large DataFrames

You’re probably using a view instead of a copy. The memory_usage(deep=True) call needs to traverse the actual data. If you sliced the DataFrame earlier, make a copy:

1
# Materialize an independent copy of the selected columns.
df_subset = df[["age", "gender", "hired"]].copy()

Markdown tables render broken in GitHub/HF

This usually means a pipe character | exists inside a cell value. Sanitize values before building the table:

1
# Escape literal pipes so cell values cannot break the Markdown table.
sanitized = str(val).replace("|", "\\|")

The generator above handles the standard case. For production use, add a sanitization step in each table-building method. You’ll also want to hook this into your CI pipeline so dataset cards get regenerated whenever the underlying data changes.