Diffusion models are surprisingly good at generating stylized lettering. The trick is not to generate letters from scratch: render a template letter in a plain font and feed it to the img2img pipeline to restyle. This keeps the letterform recognizable while applying wild visual styles: medieval calligraphy, neon glow, carved stone, whatever you can describe in a prompt.

Here’s the minimal setup to stylize a single letter:

from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image, ImageDraw, ImageFont
import torch

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Create a template letter
img = Image.new("RGB", (512, 512), "black")
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 300)
bbox = draw.textbbox((0, 0), "A", font=font)
x = (512 - (bbox[2] - bbox[0])) // 2 - bbox[0]
y = (512 - (bbox[3] - bbox[1])) // 2 - bbox[1]
draw.text((x, y), "A", fill="white", font=font)

# Stylize with img2img
result = pipe(
    prompt="medieval illuminated manuscript capital letter A, gold leaf, intricate borders, parchment background",
    image=img,
    strength=0.7,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]

result.save("stylized_A.png")

Install the dependencies first: pip install diffusers transformers torch pillow accelerate.

Creating Template Letter Images

The quality of your templates directly affects the output. You want high-contrast, centered letters on a black background. White text on black works best because Stable Diffusion’s img2img pipeline preserves the bright regions as anchor points.

from PIL import Image, ImageDraw, ImageFont
from pathlib import Path


def create_letter_template(
    letter: str,
    size: int = 512,
    font_path: str = "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",
    font_size: int = 300,
) -> Image.Image:
    """Render a single letter centered on a black background."""
    img = Image.new("RGB", (size, size), "black")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)

    bbox = draw.textbbox((0, 0), letter, font=font)
    text_width = bbox[2] - bbox[0]
    text_height = bbox[3] - bbox[1]
    x = (size - text_width) // 2 - bbox[0]
    y = (size - text_height) // 2 - bbox[1]

    draw.text((x, y), letter, fill="white", font=font)
    return img


def generate_alphabet_templates(output_dir: str = "templates") -> dict[str, Image.Image]:
    """Generate template images for A-Z."""
    Path(output_dir).mkdir(exist_ok=True)
    templates = {}
    for char_code in range(65, 91):  # A-Z
        letter = chr(char_code)
        img = create_letter_template(letter)
        img.save(f"{output_dir}/{letter}.png")
        templates[letter] = img
    return templates


templates = generate_alphabet_templates()
print(f"Generated {len(templates)} letter templates")

A few things matter here. The font size of 300 at 512x512 resolution gives the letter enough room to breathe while filling most of the frame. If the letter is too small, the diffusion model treats it as a minor detail and ignores it. Too large and it clips at the edges.
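Before burning GPU time, you can sanity-check template sizing with a quick coverage heuristic. This is my own rule of thumb, not part of the pipeline: aim for the bright glyph region to cover very roughly 15-40% of the frame. The `letter_coverage` helper and the 128 threshold below are illustrative choices (a white square stands in for a glyph so the sketch needs no font file):

```python
from PIL import Image


def letter_coverage(template: Image.Image, threshold: int = 128) -> float:
    """Fraction of pixels brighter than `threshold` -- a rough size check."""
    gray = template.convert("L")
    bright = sum(1 for p in gray.getdata() if p > threshold)
    return bright / (gray.width * gray.height)


# Synthetic template: a white square standing in for a glyph
img = Image.new("RGB", (512, 512), "black")
img.paste((255, 255, 255), (128, 128, 384, 384))  # 256x256 box = 25% of frame
print(round(letter_coverage(img), 2))  # → 0.25
```

Run it on the output of `create_letter_template` and bump `font_size` up or down if the number is far outside that band.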

Bold or heavy fonts work better than thin ones. The thicker strokes give the model more surface area to apply texture and detail. DejaVu Sans Bold ships with most Linux distributions, but any system bold font works fine.
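Since the DejaVu path above is Linux-specific, a small fallback loader makes the scripts portable. `load_bold_font` and the candidate paths are my own additions (common install locations, not guaranteed on any given machine); Pillow's built-in bitmap font is the last resort so the code still runs anywhere:

```python
from PIL import ImageFont

# Common bold-font locations per platform; adjust for your system.
CANDIDATE_FONTS = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf",  # Debian/Ubuntu
    "/usr/share/fonts/dejavu/DejaVuSans-Bold.ttf",           # Fedora
    "/System/Library/Fonts/Supplemental/Arial Bold.ttf",     # macOS
    "C:/Windows/Fonts/arialbd.ttf",                          # Windows
]


def load_bold_font(size: int = 300):
    """Return the first loadable bold font, else Pillow's default bitmap font."""
    for path in CANDIDATE_FONTS:
        try:
            return ImageFont.truetype(path, size)
        except OSError:
            continue
    return ImageFont.load_default()  # tiny bitmap font; fine as a last resort


font = load_bold_font()
```

Note the bitmap fallback ignores the size argument, so templates made with it will need a much smaller canvas or a proper TrueType font installed.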

Stylizing with Img2Img

The strength parameter is the single most important control you have. It determines how much the model deviates from your template:

  • 0.3-0.5: Subtle texture changes. The letter shape stays almost identical. Good for adding surface materials like metal or wood grain.
  • 0.5-0.7: Moderate stylization. The letter is clearly recognizable but gains decorative elements, serifs, or artistic flourishes.
  • 0.7-0.9: Heavy transformation. The basic shape is a suggestion. Works for fantasy scripts and abstract lettering.
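The three tiers above can be captured as a small preset table. `STRENGTH_PRESETS` and `strength_for` are names I'm introducing here for convenience, not anything from diffusers; the values are midpoints of the ranges and worth tuning per style:

```python
# Illustrative presets matching the ranges above; tune per style.
STRENGTH_PRESETS = {
    "texture": 0.4,     # subtle surface materials, shape nearly untouched
    "decorative": 0.6,  # recognizable letter with flourishes
    "fantasy": 0.8,     # heavy transformation, shape is a suggestion
}


def strength_for(effect: str) -> float:
    """Look up a starting strength for a named effect tier."""
    try:
        return STRENGTH_PRESETS[effect]
    except KeyError:
        raise ValueError(
            f"unknown effect {effect!r}; choose from {sorted(STRENGTH_PRESETS)}"
        )


print(strength_for("decorative"))  # → 0.6
```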
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
import torch
from pathlib import Path


def stylize_letter(
    pipe: StableDiffusionImg2ImgPipeline,
    template: Image.Image,
    prompt: str,
    strength: float = 0.65,
    guidance_scale: float = 7.5,
    negative_prompt: str = "blurry, low quality, distorted, multiple letters",
) -> Image.Image:
    """Apply a style to a template letter image."""
    result = pipe(
        prompt=prompt,
        image=template,
        strength=strength,
        guidance_scale=guidance_scale,
        negative_prompt=negative_prompt,
        num_inference_steps=30,
    ).images[0]
    return result


def batch_stylize_alphabet(
    pipe: StableDiffusionImg2ImgPipeline,
    templates: dict[str, Image.Image],
    style_prompt_template: str,
    output_dir: str = "stylized",
    strength: float = 0.65,
) -> None:
    """Stylize every letter in the alphabet with a consistent style."""
    Path(output_dir).mkdir(exist_ok=True)

    for letter, template in templates.items():
        prompt = style_prompt_template.format(letter=letter)
        styled = stylize_letter(pipe, template, prompt, strength=strength)
        styled.save(f"{output_dir}/{letter}.png")
        print(f"Stylized: {letter}")


pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Load previously generated templates
templates = {}
for char_code in range(65, 91):
    letter = chr(char_code)
    templates[letter] = Image.open(f"templates/{letter}.png").convert("RGB")

# Style: Neon glow
batch_stylize_alphabet(
    pipe,
    templates,
    style_prompt_template="neon glowing letter {letter}, bright cyan light, dark background, electric, futuristic sign",
    output_dir="stylized_neon",
    strength=0.6,
)

# Style: Carved stone
batch_stylize_alphabet(
    pipe,
    templates,
    style_prompt_template="carved stone letter {letter}, ancient ruins, weathered granite, moss, photorealistic",
    output_dir="stylized_stone",
    strength=0.7,
)

The negative prompt "blurry, low quality, distorted, multiple letters" is important. Without it, the model sometimes generates extra characters or puts a whole word where you wanted a single letter. Telling it “multiple letters” in the negative prompt cuts that way down.

Prompt engineering for font styles follows a pattern: describe the letter explicitly, then the material or aesthetic, then the background. Keeping the background description simple helps the model focus on the letterform itself.
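That letter/material/background pattern is easy to encode as a tiny helper. `build_letter_prompt` is a name I'm introducing for illustration; it just enforces the ordering described above:

```python
def build_letter_prompt(letter: str, material: str, background: str) -> str:
    """Assemble a prompt as: material/aesthetic, then the letter, then background."""
    return f"{material} letter {letter}, {background} background"


prompt = build_letter_prompt(
    "A", "medieval illuminated manuscript capital", "parchment"
)
print(prompt)  # → medieval illuminated manuscript capital letter A, parchment background
```

Pass the result straight to `stylize_letter`, or use it as the `style_prompt_template` source for `batch_stylize_alphabet`.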

Better Shape Control with ControlNet

Img2img works well for moderate stylization, but if you need the output to closely match the original letter shape – say for an actual usable font – ControlNet with Canny edge detection gives you much tighter control.

The idea: extract edges from your template letter, then feed those edges as a structural constraint to the generation process. The model must follow the edge map while applying the style.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image, ImageDraw, ImageFont
import torch
import numpy as np
import cv2


def extract_canny_edges(image: Image.Image, low: int = 100, high: int = 200) -> Image.Image:
    """Extract Canny edges from a PIL Image."""
    img_array = np.array(image.convert("L"))
    edges = cv2.Canny(img_array, low, high)
    return Image.fromarray(edges).convert("RGB")


# Load ControlNet for Canny edges
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Create template and extract edges
template = Image.new("RGB", (512, 512), "black")
draw = ImageDraw.Draw(template)
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 300)
bbox = draw.textbbox((0, 0), "R", font=font)
x = (512 - (bbox[2] - bbox[0])) // 2 - bbox[0]
y = (512 - (bbox[3] - bbox[1])) // 2 - bbox[1]
draw.text((x, y), "R", fill="white", font=font)

canny_image = extract_canny_edges(template)

result = pipe(
    prompt="ornate art nouveau letter R, gold filigree, floral decorations, vintage poster style",
    image=canny_image,
    negative_prompt="blurry, low quality, multiple letters, distorted shape",
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=0.8,
).images[0]

result.save("controlnet_R.png")

The controlnet_conditioning_scale parameter (0.0 to 1.0) controls how strictly the model follows the edge map. At 0.8, the output closely tracks the letter outline while still allowing decorative details. Drop it to 0.5 for more creative freedom, or push to 1.0 if shape fidelity is critical.

ControlNet adds overhead – expect roughly 30-40% slower generation per image compared to plain img2img. For a full 26-letter alphabet, that’s noticeable. But the shape consistency is dramatically better, which matters if you’re assembling the outputs into an actual font file.

Common Errors and Fixes

RuntimeError: Expected all tensors to be on the same device
You loaded the pipeline on CPU but your input tensor is on GPU, or vice versa. Make sure you call pipe.to("cuda") and pass standard PIL Images as input (the pipeline handles tensor conversion internally).

OSError: Can't load tokenizer for 'stable-diffusion-v1-5/stable-diffusion-v1-5'
You need to accept the model license on Hugging Face and authenticate locally. Run huggingface-cli login and paste your access token. Alternatively, use a model that doesn't require license acceptance, such as stabilityai/stable-diffusion-2-1.

Generated images show multiple letters or random text
Add "multiple letters, words, text, writing" to your negative prompt. Also try increasing guidance_scale to 8-10 so the model follows your single-letter prompt more strictly.

Letters are unrecognizable after stylization
Your strength is too high; drop it to 0.4-0.5. With ControlNet, increase controlnet_conditioning_scale to 0.9 or higher.

ModuleNotFoundError: No module named 'cv2' when using the ControlNet approach
Install OpenCV: pip install opencv-python. The headless build, opencv-python-headless, works too if you don't need GUI features.
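If installing OpenCV is a problem, Pillow's built-in FIND_EDGES filter is a rough stand-in for the Canny step. It's a fixed Laplacian-style kernel with no hysteresis thresholding, so the edges are noisier, but ControlNet is fairly tolerant of that. A sketch (tested on a synthetic square rather than a rendered glyph, to stay font-independent):

```python
from PIL import Image, ImageFilter


def extract_edges_pil(image: Image.Image) -> Image.Image:
    """Approximate Canny edge extraction using only Pillow's FIND_EDGES filter."""
    edges = image.convert("L").filter(ImageFilter.FIND_EDGES)
    return edges.convert("RGB")


# Quick check: a white square on black should yield a bright outline only
img = Image.new("RGB", (64, 64), "black")
img.paste((255, 255, 255), (16, 16, 48, 48))
edges = extract_edges_pil(img)
```

Swap `extract_edges_pil` in for `extract_canny_edges` in the ControlNet script; if shape fidelity suffers, raise controlnet_conditioning_scale slightly to compensate for the coarser edge map.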

Out of memory on GPU
Enable attention slicing to reduce VRAM usage:

pipe.enable_attention_slicing()
# Or for even lower memory:
pipe.enable_sequential_cpu_offload()

This trades speed for memory. Sequential CPU offload can run on GPUs with as little as 4GB VRAM, though generation will be slower.