The Quick Version

Standard Stable Diffusion needs 20-50 denoising steps to produce a good image. Latent Consistency Models (LCM) cut that to 2-4 steps by distilling the diffusion process into a consistency model. The result: image generation in under a second on a consumer GPU.
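To see why so few steps can work, here is a dependency-free toy of multistep consistency sampling on a single scalar "pixel" (conceptual only; the real LCM runs a trained UNet on latents, and `consistency_fn`, the sigma schedule, and `TARGET` are all invented for illustration):

```python
# Toy multistep consistency sampling. A consistency model maps a noisy
# sample at ANY noise level directly to a clean estimate; each extra step
# just re-noises slightly and predicts clean again.
import random

TARGET = 0.7  # stands in for the clean image

def consistency_fn(x, sigma):
    # Fake "trained" model: pulls x most of the way to TARGET,
    # more accurately at lower noise levels.
    return x + (TARGET - x) * (1.0 - 0.3 * sigma)

def lcm_sample(num_steps, sigmas=(1.0, 0.5, 0.25, 0.1)):
    rng = random.Random(0)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    x0 = x
    for sigma in sigmas[:num_steps]:
        x0 = consistency_fn(x, sigma)               # jump straight to a clean estimate
        x = x0 + sigma * 0.1 * rng.gauss(0.0, 1.0)  # re-noise a little for the next step
    return x0

print(lcm_sample(1), lcm_sample(4))  # compare the 1-step and 4-step estimates
```

A standard diffusion sampler must take many small denoising moves; here each step is a full jump to a clean estimate, so a handful of jumps already lands near the target.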

pip install diffusers transformers accelerate torch
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="A futuristic city skyline at golden hour, cinematic lighting",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]

image.save("lcm_output.png")

Four steps. That’s it. On an RTX 3090, this generates a 512x512 image in about 0.3 seconds. Compare that to 8-15 seconds for standard Stable Diffusion with 30 steps.

LCM-LoRA: Any Model, Faster

The standalone LCM model above is locked to Dreamshaper v7. LCM-LoRA is more flexible — it’s a LoRA adapter that makes any SDXL or SD 1.5 model fast. Apply it to your favorite checkpoint and get 2-4 step generation.

from diffusers import DiffusionPipeline, LCMScheduler
import torch

# Load any SDXL model
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Apply LCM-LoRA adapter
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# Switch to LCM scheduler
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="An astronaut riding a horse on Mars, detailed, 8k",
    num_inference_steps=4,
    guidance_scale=1.5,
).images[0]

image.save("lcm_lora_sdxl.png")

Combining with Other LoRAs

LCM-LoRA stacks with style LoRAs. Load both and the model generates in the target style at LCM speed:

# Load base model
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Load LCM-LoRA for speed
pipe.load_lora_weights(
    "latent-consistency/lcm-lora-sdxl",
    adapter_name="lcm",
)

# Load a style LoRA on top
pipe.load_lora_weights(
    "TheLastBen/Papercut_SDXL",
    adapter_name="papercut",
)

# Set weights for each adapter
pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="A paper cut art of a mountain landscape",
    num_inference_steps=4,
    guidance_scale=1.5,
).images[0]

image.save("lcm_papercut.png")

Real-Time Interactive Generation

LCM’s speed enables interactive workflows where the image updates as you type or adjust parameters. Here’s a simple loop that regenerates on prompt changes:

from diffusers import DiffusionPipeline, LCMScheduler
import torch
import time

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16,
).to("cuda")

# Warm up the pipeline (first run is slower due to CUDA compilation)
_ = pipe("warmup", num_inference_steps=2, guidance_scale=1.0)

prompts = [
    "A cat sitting on a windowsill",
    "A cat sitting on a windowsill, rain outside",
    "A cat sitting on a windowsill, rain outside, cozy lighting",
    "A cat sitting on a windowsill, rain outside, cozy lighting, oil painting style",
]

for i, prompt in enumerate(prompts):
    start = time.time()
    image = pipe(
        prompt=prompt,
        num_inference_steps=4,
        guidance_scale=1.0,
        width=512,
        height=512,
    ).images[0]
    elapsed = time.time() - start
    image.save(f"interactive_{i}.png")
    print(f"{elapsed:.2f}s — {prompt}")

For a web app, pair this with WebSockets. The client sends prompt updates, the server generates images with LCM, and streams the results back. At 3-5 FPS, it feels almost real-time.
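That loop translates almost directly to a server. Here is a sketch of the WebSocket side, assuming the `websockets` library's async-iteration API for incoming messages; `generate` is a hypothetical stand-in for the `pipe(...)` call that returns PNG bytes:

```python
# Sketch of the server side of an interactive LCM web app. The client
# sends prompt text over a WebSocket; the server runs the pipeline and
# streams each result back as a base64 data URL.
import asyncio
import base64

def png_to_data_url(png_bytes: bytes) -> str:
    # Wrap raw PNG bytes in a data URL the browser can drop into <img src>.
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")

async def handle_client(websocket, generate):
    # Assumes the `websockets` library: iterating yields incoming messages.
    async for prompt in websocket:
        # Run the blocking GPU call off the event loop so the server
        # stays responsive to new prompt updates.
        png = await asyncio.to_thread(generate, prompt)
        await websocket.send(png_to_data_url(png))
```

`asyncio.to_thread` is the key detail: the GPU call blocks for a few hundred milliseconds, and running it in a worker thread keeps the event loop free to accept the next prompt in the meantime.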

img2img with LCM

LCM also works for image-to-image transformations. Sketch something rough and LCM refines it in a fraction of a second:

from diffusers import AutoPipelineForImage2Image, LCMScheduler
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

sketch = Image.open("rough_sketch.png").convert("RGB").resize((1024, 1024))

image = pipe(
    prompt="A detailed architectural drawing of a modern house",
    image=sketch,
    num_inference_steps=4,
    guidance_scale=1.5,
    strength=0.6,
).images[0]

image.save("refined_sketch.png")

The strength parameter controls how much LCM changes the input. Low values (0.3-0.5) keep close to the original. High values (0.7-0.9) give the model more freedom to reimagine the image.
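One non-obvious interaction, based on the usual convention in diffusers img2img pipelines (worth verifying against your installed version): strength also scales how many of your `num_inference_steps` actually execute, because the input is only noised partway up the schedule. With a 4-step LCM budget, that leaves very few refinement passes:

```python
# Sketch of the strength-to-steps mapping used by typical diffusers
# img2img pipelines (an assumption about the library's convention, not
# pulled from its source).
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # The input is noised to roughly strength * total schedule, so only
    # the tail of the schedule is executed.
    return min(int(num_inference_steps * strength), num_inference_steps)

for s in (0.3, 0.6, 0.9):
    print(f"strength={s}: {effective_steps(4, s)} of 4 steps run")
```

In practice this means that at strength=0.3, a single LCM step does all the refining; if low-strength results look under-processed, raising num_inference_steps compensates.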

Optimizing for Maximum Speed

Stack these optimizations to push generation time even lower:

import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16,
).to("cuda")

# Optimization 1: Compile the UNet with torch.compile
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Optimization 2: Enable memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()

# Optimization 3: Use 2 steps instead of 4 (quality tradeoff)
image = pipe(
    "A landscape photo",
    num_inference_steps=2,
    guidance_scale=1.0,
    width=512,
    height=512,
).images[0]

torch.compile adds a one-time compilation overhead (30-60 seconds) but speeds up every subsequent generation by 20-40%. The reduce-overhead mode is best for repeated calls with the same input shapes.
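When benchmarking, exclude that one-time cost. A small stdlib helper (nothing diffusers-specific) that times only the steady state:

```python
import time

def steady_state_latency(fn, warmup=2, iters=10):
    # Run fn a few times first so compilation and CUDA warmup costs
    # are paid before the clock starts.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Example use against the pipeline above:
# steady_state_latency(lambda: pipe("A landscape photo",
#                                   num_inference_steps=2,
#                                   guidance_scale=1.0).images[0])
```

Without the warmup calls, the compilation overhead gets averaged into the result and makes torch.compile look slower than eager mode.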

With all optimizations on an RTX 4090: ~0.1 seconds for 512x512 at 2 steps. That’s 10 FPS — actual real-time generation.

Quality vs. Speed Tradeoffs

Steps | Time (RTX 3090) | Quality                          | Best For
------|-----------------|----------------------------------|---------------------
1     | ~0.1s           | Low — noisy, missing details     | Previews, thumbnails
2     | ~0.15s          | Medium — coherent but soft       | Interactive drafting
4     | ~0.3s           | Good — close to 20-step SD       | General use
8     | ~0.6s           | Great — nearly indistinguishable | Final output

For most use cases, 4 steps hits the sweet spot. Drop to 2 for interactive tools where speed matters more than perfection. Go to 8 when generating final assets.

Common Errors and Fixes

Images are blurry or washed out

Set guidance_scale between 1.0 and 2.0 for LCM. Unlike standard Stable Diffusion, which uses 7-12, LCM was trained with low guidance. Higher values produce artifacts.

torch.compile fails with errors

Not all operations are compatible with torch.compile. If you hit errors, try mode="default" instead of "reduce-overhead", or skip compilation and rely on xformers alone.
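That fallback chain can be wrapped in a small defensive helper (a sketch, not a diffusers API; `compile_fn` is injected so the pattern is testable without a GPU, and in practice you would pass `torch.compile`):

```python
# Try torch.compile in "reduce-overhead" mode, fall back to "default",
# and finally run eagerly. fullgraph=False is also a relaxation of the
# earlier fullgraph=True, so graph breaks don't abort compilation.
def compile_with_fallback(unet, compile_fn):
    for mode in ("reduce-overhead", "default"):
        try:
            return compile_fn(unet, mode=mode, fullgraph=False)
        except Exception:
            continue  # this mode failed; try the next one
    return unet  # eager fallback: rely on memory-efficient attention instead

# pipe.unet = compile_with_fallback(pipe.unet, torch.compile)
```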

Out of memory on consumer GPUs

Enable model CPU offloading: pipe.enable_model_cpu_offload(). This moves components (text encoder, VAE, UNet) to the CPU when not in use and frees VRAM for the active denoising passes.

LCM-LoRA doesn’t speed things up

Make sure you switched the scheduler to LCMScheduler. Without it, the LoRA weights are loaded but the sampling process still uses the original 20+ step schedule.

Artifacts at SDXL resolution (1024x1024)

LCM-LoRA for SDXL sometimes produces grid-like artifacts at full resolution. Try generating at 768x768 and upscaling, or increase steps to 6-8 for cleaner results.
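A minimal version of the generate-then-upscale workaround, using Pillow's Lanczos resampling as a simple stand-in for a dedicated upscaler (filenames are illustrative):

```python
# Upscale a 768x768 LCM output to 1024x1024 with Pillow.
from PIL import Image

def upscale(img, size=(1024, 1024)):
    # Lanczos is Pillow's highest-quality built-in resampling filter.
    return img.resize(size, Image.LANCZOS)

# upscale(Image.open("lcm_768.png")).save("lcm_1024.png")
```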