Real-ESRGAN 4x Upscaling in 10 Lines

Real-ESRGAN is one of the best general-purpose image upscalers available right now. It handles photos, illustrations, compressed JPEGs, and even anime art via dedicated model weights. Here is how to get a 4x upscale running with OpenCV.

pip install realesrgan opencv-python-headless torch torchvision basicsr

import cv2
import numpy as np
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# Build the RRDB network for 4x upscaling
net = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)

upscaler = RealESRGANer(
    scale=4,
    model_path="https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth",
    model=net,
    tile=0,
    tile_pad=10,
    pre_pad=0,
    half=True,
)

# Load with OpenCV (BGR format, which RealESRGANer expects)
img_bgr = cv2.imread("input_photo.jpg", cv2.IMREAD_UNCHANGED)
output_bgr, _ = upscaler.enhance(img_bgr, outscale=4)
cv2.imwrite("output_4x.png", output_bgr)

h, w = img_bgr.shape[:2]
oh, ow = output_bgr.shape[:2]
print(f"Upscaled {w}x{h} -> {ow}x{oh}")
# Upscaled 512x384 -> 2048x1536

The model_path accepts both URLs and local file paths. On first run it downloads the weights (~67MB) to a local cache. After that, inference on a 512x512 image takes about 0.5 seconds on an RTX 3060.

One thing to note: RealESRGANer works natively with OpenCV BGR arrays. If you pass RGB data from PIL, the colors get swapped. Stick with cv2.imread to avoid headaches.

2x vs 4x Upscaling with Different Weights

Real-ESRGAN ships several pre-trained weight files tuned for different scale factors and content types. The 2x model (RealESRGAN_x2plus) is faster and works well when you only need a moderate size bump.

import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

def create_upscaler(scale: int = 4, content_type: str = "photo") -> RealESRGANer:
    """Create an upscaler for a given scale factor and content type."""
    weight_map = {
        (4, "photo"): {
            "url": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth",
            "num_block": 23,
        },
        (4, "anime"): {
            "url": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth",
            "num_block": 6,
        },
        (2, "photo"): {
            "url": "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth",
            "num_block": 23,
        },
    }

    key = (scale, content_type)
    if key not in weight_map:
        raise ValueError(f"No weights for scale={scale}, content_type={content_type}")

    config = weight_map[key]
    net = RRDBNet(
        num_in_ch=3, num_out_ch=3, num_feat=64,
        num_block=config["num_block"], num_grow_ch=32, scale=scale,
    )
    return RealESRGANer(
        scale=scale,
        model_path=config["url"],
        model=net,
        tile=0,
        tile_pad=10,
        pre_pad=0,
        half=True,
    )

# 4x general photo upscaling
upscaler_4x = create_upscaler(scale=4, content_type="photo")

# 4x anime/illustration upscaling (6-block model, faster)
upscaler_anime = create_upscaler(scale=4, content_type="anime")

# 2x photo upscaling (for moderate enlargement)
upscaler_2x = create_upscaler(scale=2, content_type="photo")

img = cv2.imread("sample.jpg", cv2.IMREAD_UNCHANGED)
result_4x, _ = upscaler_4x.enhance(img, outscale=4)
result_anime, _ = upscaler_anime.enhance(img, outscale=4)
result_2x, _ = upscaler_2x.enhance(img, outscale=2)

cv2.imwrite("result_4x_photo.png", result_4x)
cv2.imwrite("result_4x_anime.png", result_anime)
cv2.imwrite("result_2x_photo.png", result_2x)

The anime model (RealESRGAN_x4plus_anime_6B) uses only 6 RRDB blocks instead of 23. That makes it roughly 3x faster while producing cleaner lines on drawn content. For photographs, always use the full 23-block model – the extra parameters handle complex textures like hair, fabric, and foliage much better.

Batch Processing a Directory

Processing an entire folder of images is the common production use case. This function walks a directory, upscales every image, and tracks timing stats.

import cv2
import time
from pathlib import Path
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

def batch_upscale(input_dir: str, output_dir: str, scale: int = 4, tile: int = 0) -> dict:
    """Upscale all images in a directory."""
    net = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=scale)
    upscaler = RealESRGANer(
        scale=scale,
        model_path="https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth",
        model=net,
        tile=tile,
        tile_pad=10,
        pre_pad=0,
        half=True,
    )

    in_path = Path(input_dir)
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    extensions = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff"}
    image_files = sorted(f for f in in_path.iterdir() if f.suffix.lower() in extensions)

    stats = {"processed": 0, "failed": 0, "total_seconds": 0.0}

    for img_file in image_files:
        try:
            start = time.monotonic()
            img = cv2.imread(str(img_file), cv2.IMREAD_UNCHANGED)
            if img is None:
                raise ValueError(f"OpenCV failed to read {img_file.name}")

            output, _ = upscaler.enhance(img, outscale=scale)
            out_file = out_path / f"{img_file.stem}_{scale}x.png"
            cv2.imwrite(str(out_file), output)

            elapsed = time.monotonic() - start
            stats["processed"] += 1
            stats["total_seconds"] += elapsed
            print(f"  OK: {img_file.name} -> {out_file.name} ({elapsed:.1f}s)")
        except Exception as e:
            stats["failed"] += 1
            print(f"  FAIL: {img_file.name}: {e}")

    avg = stats["total_seconds"] / max(stats["processed"], 1)
    print(f"\nFinished: {stats['processed']} ok, {stats['failed']} failed, avg {avg:.1f}s/image")
    return stats

# Upscale everything in ./photos/ with tiling for safety on large images
batch_upscale("./photos/", "./photos_upscaled/", scale=4, tile=512)

Setting tile=512 for batch jobs is a smart default. You never know when one image in the folder will be 4000x3000 and blow up your GPU memory. The tile overhead is minimal on smaller images anyway.
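To get a feel for how much extra work tiling adds, the tile count is just a ceiling division in each dimension. This is a hypothetical helper (not part of the realesrgan API) that sketches the arithmetic:

```python
import math

def tile_count(height: int, width: int, tile: int) -> int:
    """Number of tiles a RealESRGANer-style tiled pass would process.

    Hypothetical helper for estimating batch cost; tile=0 means the
    whole image goes through the network in a single pass.
    """
    if tile <= 0:
        return 1
    return math.ceil(height / tile) * math.ceil(width / tile)

# A 4000x3000 image with tile=512 splits into 6 rows x 8 cols = 48 tiles
print(tile_count(3000, 4000, 512))  # 48
# A small image still fits in one tile, so the overhead is negligible
print(tile_count(384, 512, 512))    # 1
```

This is why the flat tile=512 default is cheap: images smaller than the tile size still run as a single pass.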

Face Enhancement with GFPGAN

Real-ESRGAN is great at general content, but faces need special treatment. GFPGAN is a face restoration model that works alongside Real-ESRGAN. It detects faces, enhances them with a dedicated GAN, and composites them back into the upscaled image.

pip install gfpgan

import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer
from gfpgan import GFPGANer

# Set up the background upscaler first
net = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)
bg_upscaler = RealESRGANer(
    scale=4,
    model_path="https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth",
    model=net,
    tile=512,
    tile_pad=10,
    pre_pad=0,
    half=True,
)

# Set up GFPGAN face enhancer with the background upscaler
face_enhancer = GFPGANer(
    model_path="https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth",
    upscale=4,
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=bg_upscaler,
)

img = cv2.imread("group_photo.jpg", cv2.IMREAD_UNCHANGED)

# enhance() returns: cropped_faces, restored_faces, full_output
cropped_faces, restored_faces, output = face_enhancer.enhance(
    img,
    has_aligned=False,
    only_center_face=False,
    paste_back=True,
)

cv2.imwrite("group_photo_enhanced.png", output)
print(f"Detected and enhanced {len(restored_faces)} face(s)")
# Detected and enhanced 3 face(s)

# Save individual restored faces for inspection
for i, face in enumerate(restored_faces):
    cv2.imwrite(f"face_{i}.png", face)

The paste_back=True parameter is critical. Without it, you only get the cropped face regions. With it, GFPGAN composites the restored faces back into the Real-ESRGAN-upscaled background, giving you a complete enhanced image.

Set only_center_face=False for group photos so it processes every detected face. For portraits where you only care about the main subject, set it to True for faster processing.

SwinIR: Transformer-Based Upscaling

SwinIR uses a Swin Transformer backbone instead of the RRDB convolutional network in Real-ESRGAN. It produces slightly sharper results on certain content types, especially textures with repeating patterns. The tradeoff is speed – SwinIR runs about 3-4x slower than Real-ESRGAN on the same hardware.

You can run SwinIR through the official repository code. Clone it and use the pre-trained weights directly.

pip install timm
git clone https://github.com/JingyunLiang/SwinIR.git
cd SwinIR

import cv2
import numpy as np
import torch
from pathlib import Path
import sys

# Add SwinIR to path (after cloning the repo)
sys.path.insert(0, str(Path("./SwinIR")))
from models.network_swinir import SwinIR as SwinIRModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load SwinIR model for real-world 4x super resolution
model = SwinIRModel(
    upscale=4,
    in_chans=3,
    img_size=64,
    window_size=8,
    img_range=1.0,
    depths=[6, 6, 6, 6, 6, 6],
    embed_dim=180,
    num_heads=[6, 6, 6, 6, 6, 6],
    mlp_ratio=2,
    upsampler="nearest+conv",
    resi_connection="1conv",
)

# Load pre-trained weights
weights_path = "003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth"
state_dict = torch.load(weights_path, map_location=device)
if "params_ema" in state_dict:
    model.load_state_dict(state_dict["params_ema"], strict=True)
elif "params" in state_dict:
    model.load_state_dict(state_dict["params"], strict=True)
else:
    model.load_state_dict(state_dict, strict=True)
model = model.to(device).eval()

# Read and preprocess image
img_bgr = cv2.imread("input.jpg", cv2.IMREAD_COLOR)
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
img_tensor = torch.from_numpy(img_rgb).permute(2, 0, 1).unsqueeze(0).to(device)

# Pad to multiple of window_size (8)
_, _, h, w = img_tensor.shape
pad_h = (8 - h % 8) % 8
pad_w = (8 - w % 8) % 8
img_tensor = torch.nn.functional.pad(img_tensor, (0, pad_w, 0, pad_h), mode="reflect")

# Run inference
with torch.no_grad():
    output_tensor = model(img_tensor)

# Remove padding and convert back
output_tensor = output_tensor[:, :, :h * 4, :w * 4]
output_np = output_tensor.squeeze(0).permute(1, 2, 0).cpu().clamp(0, 1).numpy()
output_bgr = cv2.cvtColor((output_np * 255).astype(np.uint8), cv2.COLOR_RGB2BGR)
cv2.imwrite("output_swinir_4x.png", output_bgr)

print(f"SwinIR upscaled {w}x{h} -> {w*4}x{h*4}")

The padding step matters. SwinIR’s window attention mechanism requires input dimensions to be multiples of 8. Without padding, you get a runtime error. After inference, crop the output back to the expected size by multiplying the original dimensions by the scale factor.
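The pad amounts can be factored into a small reusable helper. This mirrors the inline padding arithmetic above; the function name is illustrative, not part of SwinIR:

```python
def pad_amounts(height: int, width: int, multiple: int = 8) -> tuple:
    """Bottom/right padding needed to make both dims multiples of `multiple`.

    Mirrors the inline (multiple - dim % multiple) % multiple logic used
    before SwinIR inference; the outer % handles already-aligned dims.
    """
    pad_h = (multiple - height % multiple) % multiple
    pad_w = (multiple - width % multiple) % multiple
    return pad_h, pad_w

print(pad_amounts(383, 511))  # (1, 1)
print(pad_amounts(384, 512))  # (0, 0) -- already aligned, no padding
```

The trailing `% multiple` is the easy-to-miss part: without it, an already-aligned dimension would get padded by a full window.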

Real-ESRGAN vs SwinIR: When to Use Which

My recommendation: use Real-ESRGAN for 95% of use cases. It is faster, easier to set up, and the quality difference is marginal on most content.

Pick SwinIR when:

  • You are upscaling a small number of high-value images and speed does not matter
  • The content has fine repeating textures (fabrics, architectural details, text)
  • You need the absolute best PSNR/SSIM metrics for a benchmark or paper

Pick Real-ESRGAN when:

  • You need batch processing at any reasonable speed
  • You are building a production API or pipeline
  • The content is mixed (photos, illustrations, screenshots)
  • You want face enhancement (GFPGAN only integrates with Real-ESRGAN)

GPU Memory and Tile Size Tuning

The tile parameter in RealESRGANer controls how much of the image is processed at once. Setting it to 0 means the whole image goes through the network in one pass, which is fastest but uses the most VRAM.

Here are practical tile size recommendations based on GPU memory:

GPU VRAM    Recommended Tile Size    Notes
24GB+       0 (no tiling)            Process full images up to ~4K
8-12GB      512                      Good balance of speed and memory
4-6GB       256                      Slower but fits in budget GPUs
2-4GB       128                      Very slow, consider CPU or cloud

The tile_pad parameter controls overlap between tiles. Keep it at 10-32 pixels. Lower values are faster, higher values reduce stitching artifacts on high-frequency content. For most images, tile_pad=10 is fine. If you see faint grid lines in the output, bump it to 32.

Also set half=True on any CUDA GPU that supports fp16 (anything Pascal or newer). This cuts memory usage roughly in half with negligible quality loss.

import torch

# Check available VRAM and pick tile size automatically
def auto_tile_size() -> int:
    if not torch.cuda.is_available():
        return 128  # CPU fallback, tile small
    vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    if vram_gb >= 20:
        return 0
    elif vram_gb >= 8:
        return 512
    elif vram_gb >= 4:
        return 256
    else:
        return 128

tile = auto_tile_size()
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"Using tile size: {tile} (VRAM: {vram_gb:.1f}GB)")
else:
    print(f"Using tile size: {tile} (CPU mode)")
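The half=True recommendation can be gated the same way. Pascal GPUs report CUDA compute capability 6.x, so a simple tuple comparison decides it; keeping the check as a pure function (fed from torch.cuda.get_device_capability when a GPU is present) makes it easy to test without a GPU. A minimal sketch:

```python
def supports_fp16(capability: tuple) -> bool:
    """True if a CUDA compute capability (major, minor) is Pascal (6.x) or newer.

    Pass torch.cuda.get_device_capability(0) when a GPU is available;
    Python compares the (major, minor) tuples lexicographically.
    """
    return capability >= (6, 0)

# Maxwell (5,2) -> no fp16 benefit; Pascal (6,1) and Ampere (8,6) -> use half=True
print(supports_fp16((5, 2)), supports_fp16((6, 1)), supports_fp16((8, 6)))
# False True True
```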

Common Errors and Fixes

RuntimeError: CUDA out of memory

This is the most frequent issue. Set tile=256 or tile=128 in the RealESRGANer constructor. If you are already using tiling, reduce the tile size further. Make sure half=True is set. As a last resort, move to CPU with half=False (fp16 is not supported on CPU).
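The tile-reduction advice can be automated with a retry ladder. This is a sketch of the pattern, not library behavior: the run_enhance callable is supplied by you, and with RealESRGANer it would need to rebuild the upscaler per attempt, since tile is a constructor argument.

```python
def enhance_with_fallback(run_enhance, img, tiles=(0, 512, 256, 128)):
    """Retry with progressively smaller tile sizes on CUDA OOM errors.

    `run_enhance(img, tile)` is caller-supplied; with RealESRGANer you
    would construct a fresh upscaler for each tile size. Non-OOM errors
    are re-raised immediately.
    """
    last_err = None
    for tile in tiles:
        try:
            return run_enhance(img, tile)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            last_err = e
    raise last_err

# Demo with a fake backend that only fits in memory at tile <= 256
def fake_enhance(img, tile):
    if tile == 0 or tile > 256:
        raise RuntimeError("CUDA out of memory")
    return f"upscaled with tile={tile}"

print(enhance_with_fallback(fake_enhance, "img"))  # upscaled with tile=256
```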

Output colors look wrong (blue/orange swap)

Real-ESRGAN expects BGR input (OpenCV format). If you load images with PIL, they are in RGB. Either use cv2.imread consistently, or convert before passing:

from PIL import Image
import numpy as np
import cv2

pil_img = Image.open("photo.jpg")
rgb_array = np.array(pil_img)
bgr_array = cv2.cvtColor(rgb_array, cv2.COLOR_RGB2BGR)
output, _ = upscaler.enhance(bgr_array, outscale=4)
# Convert back to RGB if saving with PIL
output_rgb = cv2.cvtColor(output, cv2.COLOR_BGR2RGB)
Image.fromarray(output_rgb).save("output.png")

FileNotFoundError for model weights

When you pass a URL as model_path, the library downloads weights to ~/.cache/realesrgan/. If the download is interrupted, you get a corrupt file. Delete the cache directory and try again:

rm -rf ~/.cache/realesrgan/

Alternatively, download the weights manually and pass a local path.
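If you go the manual route, a small stdlib helper can tell you where a cached copy of a given release URL would live, so you can check for (and delete) a corrupt file programmatically. The cache directory here follows the path mentioned above and is an assumption; some realesrgan versions store weights elsewhere (e.g. a local weights/ folder):

```python
from pathlib import Path
from urllib.parse import urlparse

def cached_weight_path(url: str, cache_dir: str = "~/.cache/realesrgan") -> Path:
    """Local path where a downloaded weight file would be cached.

    The default cache_dir is an assumption; adjust it to wherever your
    realesrgan version actually stores downloads.
    """
    filename = Path(urlparse(url).path).name
    return Path(cache_dir).expanduser() / filename

url = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth"
p = cached_weight_path(url)
print(p.name)  # RealESRGAN_x4plus.pth
# If p.exists() but the download was interrupted, p.unlink() and re-run.
```

Once downloaded (or recovered), pass the resulting local path as model_path instead of the URL.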

Grid lines visible in tiled output

Increase tile_pad from 10 to 32 or 64. The padding creates overlap between adjacent tiles that gets blended, and larger overlap produces smoother transitions.

SwinIR RuntimeError: input tensor size mismatch

The input image dimensions must be padded to multiples of the window size (8). The padding code in the SwinIR section above handles this. Make sure you pad before inference and crop the output to the correct size afterward.

GFPGAN does not detect any faces

GFPGAN uses a face detection model internally. If faces are very small (under 64x64 pixels in the input), detection fails. Upscale the image first with Real-ESRGAN at 2x, then run GFPGAN on the upscaled result. Also check that the input is not grayscale – convert to 3-channel BGR first.

Slow inference on CPU

CPU inference for a single 512x512 image takes 30-60 seconds with Real-ESRGAN and 2-3 minutes with SwinIR. Use the anime model (RealESRGAN_x4plus_anime_6B) for faster CPU performance – its 6-block architecture is roughly 3x faster than the 23-block general model. For production workloads, a GPU is not optional.