InternVL3 is one of the best open-weight vision-language models available right now. The 8B variant fits on a single 24 GB GPU (RTX 3090/4090 class), handles high-resolution documents natively via its dynamic tiling approach, and genuinely competes with GPT-4V on document tasks. Here’s how to get it running locally.
## Model Variants and GPU Requirements
Pick your model based on what you have:
| Model | Parameters | Min VRAM (bf16) | Min VRAM (8-bit) | Notes |
|---|---|---|---|---|
| InternVL3-2B | 2B | 8 GB | 5 GB | Fast, good for prototyping |
| InternVL3-8B | 8B | 20 GB | 12 GB | Best single-GPU choice |
| InternVL3-14B | 14B | 32 GB | 18 GB | Stronger reasoning |
| InternVL3-38B | 38B | 2× 80 GB A100 | 1× 80 GB | Near-frontier quality |
| InternVL3-78B | 78B | 3× 80 GB A100 | 2× 80 GB | Full capability |
For document understanding on a single RTX 4090 (24 GB), the 8B model in bf16 is the right call. If you’re on a 16 GB card, use 8-bit quantization.
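If you want to encode these thresholds in a selection script, a small helper can do it. This is a hypothetical convenience function, not part of any InternVL API; the VRAM figures are copied from the table above and are approximate.

```python
# (name, min VRAM bf16, min VRAM 8-bit) — single-GPU variants from the table above
VARIANTS = [
    ("InternVL3-2B", 8, 5),
    ("InternVL3-8B", 20, 12),
    ("InternVL3-14B", 32, 18),
]

def pick_model(vram_gb: float, allow_8bit: bool = True) -> str:
    """Return the largest single-GPU variant that fits in vram_gb."""
    best = None
    for name, bf16_gb, int8_gb in VARIANTS:
        needed = int8_gb if allow_8bit else bf16_gb
        if vram_gb >= needed:
            best = name  # variants are ordered smallest to largest
    return best or "none (even 2B in 8-bit may not fit)"

print(pick_model(24, allow_8bit=False))  # InternVL3-8B
print(pick_model(16))                    # InternVL3-8B
```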
## Installation

```bash
pip install torch torchvision Pillow
pip install "transformers>=4.52.1"
pip install flash-attn --no-build-isolation  # optional but strongly recommended
pip install bitsandbytes  # for 8-bit quantization
```
The `flash-attn` install takes a few minutes and requires a CUDA build environment. If it fails, the model still works; you just lose some throughput. Use `transformers>=4.52.1`; earlier versions have compatibility issues with InternVL3's attention implementation.
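Since flash-attn is optional, one way to keep a single script portable is to gate the flag on whether the package actually imported. A small sketch; the resulting boolean is what you'd pass as `use_flash_attn` in the loading code below.

```python
import importlib.util

def flash_attn_available() -> bool:
    """True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

use_flash = flash_attn_available()
print(f"flash-attn available: {use_flash}")
# later: AutoModel.from_pretrained(..., use_flash_attn=use_flash, ...)
```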
## Loading InternVL3-8B
Here is the complete setup including the image preprocessing utilities InternVL3 requires for its dynamic tiling:
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoTokenizer, AutoModel

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size=448):
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # All (cols, rows) grids whose tile count stays within [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    )
    # Pick the grid whose aspect ratio best matches the image
    best_ratio = min(
        target_ratios,
        key=lambda r: abs(aspect_ratio - r[0] / r[1])
    )
    target_width = image_size * best_ratio[0]
    target_height = image_size * best_ratio[1]
    blocks = best_ratio[0] * best_ratio[1]
    resized = image.resize((target_width, target_height))
    tiles = []
    cols = target_width // image_size
    for i in range(blocks):
        box = (
            (i % cols) * image_size,
            (i // cols) * image_size,
            ((i % cols) + 1) * image_size,
            ((i // cols) + 1) * image_size,
        )
        tiles.append(resized.crop(box))
    # Append a full-image thumbnail so the model also sees global context
    if use_thumbnail and blocks != 1:
        tiles.append(image.resize((image_size, image_size)))
    return tiles

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = torch.stack([transform(t) for t in tiles])
    return pixel_values

# Load model in bf16 on GPU
model_path = "OpenGVLab/InternVL3-8B"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # remove if flash-attn not installed
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=False,
)
```
For 8-bit quantization on a 16 GB card, swap the `from_pretrained` call:
```python
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,  # requires bitsandbytes
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval()
# Note: no .cuda() needed; bitsandbytes handles device placement
```
## Document Understanding Examples

```python
generation_config = dict(max_new_tokens=2048, do_sample=False)

pixel_values = load_image("invoice.jpg", max_num=12).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nExtract all text from this document exactly as it appears, preserving layout.",
    generation_config,
)
print(response)
```
The `<image>` token is required: InternVL3 uses it as a placeholder for the visual input in the prompt. `max_num=12` allows up to 12 tiles for high-resolution documents, which matters for dense text like invoices or contracts.
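To predict how many tiles a given page will produce before running anything on the GPU, the grid-selection logic can be mirrored in a standalone helper. This is a sketch of the same ratio-matching idea used in `dynamic_preprocess`, with an added tie-break toward fewer tiles (the simplified version in this post leaves exact ties, such as square images, to arbitrary set ordering).

```python
def tile_grid(width: int, height: int, max_num: int = 12) -> tuple[int, int]:
    """Return the (cols, rows) tile grid chosen for an image of this size."""
    aspect = width / height
    candidates = {
        (i, j)
        for n in range(1, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i * j <= max_num
    }
    # Primary key: aspect-ratio mismatch; secondary: prefer fewer tiles on ties
    return min(candidates, key=lambda r: (abs(aspect - r[0] / r[1]), r[0] * r[1]))

# An A4 portrait scan (1654×2339 px at 200 dpi) tiles tall, not wide:
cols, rows = tile_grid(1654, 2339)
print(cols, rows, "->", cols * rows, "tiles")  # 2 3 -> 6 tiles
```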
```python
pixel_values = load_image("financial_report.png", max_num=12).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nExtract the table from this image and return it as a CSV with header row.",
    generation_config,
)
print(response)
# Example output: "Date,Revenue,Expenses,Net Profit\n2024-Q1,4200000,3100000,1100000\n..."
```
For tables with merged cells or complex formatting, ask for Markdown instead — it handles those cases more reliably than CSV.
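If you request Markdown but still need CSV downstream, the conversion is easy to do yourself. A minimal sketch, assuming simple pipe tables with no escaped `|` characters inside cells:

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a simple Markdown pipe table to CSV text."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore prose the model may emit around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row (---, :---:, etc.)
        if all(c and set(c) <= set(":- ") for c in cells):
            continue
        rows.append(cells)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

md = """
| Date | Revenue |
|---|---|
| 2024-Q1 | 4200000 |
"""
print(markdown_table_to_csv(md))
```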
## Chart and Graph Reading

```python
pixel_values = load_image("quarterly_chart.png", max_num=6).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nDescribe the trend shown in this chart. List the approximate data points for each period.",
    generation_config,
)
print(response)
```
Reduce `max_num` to 6 for charts: they rarely benefit from the full tiling, and the smaller tile count speeds up inference.
## Multi-Page Document Processing

```python
def process_document_pages(page_paths: list[str]) -> str:
    """Process multiple pages of a document and ask a cross-page question."""
    pixel_values_list = [
        load_image(p, max_num=6).to(torch.bfloat16).cuda()
        for p in page_paths
    ]
    pixel_values = torch.cat(pixel_values_list, dim=0)
    num_patches_list = [pv.size(0) for pv in pixel_values_list]

    # Build multi-image prompt: one <image> placeholder per page
    image_tokens = "\n".join(f"Image-{i+1}: <image>" for i in range(len(page_paths)))
    question = f"{image_tokens}\nSummarize the key findings across all pages and list any action items."

    response = model.chat(
        tokenizer,
        pixel_values,
        question,
        dict(max_new_tokens=1024, do_sample=False),
        num_patches_list=num_patches_list,
    )
    return response

# Process a 3-page contract
pages = ["contract_p1.jpg", "contract_p2.jpg", "contract_p3.jpg"]
summary = process_document_pages(pages)
print(summary)
```
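Before concatenating many pages, it's worth sanity-checking the visual-token budget against your context window. The sketch below assumes roughly 256 visual tokens per 448 px tile after InternVL's pixel shuffle; treat that figure as an approximation and verify it against the model config before relying on it.

```python
TOKENS_PER_TILE = 256  # approximate tokens per 448-px tile; verify for your model

def visual_token_estimate(tiles_per_page: list[int], thumbnail: bool = True) -> int:
    """Estimate total visual tokens: tiles per page, plus one thumbnail tile
    per page when a page produced more than one tile (mirroring load_image)."""
    total_tiles = sum(t + (1 if thumbnail and t > 1 else 0) for t in tiles_per_page)
    return total_tiles * TOKENS_PER_TILE

# Three pages at 6 tiles each (plus thumbnails) against an 8k context:
est = visual_token_estimate([6, 6, 6])
print(est, "visual tokens;", 8192 - est, "left for text at 8k context")
```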
## Serving with vLLM
For production use with concurrent requests, vLLM gives you much better throughput than the transformers pipeline. InternVL3 is a supported multimodal model in recent vLLM releases:
```bash
pip install "vllm>=0.6.0"

# Start OpenAI-compatible server
vllm serve OpenGVLab/InternVL3-8B \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4
```
Then query it like any OpenAI-compatible endpoint:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("document.jpg")
response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-8B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
The `--limit-mm-per-prompt image=4` flag caps images per request to prevent memory spikes. Adjust based on your GPU headroom.
## How InternVL3 Compares to Other VLMs

For document tasks specifically, here's where InternVL3 stands relative to other open-weight models you might consider:

- **InternVL3-8B**: Best single-GPU option. Dynamic tiling handles high-res documents better than fixed-resolution models. Strong at structured data extraction.
- **Qwen3-VL-7B**: Very close in capability. Better at multi-step reasoning tasks; slightly weaker on dense OCR. Worth benchmarking on your specific document type.
- **GLM-4.6V-9B**: Good multilingual support (especially Chinese documents). Slightly behind on English document benchmarks.

For raw OCR accuracy on printed English text, InternVL3-8B and Qwen3-VL-7B are neck and neck. InternVL3's dynamic tiling gives it an edge on very large or wide-format documents like spreadsheet screenshots.
## Common Issues

**CUDA out of memory with 8B on 24 GB:**

Lower `max_num` from 12 to 6. Each tile is a separate attention pass; cutting tiles in half roughly halves the peak memory for vision tokens.

```python
pixel_values = load_image("doc.jpg", max_num=6).to(torch.bfloat16).cuda()
```
**`trust_remote_code=True` warning:**

This is expected: InternVL3 ships custom attention code. Review the model card on Hugging Face if you need to audit the code before running it in a secure environment.
**Slow first inference:**

The first call compiles CUDA kernels. Subsequent calls are much faster. Build a warm-up call into your startup sequence if you're measuring latency.
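A generic warm-up-then-measure harness makes that easy to standardize. A minimal sketch; pass any callable, e.g. a lambda wrapping `model.chat`:

```python
import time
from statistics import median

def timed_after_warmup(fn, *args, warmup: int = 1, runs: int = 3):
    """Call fn warmup times untimed, then time it over several runs.
    Returns (last result, median seconds per call)."""
    for _ in range(warmup):
        fn(*args)  # absorbs kernel compilation / lazy initialization
    times, result = [], None
    for _ in range(runs):
        t0 = time.perf_counter()
        result = fn(*args)
        times.append(time.perf_counter() - t0)
    return result, median(times)

# e.g. timed_after_warmup(lambda: model.chat(tokenizer, pixel_values, prompt, cfg))
```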
**vLLM `ValueError: 'limit_mm_per_prompt' is only supported for multimodal models`:**

You're hitting this because vLLM didn't detect the model as multimodal. Make sure you're on `vllm>=0.6.0` and using the correct model ID (`OpenGVLab/InternVL3-8B`, not a fine-tuned variant that may have broken the config).