InternVL3 is one of the best open-weight vision-language models available right now. The 8B variant fits on a single 24 GB GPU (RTX 3090/4090 class), handles high-resolution documents natively via its dynamic tiling approach, and genuinely competes with GPT-4V on document tasks. Here’s how to get it running locally.
## Model Variants and GPU Requirements
Pick your model based on what you have:
| Model | Parameters | Min VRAM (bf16) | Min VRAM (8-bit) | Notes |
|---|---|---|---|---|
| InternVL3-2B | 2B | 8 GB | 5 GB | Fast, good for prototyping |
| InternVL3-8B | 8B | 20 GB | 12 GB | Best single-GPU choice |
| InternVL3-14B | 14B | 32 GB | 18 GB | Stronger reasoning |
| InternVL3-38B | 38B | 2× 80 GB A100 | 1× 80 GB | Near-frontier quality |
| InternVL3-78B | 78B | 3× 80 GB A100 | 2× 80 GB | Full capability |
For document understanding on a single RTX 4090 (24 GB), the 8B model in bf16 is the right call. If you’re on a 16 GB card, use 8-bit quantization.
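If you want to encode these thresholds in a selection script, a small helper can do it. This is a hypothetical convenience function, not part of any InternVL API; the VRAM figures are copied from the table above and are approximate.

```python
# (name, min VRAM bf16, min VRAM 8-bit) — single-GPU variants from the table above
VARIANTS = [
    ("InternVL3-2B", 8, 5),
    ("InternVL3-8B", 20, 12),
    ("InternVL3-14B", 32, 18),
]

def pick_model(vram_gb: float, allow_8bit: bool = True) -> str:
    """Return the largest single-GPU variant that fits in vram_gb."""
    best = None
    for name, bf16_gb, int8_gb in VARIANTS:
        needed = int8_gb if allow_8bit else bf16_gb
        if vram_gb >= needed:
            best = name  # variants are ordered smallest to largest
    return best or "none (even 2B in 8-bit may not fit)"

print(pick_model(24, allow_8bit=False))  # InternVL3-8B
print(pick_model(16))                    # InternVL3-8B
```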
## Installation

```bash
pip install torch torchvision Pillow
pip install "transformers>=4.52.1"
pip install flash-attn --no-build-isolation  # optional but strongly recommended
pip install bitsandbytes  # for 8-bit quantization
```
The `flash-attn` install takes a few minutes and requires a CUDA build environment. If it fails, the model still works; you just lose some throughput. Use `transformers>=4.52.1`; earlier versions have compatibility issues with InternVL3's attention implementation.
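Since flash-attn is optional, one way to keep a single script portable is to gate the flag on whether the package actually imported. A small sketch; the resulting boolean is what you'd pass as `use_flash_attn` in the loading code below.

```python
import importlib.util

def flash_attn_available() -> bool:
    """True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

use_flash = flash_attn_available()
print(f"flash-attn available: {use_flash}")
# later: AutoModel.from_pretrained(..., use_flash_attn=use_flash, ...)
```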
## Loading InternVL3-8B
Here is the complete setup including the image preprocessing utilities InternVL3 requires for its dynamic tiling:
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoTokenizer, AutoModel

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size=448):
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # All (cols, rows) grids whose tile count stays within [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    )
    # Pick the grid whose aspect ratio best matches the image
    best_ratio = min(
        target_ratios,
        key=lambda r: abs(aspect_ratio - r[0] / r[1])
    )
    target_width = image_size * best_ratio[0]
    target_height = image_size * best_ratio[1]
    blocks = best_ratio[0] * best_ratio[1]
    resized = image.resize((target_width, target_height))
    tiles = []
    cols = target_width // image_size
    for i in range(blocks):
        box = (
            (i % cols) * image_size,
            (i // cols) * image_size,
            ((i % cols) + 1) * image_size,
            ((i // cols) + 1) * image_size,
        )
        tiles.append(resized.crop(box))
    # Append a full-image thumbnail so the model also sees global context
    if use_thumbnail and blocks != 1:
        tiles.append(image.resize((image_size, image_size)))
    return tiles

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = torch.stack([transform(t) for t in tiles])
    return pixel_values

# Load model in bf16 on GPU
model_path = "OpenGVLab/InternVL3-8B"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # remove if flash-attn not installed
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=False,
)
```
For 8-bit quantization on a 16 GB card, swap the `from_pretrained` call:
```python
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,  # requires bitsandbytes
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval()
# Note: no .cuda() needed; bitsandbytes handles device placement
```
## Document Understanding Examples

```python
generation_config = dict(max_new_tokens=2048, do_sample=False)

pixel_values = load_image("invoice.jpg", max_num=12).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nExtract all text from this document exactly as it appears, preserving layout.",
    generation_config,
)
print(response)
```
The `<image>` token is required: InternVL3 uses it as a placeholder for the visual input in the prompt. `max_num=12` allows up to 12 tiles for high-resolution documents, which matters for dense text like invoices or contracts.
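To predict how many tiles a given page will produce before running anything on the GPU, the grid-selection logic can be mirrored in a standalone helper. This is a sketch of the same ratio-matching idea used in `dynamic_preprocess`, with an added tie-break toward fewer tiles (the simplified version in this post leaves exact ties, such as square images, to arbitrary set ordering).

```python
def tile_grid(width: int, height: int, max_num: int = 12) -> tuple[int, int]:
    """Return the (cols, rows) tile grid chosen for an image of this size."""
    aspect = width / height
    candidates = {
        (i, j)
        for n in range(1, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i * j <= max_num
    }
    # Primary key: aspect-ratio mismatch; secondary: prefer fewer tiles on ties
    return min(candidates, key=lambda r: (abs(aspect - r[0] / r[1]), r[0] * r[1]))

# An A4 portrait scan (1654×2339 px at 200 dpi) tiles tall, not wide:
cols, rows = tile_grid(1654, 2339)
print(cols, rows, "->", cols * rows, "tiles")  # 2 3 -> 6 tiles
```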
```python
pixel_values = load_image("financial_report.png", max_num=12).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nExtract the table from this image and return it as a CSV with header row.",
    generation_config,
)
print(response)
# Example output: "Date,Revenue,Expenses,Net Profit\n2024-Q1,4200000,3100000,1100000\n..."
```
For tables with merged cells or complex formatting, ask for Markdown instead — it handles those cases more reliably than CSV.
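If you request Markdown but still need CSV downstream, the conversion is easy to do yourself. A minimal sketch, assuming simple pipe tables with no escaped `|` characters inside cells:

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a simple Markdown pipe table to CSV text."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore prose the model may emit around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row (---, :---:, etc.)
        if all(c and set(c) <= set(":- ") for c in cells):
            continue
        rows.append(cells)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

md = """
| Date | Revenue |
|---|---|
| 2024-Q1 | 4200000 |
"""
print(markdown_table_to_csv(md))
```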
## Chart and Graph Reading

```python
pixel_values = load_image("quarterly_chart.png", max_num=6).to(torch.bfloat16).cuda()
response = model.chat(
    tokenizer,
    pixel_values,
    "<image>\nDescribe the trend shown in this chart. List the approximate data points for each period.",
    generation_config,
)
print(response)
```
Reduce `max_num` to 6 for charts: they rarely benefit from the full tiling, and the smaller tile count speeds up inference.
## Multi-Page Document Processing

```python
def process_document_pages(page_paths: list[str]) -> str:
    """Process multiple pages of a document and ask a cross-page question."""
    pixel_values_list = [
        load_image(p, max_num=6).to(torch.bfloat16).cuda()
        for p in page_paths
    ]
    pixel_values = torch.cat(pixel_values_list, dim=0)
    num_patches_list = [pv.size(0) for pv in pixel_values_list]

    # Build multi-image prompt: one <image> placeholder per page
    image_tokens = "\n".join(f"Image-{i+1}: <image>" for i in range(len(page_paths)))
    question = f"{image_tokens}\nSummarize the key findings across all pages and list any action items."

    response = model.chat(
        tokenizer,
        pixel_values,
        question,
        dict(max_new_tokens=1024, do_sample=False),
        num_patches_list=num_patches_list,
    )
    return response

# Process a 3-page contract
pages = ["contract_p1.jpg", "contract_p2.jpg", "contract_p3.jpg"]
summary = process_document_pages(pages)
print(summary)
```
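Before concatenating many pages, it's worth sanity-checking the visual-token budget against your context window. The sketch below assumes roughly 256 visual tokens per 448 px tile after InternVL's pixel shuffle; treat that figure as an approximation and verify it against the model config before relying on it.

```python
TOKENS_PER_TILE = 256  # approximate tokens per 448-px tile; verify for your model

def visual_token_estimate(tiles_per_page: list[int], thumbnail: bool = True) -> int:
    """Estimate total visual tokens: tiles per page, plus one thumbnail tile
    per page when a page produced more than one tile (mirroring load_image)."""
    total_tiles = sum(t + (1 if thumbnail and t > 1 else 0) for t in tiles_per_page)
    return total_tiles * TOKENS_PER_TILE

# Three pages at 6 tiles each (plus thumbnails) against an 8k context:
est = visual_token_estimate([6, 6, 6])
print(est, "visual tokens;", 8192 - est, "left for text at 8k context")
```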
## Serving with vLLM
For production use with concurrent requests, vLLM gives you much better throughput than the transformers pipeline. InternVL3 is a supported multimodal model in recent vLLM releases:
```bash
pip install "vllm>=0.6.0"

# Start OpenAI-compatible server
vllm serve OpenGVLab/InternVL3-8B \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4
```
Then query it like any OpenAI-compatible endpoint:
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("document.jpg")
response = client.chat.completions.create(
    model="OpenGVLab/InternVL3-8B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
The `--limit-mm-per-prompt image=4` flag caps images per request to prevent memory spikes. Adjust based on your GPU headroom.
## How InternVL3 Compares to Other VLMs

For document tasks specifically, here's where InternVL3 stands relative to other open-weight models you might consider:

- **InternVL3-8B**: Best single-GPU option. Dynamic tiling handles high-res documents better than fixed-resolution models. Strong at structured data extraction.
- **Qwen3-VL-7B**: Very close in capability. Better at multi-step reasoning tasks; slightly weaker on dense OCR. Worth benchmarking on your specific document type.
- **GLM-4.6V-9B**: Good multilingual support (especially Chinese documents). Slightly behind on English document benchmarks.

For raw OCR accuracy on printed English text, InternVL3-8B and Qwen3-VL-7B are neck and neck. InternVL3's dynamic tiling gives it an edge on very large or wide-format documents like spreadsheet screenshots.
## Common Issues

**CUDA out of memory with 8B on 24 GB:**

Lower `max_num` from 12 to 6. Each tile is a separate attention pass; cutting tiles in half roughly halves the peak memory for vision tokens.

```python
pixel_values = load_image("doc.jpg", max_num=6).to(torch.bfloat16).cuda()
```
**`trust_remote_code=True` warning:**

This is expected: InternVL3 ships custom attention code. Review the model card on Hugging Face if you need to audit the code before running it in a secure environment.
**Slow first inference:**

The first call compiles CUDA kernels. Subsequent calls are much faster. Build a warm-up call into your startup sequence if you're measuring latency.
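A generic warm-up-then-measure harness makes that easy to standardize. A minimal sketch; pass any callable, e.g. a lambda wrapping `model.chat`:

```python
import time
from statistics import median

def timed_after_warmup(fn, *args, warmup: int = 1, runs: int = 3):
    """Call fn warmup times untimed, then time it over several runs.
    Returns (last result, median seconds per call)."""
    for _ in range(warmup):
        fn(*args)  # absorbs kernel compilation / lazy initialization
    times, result = [], None
    for _ in range(runs):
        t0 = time.perf_counter()
        result = fn(*args)
        times.append(time.perf_counter() - t0)
    return result, median(times)

# e.g. timed_after_warmup(lambda: model.chat(tokenizer, pixel_values, prompt, cfg))
```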
**vLLM `ValueError: 'limit_mm_per_prompt' is only supported for multimodal models`:**

You're hitting this because vLLM didn't detect the model as multimodal. Make sure you're on `vllm>=0.6.0` and using the correct model ID (`OpenGVLab/InternVL3-8B`, not a fine-tuned variant that may have broken the config).