Ollama gives you a one-command path to running models locally. llama.cpp gives you bare-metal control over every inference parameter. Both use GGUF model files, both run on CPU or GPU, and both expose an OpenAI-compatible API. Pick Ollama when you want things to work immediately. Pick llama.cpp when you need to tune every knob.

Here is the fastest way to get a model running on your machine with each tool.

Install and Run Ollama

Ollama packages llama.cpp behind a clean CLI. Install it on Linux with a single command:

curl -fsSL https://ollama.com/install.sh | sh

On macOS, download the app from ollama.com or use Homebrew:

brew install ollama

Pull and run a model in one step:

ollama run llama3.1:8b

That command downloads the 4-bit quantized GGUF file (~4.7 GB), loads it, and drops you into an interactive chat. Ollama auto-detects your GPU – NVIDIA on Linux, Metal on macOS – and offloads layers without any flags.

To start Ollama as a background server and download models without opening an interactive chat:

ollama serve &
ollama pull llama3.1:8b

List your downloaded models and their sizes:

ollama list
# NAME              ID            SIZE    MODIFIED
# llama3.1:8b       a2c6b7c75...   4.7 GB  2 minutes ago
# qwen2.5:7b        845d...        4.4 GB  1 hour ago

Hardware Requirements

The rule of thumb: you need roughly the model’s disk size in RAM (or VRAM for GPU inference) plus 1-2 GB overhead for KV cache and runtime buffers.

Model Size    Quantization    Disk / RAM    Minimum VRAM
7-8B          Q4_K_M          ~4.5 GB       6 GB
13B           Q4_K_M          ~7.5 GB       10 GB
34B           Q4_K_M          ~20 GB        24 GB
70B           Q4_K_M          ~40 GB        48 GB

If the model doesn’t fit entirely in VRAM, both Ollama and llama.cpp split layers between GPU and CPU automatically. You get GPU-speed generation for the offloaded layers and CPU-speed for the rest. A 70B Q4 model on a 24 GB GPU still works – it’s just slower than full offload.

Running on CPU only is viable for 7-8B models on machines with 16+ GB of RAM. Expect 5-15 tokens/second depending on your processor.
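
As a back-of-envelope check of that rule, here is a small Python sketch. The 4.5 bits per weight for Q4_K_M and the flat 1.5 GB overhead are rough assumptions; in practice the KV cache grows with context length.

def estimate_memory_gb(params_billion: float, bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.5) -> float:
    # 1 billion parameters at 8 bits is roughly 1 GB, so scale by bits/8.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for size in (8, 13, 34, 70):
    print(f"{size}B at ~4.5 bits/weight: ~{estimate_memory_gb(size):.1f} GB")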

Use the OpenAI-Compatible API

Ollama exposes an API on port 11434 that mimics the OpenAI chat completions endpoint. Any tool built for the OpenAI SDK works with a URL swap:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain GGUF in one paragraph."}],
    "temperature": 0.7
  }'

In Python with the OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is GGUF?"}],
)
print(response.choices[0].message.content)

This means you can swap a local model into any LangChain, LiteLLM, or custom app that targets the OpenAI API. Set the base URL, pick a model name from ollama list, and everything else stays the same.
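
As one concrete example, here is a minimal LangChain sketch pointed at Ollama. It assumes a recent langchain-openai release; parameter names may differ slightly in older versions.

# pip install langchain-openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama3.1:8b",
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder; Ollama ignores the key
    temperature=0.7,
)
print(llm.invoke("What is GGUF?").content)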

Build llama.cpp from Source

When you need to control GPU layer count, context size, batch parameters, or run models Ollama doesn’t support yet, build llama.cpp directly.

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# CPU-only build
cmake -B build
cmake --build build --config Release -j$(nproc)

# NVIDIA GPU build (requires CUDA toolkit)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# macOS Metal build (enabled by default on Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(nproc)

Download a GGUF model from Hugging Face (Bartowski, TheBloke, and the official model repos all publish pre-quantized GGUFs):

# Example: grab a Q4_K_M quantization of Llama 3.1 8B
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
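
If you would rather script the download than use wget, the same file can be fetched with the huggingface_hub library. This is a sketch using the repo and filename from the example above; swap in whichever quantization you need.

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)
print(path)  # cached local path; pass it to llama-cli or llama-server with -m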

Run inference from the CLI:

./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  -p "Explain the GGUF file format in three sentences."

The -ngl 99 flag offloads all model layers to the GPU. If you don’t have enough VRAM, lower this number – llama.cpp will keep the remaining layers on CPU. The -c 4096 flag sets the context window size.

Run llama.cpp as an API Server

llama.cpp ships llama-server, which exposes the same OpenAI-compatible endpoint:

./build/bin/llama-server \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080

Now you can hit http://localhost:8080/v1/chat/completions with the same curl or Python code from the Ollama section. Just change the port.

For production-like setups, add --parallel 4 to handle multiple concurrent requests and --cont-batching for continuous batching. These flags make a real difference under load.
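
To sanity-check the server, the Python client from the Ollama section works unchanged apart from the base URL. The sketch below also enables streaming; the model field is assumed to be informational only, since llama-server serves whatever single model was loaded with -m.

from openai import OpenAI

# Same OpenAI client as before, pointed at llama-server instead of Ollama.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

stream = client.chat.completions.create(
    model="local",  # assumed placeholder; the loaded GGUF is served regardless
    messages=[{"role": "user", "content": "Explain continuous batching in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()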

GGUF Quantization Formats

GGUF files come in different quantization levels. The naming convention tells you the precision:

  • Q4_K_M – 4-bit quantization, medium quality. Best balance of size and quality for most use cases.
  • Q5_K_M – 5-bit. Slightly better output quality, ~25% larger files.
  • Q8_0 – 8-bit. Near-original quality, roughly double the size of Q4.
  • Q2_K – 2-bit. Aggressive compression. Noticeable quality loss, but fits large models in limited RAM.
  • F16 – 16-bit float. No quality loss. Use only if you have the VRAM to spare.

For a 7-8B model, start with Q4_K_M. If the outputs feel off, step up to Q5_K_M. For 70B+ models where VRAM is tight, Q4_K_M or even Q3_K_M is usually the pragmatic choice.

Troubleshooting Common Errors

Ollama returns “Error: model requires more system memory”

The model doesn’t fit in available RAM + VRAM. Pull a smaller quantization tag from the model’s library page (for example, a q3_K_M or q2_K variant of llama3.1:8b), switch to a smaller model, or close other applications to free memory.

llama.cpp crashes with CUDA error: out of memory

Lower the -ngl value. Start with -ngl 20 and increase until you hit your VRAM ceiling. You can check per-layer memory usage in the startup logs – llama.cpp prints the size of each offloaded layer.

Ollama shows connection refused on port 11434

The Ollama server isn’t running. Start it with ollama serve. On Linux, if you installed via the script, it runs as a systemd service:

sudo systemctl status ollama
sudo systemctl start ollama
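
A quick way to confirm whether anything is listening on that port, sketched in Python with the requests library:

# pip install requests
import requests

try:
    r = requests.get("http://localhost:11434/", timeout=2)
    print(r.status_code, r.text)  # a running server answers "Ollama is running"
except requests.ConnectionError:
    print("Nothing on 11434 -- start the server with `ollama serve`")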

Slow generation on a machine with a GPU

Verify the GPU is actually being used. In Ollama, check ollama ps – it shows which device the model is loaded on. In llama.cpp, the startup log prints lines like llm_load_tensors: offloading 32 layers to GPU. If you see 0 layers offloaded, rebuild with CUDA or Metal support.

Docker containers can’t reach Ollama

localhost:11434 inside a container points to the container, not the host. Use host.docker.internal:11434 on macOS/Windows, or 172.17.0.1:11434 on Linux. Alternatively, run Ollama inside the same Docker network.
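
One way to keep the same client code working on the host and inside a container is to make the base URL configurable. OLLAMA_BASE_URL below is a made-up name for this sketch, not a variable Ollama itself reads.

import os
from openai import OpenAI

# On the host this falls back to localhost; inside a container, override it, e.g.
#   docker run -e OLLAMA_BASE_URL=http://host.docker.internal:11434/v1 my-app
base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1")
client = OpenAI(base_url=base_url, api_key="ollama")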

Performance Tuning

A few flags that make a measurable difference in llama.cpp:

  • -ngl: Offload as many layers as your VRAM allows. Every layer on the GPU is faster than CPU.
  • -c: Larger context windows consume more memory. If you don’t need 8K context, drop to 2048 or 4096.
  • -b (batch size): Higher values speed up prompt processing at the cost of more memory. Default is 2048; try 512 if you’re memory constrained.
  • --mlock: Locks the model in RAM, preventing the OS from swapping it to disk. Important on machines where swap kills performance.
  • --flash-attn (-fa): Enables Flash Attention, which speeds up prompt processing and reduces memory pressure at long context lengths when layers are offloaded to the GPU.

On multi-GPU NVIDIA setups, setting CUDA_SCALE_LAUNCH_QUEUES=4x increases the command buffer size, which can significantly improve prompt-processing throughput with pipeline parallelism.

For Ollama, most of these are handled automatically. You can override defaults by setting environment variables like OLLAMA_NUM_PARALLEL=4 for concurrent request handling or OLLAMA_MAX_LOADED_MODELS=2 to keep multiple models in memory.

When to Use Which

Choose Ollama when:

  • You want to pull and run models with zero configuration
  • You need a stable API server for app development
  • You don’t want to manage GGUF files manually

Choose llama.cpp when:

  • You need precise control over GPU layer offloading
  • You’re running models not yet in the Ollama registry
  • You want to quantize your own models from scratch
  • You need to tune batch size, context length, and memory locking per deployment

Both tools are actively maintained and handle the same core task. Ollama is llama.cpp with batteries included. llama.cpp is the engine you drop into when the batteries don’t fit your chassis.