Ollama gives you a one-command path to running models locally. llama.cpp gives you bare-metal control over every inference parameter. Both use GGUF model files, both run on CPU or GPU, and both expose an OpenAI-compatible API. Pick Ollama when you want things to work immediately. Pick llama.cpp when you need to tune every knob.
Here is the fastest way to get a model running on your machine with each tool.
Install and Run Ollama
Ollama packages llama.cpp behind a clean CLI. Install it on Linux with a single command:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On macOS, download the app from ollama.com or use Homebrew:
```bash
brew install ollama
```
Pull and run a model in one step:
```bash
ollama run llama3.1:8b
```
That command downloads the 4-bit quantized GGUF file (~4.7 GB), loads it, and drops you into an interactive chat. Ollama auto-detects your GPU – NVIDIA on Linux, Metal on macOS – and offloads layers without any flags.
To run the model as a background server instead of an interactive session:
```bash
ollama serve
```
List your downloaded models and their sizes:
```bash
ollama list
```
Hardware Requirements
The rule of thumb: you need roughly the model’s disk size in RAM (or VRAM for GPU inference) plus 1-2 GB overhead for KV cache and runtime buffers.
| Model Size | Quantization | Disk / RAM | Minimum VRAM |
|---|---|---|---|
| 7-8B | Q4_K_M | ~4.5 GB | 6 GB |
| 13B | Q4_K_M | ~7.5 GB | 10 GB |
| 34B | Q4_K_M | ~20 GB | 24 GB |
| 70B | Q4_K_M | ~40 GB | 48 GB |
If the model doesn’t fit entirely in VRAM, both Ollama and llama.cpp split layers between GPU and CPU automatically. You get GPU-speed generation for the offloaded layers and CPU-speed for the rest. A 70B Q4 model on a 24 GB GPU still works – it’s just slower than full offload.
Running on CPU only is viable for 7-8B models on machines with 16+ GB of RAM. Expect 5-15 tokens/second depending on your processor.
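The rule of thumb above can be sketched as a quick calculation. This is a hypothetical helper, not part of either tool; ~4.5 effective bits per weight for Q4_K_M (including quantization metadata) is an approximation:

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 2.0) -> float:
    """Rough memory estimate: weights (params * bits / 8 bits-per-byte)
    plus KV cache and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# An 8B model at ~4.5 effective bits (Q4_K_M):
print(f"{estimate_memory_gb(8, 4.5):.1f} GB")  # → 6.5 GB
```

The result lines up with the table: an 8B Q4_K_M model wants roughly 6 GB of VRAM, and a 70B model lands around 40 GB plus overhead.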
Use the OpenAI-Compatible API
Ollama exposes an API on port 11434 that mimics the OpenAI chat completions endpoint. Any tool built for the OpenAI SDK works with a URL swap:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```
In Python with the OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```
This means you can swap a local model into any LangChain, LiteLLM, or custom app that targets the OpenAI API. Set the base URL, pick a model name from ollama list, and everything else stays the same.
Build llama.cpp from Source
When you need to control GPU layer count, context size, batch parameters, or run models Ollama doesn’t support yet, build llama.cpp directly.
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON on macOS; Metal is on by default
cmake --build build --config Release -j
```
Download a GGUF model from Hugging Face (Bartowski, TheBloke, and the official model repos all publish pre-quantized GGUFs):
```bash
pip install -U "huggingface_hub[cli]"
# Repo and filename are examples; browse the model page for other quantizations
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir models
```
Run inference from the CLI:
```bash
./build/bin/llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 -p "Explain GGUF quantization in two sentences."
```
The -ngl 99 flag offloads all model layers to the GPU. If you don’t have enough VRAM, lower this number – llama.cpp will keep the remaining layers on CPU. The -c 4096 flag sets the context window size.
Run llama.cpp as an API Server
llama.cpp ships llama-server, which exposes the same OpenAI-compatible endpoint:
```bash
./build/bin/llama-server -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 --port 8080
```
Now you can hit http://localhost:8080/v1/chat/completions with the same curl or Python code from the Ollama section. Just change the port.
For production-like setups, add --parallel 4 so the server can handle multiple concurrent requests. Continuous batching (--cont-batching) is enabled by default in recent builds, but passing the flag explicitly does no harm. These settings make a real difference under load.
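Putting those flags together, a production-leaning invocation might look like this (the model path is an example):

```shell
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  --parallel 4 --cont-batching \
  --host 0.0.0.0 --port 8080
```

Binding to 0.0.0.0 makes the server reachable from other machines and containers; drop it if you only need localhost access.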
GGUF Quantization Formats
GGUF files come in different quantization levels. The naming convention tells you the precision:
- Q4_K_M – 4-bit quantization, medium quality. Best balance of size and quality for most use cases.
- Q5_K_M – 5-bit. Slightly better output quality, ~25% larger files.
- Q8_0 – 8-bit. Near-original quality, roughly double the size of Q4.
- Q2_K – 2-bit. Aggressive compression. Noticeable quality loss, but fits large models in limited RAM.
- F16 – 16-bit float. No quality loss. Use only if you have the VRAM to spare.
For a 7-8B model, start with Q4_K_M. If the outputs feel off, step up to Q5_K_M. For 70B+ models where VRAM is tight, Q4_K_M or even Q3_K_M is usually the pragmatic choice.
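To compare the formats concretely, here is a rough size calculator. The effective bits-per-weight figures are my approximations (they fold in quantization metadata) and vary slightly by model:

```python
# Approximate effective bits per weight for common GGUF quantizations
QUANT_BITS = {"Q2_K": 2.6, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size: parameters * effective bits / 8 bits-per-byte."""
    return params_billion * QUANT_BITS[quant] / 8

# Sizes for an 8B-parameter model at each quantization level
for quant in ("Q2_K", "Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"{quant:7s} ~{file_size_gb(8, quant):5.1f} GB")
```

The ratios match the list above: Q5_K_M comes out roughly a quarter larger than Q4_K_M, and Q8_0 close to double.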
Troubleshooting Common Errors
Ollama returns “Error: model requires more system memory”
The model doesn’t fit in available RAM + VRAM. Switch to a smaller quantization tag from the model’s page on ollama.com (for example a q3_K_M or q2_K variant instead of the default), pick a smaller model, or close other applications to free memory.
llama.cpp crashes with CUDA error: out of memory
Lower the -ngl value. Start with -ngl 20 and increase until you hit your VRAM ceiling. You can check per-layer memory usage in the startup logs – llama.cpp prints the size of each offloaded layer.
Ollama shows connection refused on port 11434
The Ollama server isn’t running. Start it with ollama serve. On Linux, if you installed via the script, it runs as a systemd service:
```bash
sudo systemctl status ollama    # check whether the service is running
sudo systemctl restart ollama   # (re)start it
```
Slow generation on a machine with a GPU
Verify the GPU is actually being used. In Ollama, check ollama ps – it shows which device the model is loaded on. In llama.cpp, the startup log prints lines like llm_load_tensors: offloading 32 layers to GPU. If you see 0 layers offloaded, rebuild with CUDA or Metal support.
Docker containers can’t reach Ollama
localhost:11434 inside a container points to the container, not the host. Use host.docker.internal:11434 on macOS/Windows, or the Docker bridge gateway (typically 172.17.0.1:11434) on Linux. On Linux, Ollama also has to listen on more than loopback for this to work: start it with OLLAMA_HOST=0.0.0.0. Alternatively, run Ollama inside the same Docker network.
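On Linux you can avoid hard-coding the bridge IP by mapping host.docker.internal to the host gateway yourself; a sketch, assuming Ollama is listening on all interfaces:

```shell
# Make host.docker.internal resolve inside the container (built in on macOS/Windows)
docker run --rm --add-host=host.docker.internal:host-gateway \
  curlimages/curl:latest \
  curl -s http://host.docker.internal:11434/api/tags
```

The /api/tags endpoint lists installed models, so a JSON response confirms the container can reach the host's Ollama server.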
Performance Tuning
A few flags that make a measurable difference in llama.cpp:
- -ngl: Offload as many layers as your VRAM allows. Every layer on the GPU is faster than on CPU.
- -c: Larger context windows consume more memory. If you don’t need 8K context, drop to 2048 or 4096.
- -b (batch size): Higher values speed up prompt processing at the cost of more memory. The default is 2048; try 512 if you’re memory constrained.
- --mlock: Locks the model in RAM, preventing the OS from swapping it to disk. Important on machines where swap kills performance.
- -fa (--flash-attn): Enables FlashAttention kernels, which speed up attention, especially at long context, on supported GPUs.
On multi-GPU setups, llama.cpp splits layers across devices automatically. The --split-mode row option can improve generation throughput for some models, and --tensor-split controls how much of the model each GPU holds.
For Ollama, most of these are handled automatically. You can override defaults by setting environment variables like OLLAMA_NUM_PARALLEL=4 for concurrent request handling or OLLAMA_MAX_LOADED_MODELS=2 to keep multiple models in memory.
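When Ollama was installed as a systemd service, environment variables belong in a service override rather than your shell; a sketch (the values shown are examples):

```shell
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"
sudo systemctl restart ollama
```

systemctl edit writes a drop-in override file, so the settings survive package upgrades.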
When to Use Which
Choose Ollama when:
- You want to pull and run models with zero configuration
- You need a stable API server for app development
- You don’t want to manage GGUF files manually
Choose llama.cpp when:
- You need precise control over GPU layer offloading
- You’re running models not yet in the Ollama registry
- You want to quantize your own models from scratch
- You need to tune batch size, context length, and memory locking per deployment
Both tools are actively maintained and handle the same core task. Ollama is llama.cpp with batteries included. llama.cpp is the engine you drop into when the batteries don’t fit your chassis.
Related Guides
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Optimize Docker Images for ML Model Serving
- How to Quantize LLMs with GPTQ and AWQ
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Compile and Optimize PyTorch Models with torch.compile
- How to Optimize Model Inference with ONNX Runtime
- How to Deploy Models to Edge Devices with ONNX and TensorRT
- How to Set Up Multi-GPU Training with PyTorch
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Set Up Distributed Training with DeepSpeed and ZeRO