Token costs add up fast when you’re running LLM apps at scale. A single RAG query can burn through thousands of tokens just on context, and those charges hit every single request. The good news? You can compress prompts by 50-80% without losing quality if you know what you’re doing.
Here’s how to measure token usage, compress prompts intelligently, and build cost-efficient pipelines that don’t sacrifice performance.
Measure Token Counts First
Before optimizing, you need to know what you’re dealing with. Use tiktoken to count tokens exactly how OpenAI does it.
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens for a given text using the model's tokenizer."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example: check a typical RAG prompt
context = """Long document context here...""" * 10  # Simulate big context
query = "What are the main benefits of prompt compression?"
prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

token_count = count_tokens(prompt)
cost_per_1k = 0.03  # GPT-4 input pricing per 1K tokens
print(f"Tokens: {token_count:,}")
print(f"Cost per request: ${(token_count / 1000) * cost_per_1k:.4f}")
```
This gives you a baseline. If you’re sending 10k tokens per request and running 10k requests/day, that’s $3,000/day in input costs alone at GPT-4 pricing. That’s your motivation to compress.
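To make the math reusable, here is a quick back-of-the-envelope helper (the volumes and the $0.03/1K input price are illustrative assumptions, not a pricing reference):

```python
def monthly_input_cost(tokens_per_request: int, requests_per_day: int,
                       price_per_1k: float, days: int = 30) -> float:
    """Estimate monthly input-token spend in dollars."""
    daily = (tokens_per_request / 1000) * price_per_1k * requests_per_day
    return daily * days

# 10k tokens/request at 10k requests/day, assuming $0.03 per 1K input tokens
cost = monthly_input_cost(10_000, 10_000, 0.03)
print(f"${cost:,.0f}/month")  # $90,000/month
```

Run the same function with your post-compression token counts to see the projected savings.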
Use LLMLingua for Aggressive Compression
LLMLingua is the best tool for prompt compression right now. It uses a smaller language model to identify and remove non-essential tokens while preserving semantic meaning. You can hit 50-80% compression with minimal quality loss.
```python
from llmlingua import PromptCompressor

# Initialize the compressor (runs a small model locally)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # LLMLingua-2: latest version, better quality
)

# Compress a long context
original_prompt = """
The quarterly financial report shows revenue growth of 23% year-over-year.
Operating expenses increased by 12%, primarily due to expanded marketing
campaigns and new product development costs. Net profit margin improved
from 18% to 21%. Customer acquisition cost decreased by 8% while lifetime
value increased by 15%. The company projects continued growth in Q2 with
expected revenue between $45M and $50M.
"""

compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,  # keep roughly 50% of the tokens
    force_tokens=['\n', '?', '.', '!'],  # preserve structure
)

original_tokens = count_tokens(original_prompt)
compressed_tokens = count_tokens(compressed['compressed_prompt'])
print(f"Original: {original_tokens} tokens")
print(f"Compressed: {compressed_tokens} tokens")
print(f"Compression ratio: {compressed_tokens / original_tokens:.2%}")
print(f"\nCompressed text:\n{compressed['compressed_prompt']}")
```
The key is the rate parameter: it’s the fraction of tokens to keep, so lower means more aggressive compression. Start at 0.5 (keep half) and tune based on your quality needs. Financial data or code? Keep more, around 0.7-0.8. General content? You can push down to 0.2-0.3.
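One way to operationalize that rule of thumb is a simple lookup by content type. This is a hypothetical helper with illustrative starting values; assuming LLMLingua-2’s convention that rate is the fraction of tokens retained, you’d tune these against your own quality evals:

```python
# Rule-of-thumb starting rates (rate = fraction of tokens kept).
# These values are illustrative; calibrate against your own benchmarks.
KEEP_RATES = {
    "code": 0.8,       # precision-critical: keep most tokens
    "financial": 0.7,  # numbers and figures are fragile
    "general": 0.3,    # narrative text tolerates heavy compression
}

def pick_rate(content_type: str, default: float = 0.5) -> float:
    """Return a starting compression rate for a content type."""
    return KEEP_RATES.get(content_type, default)

print(pick_rate("code"))     # 0.8
print(pick_rate("unknown"))  # 0.5
```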
Selective Context Pruning for RAG
Don’t send your entire vector database retrieval to the LLM. Prune it intelligently based on relevance scores and semantic clustering.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def prune_context(
    query_embedding: np.ndarray,
    retrieved_chunks: list[dict],  # [{'text': str, 'embedding': np.ndarray, 'score': float}]
    max_tokens: int = 2000,
    similarity_threshold: float = 0.7,
) -> str:
    """Prune retrieved chunks to fit a token budget while maximizing relevance."""
    # Filter by similarity threshold
    relevant_chunks = [
        chunk for chunk in retrieved_chunks
        if cosine_similarity([query_embedding], [chunk['embedding']])[0][0] >= similarity_threshold
    ]
    # Sort by relevance score, highest first
    relevant_chunks.sort(key=lambda x: x['score'], reverse=True)
    # Pack chunks until the token budget is reached
    pruned_context = []
    current_tokens = 0
    for chunk in relevant_chunks:
        chunk_tokens = count_tokens(chunk['text'])
        if current_tokens + chunk_tokens <= max_tokens:
            pruned_context.append(chunk['text'])
            current_tokens += chunk_tokens
        else:
            break
    return "\n\n".join(pruned_context)

# Example usage
query_emb = np.random.rand(1536)  # Simulated embedding
chunks = [
    {'text': 'Chunk 1...', 'embedding': np.random.rand(1536), 'score': 0.92},
    {'text': 'Chunk 2...', 'embedding': np.random.rand(1536), 'score': 0.85},
    {'text': 'Chunk 3...', 'embedding': np.random.rand(1536), 'score': 0.65},
]
context = prune_context(query_emb, chunks, max_tokens=1000, similarity_threshold=0.75)
print(f"Pruned context tokens: {count_tokens(context)}")
```
This approach is simple but effective. You’re not compressing the text itself, just being smarter about what you send. Combine this with LLMLingua for maximum savings.
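The order matters when combining the two: prune first (cheap, drops whole chunks), then compress what survives. A minimal sketch of that composition, with the two stages abstracted as callables so you can wire in `prune_context` and an LLMLingua compressor (the stub stages below are purely illustrative):

```python
from typing import Callable

def build_context_pipeline(prune: Callable[[str], str],
                           compress: Callable[[str], str]) -> Callable[[str], str]:
    """Chain the two stages: drop irrelevant text first, then shrink what's left."""
    def pipeline(raw_context: str) -> str:
        return compress(prune(raw_context))
    return pipeline

# Stub stages for illustration; in practice, prune would call prune_context
# and compress would call compressor.compress_prompt
def drop_second_half(text: str) -> str:
    return text[: len(text) // 2]

def strip_short_words(text: str) -> str:
    return " ".join(w for w in text.split() if len(w) > 2)

shrink = build_context_pipeline(drop_second_half, strip_short_words)
```

Keeping the stages as plain string-to-string functions makes it easy to A/B test each one against your quality metrics independently.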
Summarization-Based Context Reduction
For long documents, summarize them once and cache the summary. Don’t compress the same content repeatedly.
```python
import hashlib
from functools import lru_cache

import tiktoken
from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1000)
def get_cached_summary(content_hash: str, original_content: str) -> str:
    """Cache summaries to avoid redundant API calls.

    The hash is redundant for lru_cache itself, but doubles as a stable
    key if you swap in a persistent cache later.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for summarization
        messages=[
            {"role": "system", "content": "Summarize the following text concisely, preserving key facts and figures."},
            {"role": "user", "content": original_content},
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content

def summarize_context(text: str, max_summary_tokens: int = 500) -> str:
    """Summarize long context for token reduction."""
    content_hash = hashlib.md5(text.encode()).hexdigest()
    summary = get_cached_summary(content_hash, text)
    # Verify the summary fits the budget; truncate as a fallback
    if count_tokens(summary) > max_summary_tokens:
        encoding = tiktoken.encoding_for_model("gpt-4")
        tokens = encoding.encode(summary)
        summary = encoding.decode(tokens[:max_summary_tokens])
    return summary

# Use in a RAG pipeline
long_document = "..." * 1000  # Simulated long doc
if count_tokens(long_document) > 3000:
    context = summarize_context(long_document, max_summary_tokens=1000)
else:
    context = long_document
```
The caching here is critical. You don’t want to pay to summarize the same document 100 times. Use a persistent cache (Redis, DynamoDB) in production.
Prompt Caching
Both OpenAI and Anthropic offer prompt caching for repeated content. If you’re sending the same system message or context across requests, you get a discount on cached tokens: roughly 50% with OpenAI, and 90% on cache reads with Anthropic (which requires explicitly marking cacheable blocks with cache_control).
OpenAI’s caching is automatic. If you send the same prefix across requests, it caches it server-side:
```python
from openai import OpenAI

client = OpenAI()

# Both requests share the same system message -- OpenAI caches the common
# prefix automatically once it is long enough
system_message = (
    "You are a helpful assistant specialized in NLP and machine learning. "
    "You provide concise, technical answers with code examples when relevant."
)

for question in ["What is prompt compression?", "How does LLMLingua work?"]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    print(response.usage)  # Check cached_tokens in the usage object
```
Cache the parts of your prompt that don’t change: system instructions, few-shot examples, static context. Only the variable parts (the user query) incur full cost. Note that OpenAI only caches prefixes of at least 1,024 tokens, so a short system message like the one above won’t trigger caching on its own; in real pipelines, the bulk of the cached prefix is usually few-shot examples or static context.
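Since caching keys on the exact prefix, it helps to centralize message construction so the static parts always come first and stay byte-identical. A sketch of such a builder (names and contents are illustrative):

```python
# Static prefix: keep byte-identical across requests so the cache hits.
# In practice this is where your long instructions and few-shot examples live.
STATIC_SYSTEM = "You are a helpful assistant specialized in NLP."
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str) -> list[dict]:
    """Assemble messages with the cacheable static prefix first, variable query last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        *FEW_SHOT,
        {"role": "user", "content": user_query},
    ]

msgs = build_messages("What is prompt caching?")
print([m["role"] for m in msgs])  # ['system', 'user', 'assistant', 'user']
```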
Common Errors and Fixes
LLMLingua installation fails on M1 Macs
Use conda instead of pip: conda install -c conda-forge llmlingua. The PyTorch dependencies can be tricky with ARM chips.
Compressed prompts produce gibberish responses
You compressed too aggressively. Raise the rate (the fraction of tokens kept) from 0.2-0.3 to 0.5 or 0.6. Also check that you’re preserving sentence boundaries with force_tokens=['.', '!', '?'].
Caching doesn’t reduce costs
Make sure the cached content is identical byte-for-byte across requests. A single space difference breaks the cache. Hash your system prompts and verify consistency.
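A quick way to catch accidental drift is to pin the expected hash and fail fast at startup (the prompt text here is a hypothetical example):

```python
import hashlib

# Pin the hash of the canonical prompt; any drift -- even one space -- changes it
EXPECTED_PROMPT_SHA = hashlib.sha256(
    b"You are a helpful assistant specialized in NLP."
).hexdigest()

def verify_prompt(prompt: str) -> bool:
    """Return True only if the prompt is byte-identical to the pinned version."""
    return hashlib.sha256(prompt.encode()).hexdigest() == EXPECTED_PROMPT_SHA

assert verify_prompt("You are a helpful assistant specialized in NLP.")
assert not verify_prompt("You are a helpful assistant specialized in NLP. ")  # trailing space
```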
RAG context pruning loses critical information
Your similarity threshold is too high. Lower it from 0.7 to 0.6, or switch to a hybrid approach that always includes the top-3 chunks regardless of score.
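The hybrid approach might look like the sketch below, which reuses the chunk format from the pruning example and treats the relevance score as the ranking signal (a simplifying assumption; you could rank on cosine similarity instead):

```python
def hybrid_select(chunks: list[dict], top_k: int = 3,
                  similarity_threshold: float = 0.6) -> list[dict]:
    """Always keep the top-k chunks by score; add the rest only if above threshold."""
    ranked = sorted(chunks, key=lambda c: c['score'], reverse=True)
    guaranteed = ranked[:top_k]  # included regardless of score
    extras = [c for c in ranked[top_k:] if c['score'] >= similarity_threshold]
    return guaranteed + extras

chunks = [{'text': f'chunk {i}', 'score': s}
          for i, s in enumerate([0.9, 0.5, 0.4, 0.7, 0.3])]
selected = hybrid_select(chunks, top_k=3)
print([c['score'] for c in selected])  # [0.9, 0.7, 0.5]
```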
Token counts don’t match OpenAI billing
You’re using the wrong tokenizer. Always use tiktoken.encoding_for_model(model_name) to match the exact tokenizer for your model: GPT-4o uses a different encoding (o200k_base) than GPT-4 and GPT-3.5-turbo (cl100k_base).