Qwen3-Embedding-8B is the current top model on the MTEB multilingual leaderboard—70.58 average score, beating Google’s Gemini embedding model. Pair it with Qdrant’s native hybrid search and you get a retrieval pipeline that handles both exact keyword matches and semantic similarity without any hacks. Here’s the complete setup.
## Why Hybrid Search
Pure dense vector search excels at semantic similarity but can miss exact matches. If someone asks for “CVE-2024-3094” and your documents contain that exact string, a dense search may not rank it first, because the embedding doesn’t encode character-level patterns. BM25 fixes that—it’s essentially weighted term frequency with length normalization, so exact and near-exact matches always score well.
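To make that concrete, here is a toy BM25 scorer (an illustration of the scoring idea, not what Qdrant runs internally; the corpus and parameters are made up). The exact-string query only scores on the document that actually contains the term:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Toy BM25: score one tokenized document against a query over a tiny corpus."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)
        # Term frequency saturates via k1; b penalizes longer documents
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

corpus = [
    "CVE-2024-3094 is a backdoor in xz".split(),
    "vector search finds similar meaning".split(),
]
# The document containing the exact string scores > 0; the other scores 0
print(bm25_score(["CVE-2024-3094"], corpus[0], corpus))
print(bm25_score(["CVE-2024-3094"], corpus[1], corpus))
```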
Hybrid search runs both in parallel and fuses the ranked results using Reciprocal Rank Fusion (RRF). Documents that rank well in both searches bubble to the top; documents that only rank in one get a smaller boost. In practice this gives you 5-15% retrieval improvement over either method alone on mixed corpora.
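Qdrant performs the fusion server-side, but the RRF formula itself is simple enough to sketch client-side: each document contributes `1 / (k + rank)` per ranked list it appears in, with `k = 60` as the conventional damping constant.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = [3, 1, 4]   # doc ids, best first
sparse_ranking = [3, 2, 1]
print(rrf_fuse([dense_ranking, sparse_ranking]))  # [3, 1, 2, 4]
```

Doc 3 ranks first in both lists and wins decisively; doc 1 appears in both and beats docs 2 and 4, which each appear in only one list.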
## Prerequisites and Installation
You need a running Qdrant instance. The fastest way locally:
```bash
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```
Then install the Python dependencies:
```bash
pip install "qdrant-client>=1.9.0"
pip install "sentence-transformers>=3.0.0"
pip install "transformers>=4.51.0"
pip install torch fastembed
```
`fastembed` handles BM25 sparse vector generation client-side: it ships the `Qdrant/bm25` sparse model, which plugs directly into the Qdrant client. `sentence-transformers>=3.0.0` is required for Qwen3’s instruction-aware encoding.
## Step 1 — Embed Documents with Qwen3-Embedding-8B
Qwen3-Embedding-8B supports task-specific instructions on the query side, which gives 1-5% retrieval improvement. The document side doesn’t use instructions.
```python
from sentence_transformers import SentenceTransformer
import torch

embedding_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-8B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",  # remove if not installed
        "device_map": "auto",
        "torch_dtype": torch.float16,
    },
    tokenizer_kwargs={"padding_side": "left"},
)

# Your documents — any length up to 32K tokens
documents = [
    "Qdrant is a vector database written in Rust, designed for high-performance similarity search.",
    "BM25 is a probabilistic information retrieval function used in search engines.",
    "Reciprocal Rank Fusion combines ranked lists from multiple retrieval methods.",
    "Qwen3-Embedding-8B achieves 70.58 on the MTEB multilingual benchmark.",
    "Flash Attention 2 reduces memory usage and speeds up attention computation for long sequences.",
]

# Encode documents — no instruction prefix for documents
doc_embeddings = embedding_model.encode(
    documents,
    batch_size=8,
    normalize_embeddings=True,  # cosine similarity requires normalized vectors
    show_progress_bar=True,
)

print(f"Embedding shape: {doc_embeddings.shape}")
# Embedding shape: (5, 4096)
```
The embedding dimension is 4096. If you need smaller vectors (e.g., for cost or speed reasons), Qwen3 supports Matryoshka Representation Learning—you can truncate to 512 or 1024 dimensions with minimal quality loss:
```python
# Truncate to 1024 dimensions — faster storage and retrieval
import numpy as np
import torch
import torch.nn.functional as F

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    t = torch.tensor(embeddings)
    t = t[:, :dim]
    t = F.normalize(t, p=2, dim=1)  # re-normalize after truncation
    return t.numpy()

doc_embeddings_1024 = truncate_and_normalize(doc_embeddings, 1024)
```
## Step 2 — Create a Hybrid Qdrant Collection
Qdrant stores both dense and sparse vectors per point. The sparse vectors power BM25; the dense vectors power semantic search.
```python
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

COLLECTION_NAME = "knowledge_base"
VECTOR_DIM = 4096  # or 1024 if using truncation

# Create collection with both vector types
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=VECTOR_DIM,
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)

print(f"Collection '{COLLECTION_NAME}' created.")
```
## Step 3 — Generate BM25 Sparse Vectors and Upsert
fastembed’s SparseTextEmbedding generates BM25 sparse vectors locally without any external API call:
```python
from fastembed import SparseTextEmbedding

bm25_model = SparseTextEmbedding(model_name="Qdrant/bm25")

# Generate sparse vectors for all documents
sparse_embeddings = list(bm25_model.embed(documents))

# Build PointStructs with both vector types
points = []
for idx, (doc, dense_vec, sparse_vec) in enumerate(
    zip(documents, doc_embeddings, sparse_embeddings)
):
    points.append(
        models.PointStruct(
            id=idx,
            vector={
                "dense": dense_vec.tolist(),
                "sparse": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
            },
            payload={"text": doc, "doc_id": idx},
        )
    )

# Upsert all points
client.upsert(collection_name=COLLECTION_NAME, points=points)
print(f"Upserted {len(points)} documents.")
```
## Step 4 — Hybrid Search at Query Time
This is where RRF fusion happens. Qdrant’s query_points API runs both searches as prefetch operations, then fuses the results:
```python
def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    # Encode query with task instruction — this is the key Qwen3 feature
    task_instruction = "Given a user question, retrieve relevant passages that answer the question"
    query_with_instruction = f"Instruct: {task_instruction}\nQuery:{query}"
    query_dense = embedding_model.encode(
        [query_with_instruction],
        normalize_embeddings=True,
    )[0].tolist()

    # Generate sparse vector for the query
    query_sparse = list(bm25_model.embed([query]))[0]

    results = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            # Dense semantic search — retrieve top 50 candidates
            models.Prefetch(
                query=query_dense,
                using="dense",
                limit=50,
            ),
            # BM25 sparse search — retrieve top 50 candidates
            models.Prefetch(
                query=models.SparseVector(
                    indices=query_sparse.indices.tolist(),
                    values=query_sparse.values.tolist(),
                ),
                using="sparse",
                limit=50,
            ),
        ],
        # Fuse with Reciprocal Rank Fusion, return top_k
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
        with_payload=True,
    )
    return [
        {"text": r.payload["text"], "score": r.score, "id": r.id}
        for r in results.points
    ]

# Test it
hits = hybrid_search("what is BM25 used for?")
for hit in hits:
    print(f"[{hit['score']:.4f}] {hit['text'][:80]}")
```
## Step 5 — Generate with an LLM
The retrieval half is done. Pass the results to any LLM:
```python
from openai import OpenAI

llm_client = OpenAI(api_key="your-key-here")  # or point to a local vLLM server

def rag_answer(question: str) -> str:
    chunks = hybrid_search(question, top_k=3)
    context = "\n\n".join([f"[{i+1}] {c['text']}" for i, c in enumerate(chunks)])
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or any model you're running locally
        messages=[
            {
                "role": "system",
                "content": "Answer the question using only the provided context. Be concise.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0,
        max_tokens=512,
    )
    return response.choices[0].message.content

answer = rag_answer("What benchmarks does Qwen3-Embedding achieve?")
print(answer)
```
## Evaluation and Tuning

### Measuring Retrieval Quality
Before tuning, you need a baseline. If you have question-answer pairs from your domain:
```python
def evaluate_retrieval(qa_pairs: list[dict], top_k: int = 5) -> dict:
    """
    qa_pairs: [{"question": "...", "relevant_doc_ids": [0, 3]}, ...]
    Returns hit rate and MRR.
    """
    hit_count = 0
    reciprocal_ranks = []
    for pair in qa_pairs:
        hits = hybrid_search(pair["question"], top_k=top_k)
        retrieved_ids = [h["id"] for h in hits]
        relevant = set(pair["relevant_doc_ids"])
        # Hit rate: did any relevant doc appear in top_k?
        if any(rid in relevant for rid in retrieved_ids):
            hit_count += 1
        # MRR: rank of first relevant document
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)
    hit_rate = hit_count / len(qa_pairs)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return {"hit_rate": hit_rate, "mrr": mrr}
```
### Tuning Tips
- **Prefetch limit:** The 50-candidate prefetch is a starting point. If your corpus is over 100K documents, push this to 100 or 200. RRF is cheap; the bottleneck is ANN search accuracy at higher recall levels.
- **Embedding truncation vs. full 4096:** Run your evaluation both ways. For most English-only corpora, 1024 dimensions loses under 1% hit rate while cutting storage and search latency by 4x. For multilingual corpora, stay at 4096.
- **Task instructions matter:** The `Instruct: ...\nQuery:` prefix is not optional. Qwen3-Embedding was trained to use it—omitting it measurably drops retrieval scores on out-of-domain queries. You only add this to queries, not to documents.
- **Chunk size:** Qwen3’s 32K context window means you can embed very long passages, but shorter chunks (256-512 tokens) usually give better retrieval precision. The model captures long-range context well, but matching against a specific paragraph is easier than matching against a 10-page document.
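A minimal chunking sketch, using words as a rough proxy for tokens (`chunk_words` is a hypothetical helper; a production pipeline would count real tokens with the model’s tokenizer). Overlapping windows keep sentences that straddle a boundary retrievable from both sides:

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (words as a rough token proxy)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 700-word document becomes three overlapping ~300-word chunks
long_doc = " ".join(f"word{i}" for i in range(700))
chunks = chunk_words(long_doc)
print(len(chunks))
```

Each chunk would then be embedded and upserted as its own point, with a payload field linking back to the parent document.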
## Common Issues
**`KeyError: 'qwen3'` on model load:** you’re on `transformers<4.51.0`. Upgrade:
```bash
pip install "transformers>=4.51.0"
```
**`ValueError: Sparse vector indices must be sorted` from Qdrant:** fastembed normally returns sorted indices, but double-check when building `PointStruct`s manually. Sort them:
```python
indices, values = zip(*sorted(zip(sparse_vec.indices, sparse_vec.values)))
sparse_vector = models.SparseVector(indices=list(indices), values=list(values))
```
**OOM with Qwen3-Embedding-8B on a 16 GB GPU:** use float16 explicitly and reduce the batch size:
```python
embedding_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-8B",
    model_kwargs={"torch_dtype": torch.float16, "device_map": "auto"},
    tokenizer_kwargs={"padding_side": "left"},
)
doc_embeddings = embedding_model.encode(documents, batch_size=2)
```
If you’re still tight on memory, switch to the 4B variant, `Qwen/Qwen3-Embedding-4B`, which scores 69.45 on MTEB multilingual and fits comfortably in 12 GB.