Qwen3-Embedding-8B is the current top model on the MTEB multilingual leaderboard—70.58 average score, beating Google’s Gemini embedding model. Pair it with Qdrant’s native hybrid search and you get a retrieval pipeline that handles both exact keyword matches and semantic similarity without any hacks. Here’s the complete setup.

Pure dense vector search is excellent for semantic similarity, but it can miss exact matches. If someone asks for “CVE-2024-3094” and your documents contain that exact string, a dense search may not rank it first because the embedding doesn’t encode character-level patterns. BM25 fixes that: it scores documents by weighted term frequency, so exact and near-exact matches reliably score well.
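To make the BM25 intuition concrete, here is a toy scorer implementing the standard Okapi BM25 formula (an illustrative sketch, not the implementation Qdrant ships):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Toy Okapi BM25: score one tokenized document against query terms.

    corpus: list of tokenized documents, used for IDF and average length.
    """
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                               # term frequency
        norm = tf + k1 * (1 - b + b * len(doc) / avgdl)    # length normalization
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    ["patched", "CVE-2024-3094", "in", "xz"],
    ["general", "security", "advisory"],
]
# The document containing the exact token gets a positive score; the other gets zero.
print(bm25_score(["CVE-2024-3094"], corpus[0], corpus) > 0)  # True
print(bm25_score(["CVE-2024-3094"], corpus[1], corpus))      # 0.0
```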

Hybrid search runs both in parallel and fuses the ranked results using Reciprocal Rank Fusion (RRF). Documents that rank well in both searches bubble to the top; documents that only rank in one get a smaller boost. In practice this gives you 5-15% retrieval improvement over either method alone on mixed corpora.
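The fusion step itself is only a few lines. A minimal sketch of RRF, using k=60 (the constant from the original RRF paper):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked doc-id lists with Reciprocal Rank Fusion.

    Each document's fused score is sum(1 / (k + rank)) over every list
    it appears in, so documents ranked well by multiple retrievers win.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # dense semantic ranking
sparse_hits = ["doc_a", "doc_d"]           # BM25 ranking
print(rrf_fuse([dense_hits, sparse_hits]))
# doc_a comes first: it ranks near the top of both lists
```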

Prerequisites and Installation

You need a running Qdrant instance. The fastest way locally:

docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Then install the Python dependencies:

pip install "qdrant-client>=1.9.0"
pip install "sentence-transformers>=3.0.0"
pip install "transformers>=4.51.0"
pip install torch fastembed

fastembed is used for BM25 sparse vector generation on the Qdrant side—it includes the Qdrant/bm25 sparse model that integrates directly with the client. sentence-transformers>=3.0.0 is required for Qwen3’s instruction-aware encoding.

Step 1 — Embed Documents with Qwen3-Embedding-8B

Qwen3-Embedding-8B supports task-specific instructions on the query side, which gives 1-5% retrieval improvement. The document side doesn’t use instructions.

from sentence_transformers import SentenceTransformer
import torch

embedding_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-8B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",  # remove if not installed
        "device_map": "auto",
        "torch_dtype": torch.float16,
    },
    tokenizer_kwargs={"padding_side": "left"},
)

# Your documents — any length up to 32K tokens
documents = [
    "Qdrant is a vector database written in Rust, designed for high-performance similarity search.",
    "BM25 is a probabilistic information retrieval function used in search engines.",
    "Reciprocal Rank Fusion combines ranked lists from multiple retrieval methods.",
    "Qwen3-Embedding-8B achieves 70.58 on the MTEB multilingual benchmark.",
    "Flash Attention 2 reduces memory usage and speeds up attention computation for long sequences.",
]

# Encode documents — no instruction prefix for documents
doc_embeddings = embedding_model.encode(
    documents,
    batch_size=8,
    normalize_embeddings=True,  # unit-length vectors: dot product equals cosine
    show_progress_bar=True,
)

print(f"Embedding shape: {doc_embeddings.shape}")
# Embedding shape: (5, 4096)

The embedding dimension is 4096. If you need smaller vectors (e.g., for cost or speed reasons), Qwen3 supports Matryoshka Representation Learning—you can truncate to 512 or 1024 dimensions with minimal quality loss:

# Truncate to 1024 dimensions — faster storage and retrieval
import numpy as np
import torch.nn.functional as F
import torch

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    t = torch.tensor(embeddings)
    t = t[:, :dim]
    t = F.normalize(t, p=2, dim=1)
    return t.numpy()

doc_embeddings_1024 = truncate_and_normalize(doc_embeddings, 1024)

Step 2 — Create a Hybrid Qdrant Collection

Qdrant stores both dense and sparse vectors per point. The sparse vectors power BM25; the dense vectors power semantic search.

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
COLLECTION_NAME = "knowledge_base"
VECTOR_DIM = 4096  # or 1024 if using truncation

# Create collection with both vector types.
# recreate_collection is deprecated in recent qdrant-client releases,
# so drop any existing collection explicitly, then create it.
if client.collection_exists(COLLECTION_NAME):
    client.delete_collection(COLLECTION_NAME)

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=VECTOR_DIM,
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(on_disk=False)
        )
    },
)
print(f"Collection '{COLLECTION_NAME}' created.")

Step 3 — Generate BM25 Sparse Vectors and Upsert

fastembed’s SparseTextEmbedding generates BM25 sparse vectors locally without any external API call:

from fastembed import SparseTextEmbedding

bm25_model = SparseTextEmbedding(model_name="Qdrant/bm25")

# Generate sparse vectors for all documents
sparse_embeddings = list(bm25_model.embed(documents))

# Build PointStructs with both vector types
points = []
for idx, (doc, dense_vec, sparse_vec) in enumerate(
    zip(documents, doc_embeddings, sparse_embeddings)
):
    points.append(
        models.PointStruct(
            id=idx,
            vector={
                "dense": dense_vec.tolist(),
                "sparse": models.SparseVector(
                    indices=sparse_vec.indices.tolist(),
                    values=sparse_vec.values.tolist(),
                ),
            },
            payload={"text": doc, "doc_id": idx},
        )
    )

# Upsert all points
client.upsert(collection_name=COLLECTION_NAME, points=points)
print(f"Upserted {len(points)} documents.")

Step 4 — Hybrid Search at Query Time

This is where RRF fusion happens. Qdrant’s query_points API runs both searches as prefetch operations, then fuses the results:

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    # Encode query with task instruction — this is the key Qwen3 feature
    task_instruction = "Given a user question, retrieve relevant passages that answer the question"
    query_with_instruction = f"Instruct: {task_instruction}\nQuery:{query}"

    query_dense = embedding_model.encode(
        [query_with_instruction],
        normalize_embeddings=True,
    )[0].tolist()

    # Generate sparse vector for query
    query_sparse = list(bm25_model.embed([query]))[0]

    results = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            # Dense semantic search — retrieve top 50 candidates
            models.Prefetch(
                query=query_dense,
                using="dense",
                limit=50,
            ),
            # BM25 sparse search — retrieve top 50 candidates
            models.Prefetch(
                query=models.SparseVector(
                    indices=query_sparse.indices.tolist(),
                    values=query_sparse.values.tolist(),
                ),
                using="sparse",
                limit=50,
            ),
        ],
        # Fuse with Reciprocal Rank Fusion, return top_k
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
        with_payload=True,
    )

    return [
        {"text": r.payload["text"], "score": r.score, "id": r.id}
        for r in results.points
    ]


# Test it
hits = hybrid_search("what is BM25 used for?")
for hit in hits:
    print(f"[{hit['score']:.4f}] {hit['text'][:80]}")

Step 5 — Generate with an LLM

The retrieval half is done. Pass the results to any LLM:

from openai import OpenAI

llm_client = OpenAI(api_key="your-key-here")  # or point to a local vLLM server

def rag_answer(question: str) -> str:
    chunks = hybrid_search(question, top_k=3)
    context = "\n\n".join([f"[{i+1}] {c['text']}" for i, c in enumerate(chunks)])

    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or any model you're running locally
        messages=[
            {
                "role": "system",
                "content": "Answer the question using only the provided context. Be concise.",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0,
        max_tokens=512,
    )
    return response.choices[0].message.content

answer = rag_answer("What benchmarks does Qwen3-Embedding achieve?")
print(answer)

Evaluation and Tuning

Measuring Retrieval Quality

Before tuning, you need a baseline. If you have question-answer pairs from your domain:

def evaluate_retrieval(qa_pairs: list[dict], top_k: int = 5) -> dict:
    """
    qa_pairs: [{"question": "...", "relevant_doc_ids": [0, 3]}, ...]
    Returns hit rate and MRR.
    """
    hit_count = 0
    reciprocal_ranks = []

    for pair in qa_pairs:
        hits = hybrid_search(pair["question"], top_k=top_k)
        retrieved_ids = [h["id"] for h in hits]
        relevant = set(pair["relevant_doc_ids"])

        # Hit rate: did any relevant doc appear in top_k?
        if any(rid in relevant for rid in retrieved_ids):
            hit_count += 1

        # MRR: rank of first relevant document
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)

    hit_rate = hit_count / len(qa_pairs)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return {"hit_rate": hit_rate, "mrr": mrr}
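
The MRR logic above can be sanity-checked on hand-built rankings, with no Qdrant instance involved:

```python
def mrr(ranked_ids_per_query, relevant_per_query):
    """Mean reciprocal rank over pre-built ranked lists."""
    rrs = []
    for ranked, relevant in zip(ranked_ids_per_query, relevant_per_query):
        rr = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    return sum(rrs) / len(rrs)

# Query 1: first relevant doc at rank 2 (RR = 0.5); query 2: rank 1 (RR = 1.0)
print(mrr([[7, 3, 9], [0, 2]], [{3}, {0}]))  # 0.75
```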

Tuning Tips

Prefetch limit: The 50-candidate prefetch is a starting point. If your corpus is over 100K documents, push this to 100 or 200. RRF is cheap; the bottleneck is ANN search accuracy at higher recall levels.

Embedding truncation vs full 4096: Run your evaluation both ways. For most English-only corpora, 1024 dimensions loses under 1% hit rate while cutting storage and search latency by 4x. For multilingual corpora, stay at 4096.
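A back-of-envelope check on the storage side of that 4x claim, assuming vectors are stored as float32 (4 bytes per dimension) and ignoring index and payload overhead:

```python
# Raw vector storage per point at float32 (4 bytes/dim)
full = 4096 * 4        # 16384 bytes = 16 KiB per vector
truncated = 1024 * 4   # 4096 bytes = 4 KiB per vector
print(full // truncated)  # 4
# At 1M points that's roughly 16 GB vs 4 GB of raw vector data
```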

Task instructions matter: The Instruct: ... \nQuery: prefix is not optional. Qwen3-Embedding was trained to use it—omitting it measurably drops retrieval scores on out-of-domain queries. You only add this to queries, not to documents.

Chunk size: Qwen3’s 32K context window means you can embed very long passages, but shorter chunks (256-512 tokens) usually give better retrieval precision. The model captures long-range context well, but matching against a specific paragraph is easier than matching against a 10-page document.
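A minimal word-based chunker along those lines (a sketch that uses whitespace-separated words as a rough proxy for tokens; for token-accurate sizes, count with the model's tokenizer instead):

```python
def chunk_by_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so a sentence straddling a
    boundary is fully contained in at least one chunk.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_by_words(text)
print(len(chunks))  # 4 chunks, each at most 300 words, with 50-word overlap
```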

Common Issues

KeyError: 'qwen3' on model load: You’re on transformers<4.51.0. Upgrade:

pip install "transformers>=4.51.0"

Qdrant ValueError: Sparse vector indices must be sorted: fastembed normally returns sorted indices, but double-check when building PointStructs manually. Sort them:

indices, values = zip(*sorted(zip(sparse_vec.indices, sparse_vec.values)))
sparse_vector = models.SparseVector(indices=list(indices), values=list(values))

OOM with Qwen3-Embedding-8B on 16 GB GPU: Use float16 explicitly and reduce batch size:

embedding_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-8B",
    model_kwargs={"torch_dtype": torch.float16, "device_map": "auto"},
    tokenizer_kwargs={"padding_side": "left"},
)
doc_embeddings = embedding_model.encode(documents, batch_size=2)

If you’re still tight on memory, switch to the 4B variant—Qwen/Qwen3-Embedding-4B—which scores 69.45 on MTEB multilingual and fits comfortably in 12 GB.