The Quick Version#
Extractive QA finds the exact text span in a document that answers a question. Unlike generative QA (where an LLM writes an answer), extractive QA points to specific words in the source, so it cannot hallucinate text that isn't there: the answer is always a direct quote from the context. The model can still pick the wrong span, but never an invented one.
```shell
pip install transformers torch
```
```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = """
The Transformer architecture was introduced in the paper "Attention Is All You Need"
by Vaswani et al. in 2017. It replaced recurrent neural networks with self-attention
mechanisms, enabling much faster training through parallelization. The original
Transformer had 65 million parameters and was trained on the WMT 2014 English-to-German
translation dataset. BERT, introduced by Google in 2018, used only the encoder part
of the Transformer for language understanding tasks.
"""

result = qa(question="When was the Transformer architecture introduced?", context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
print(f"Span: [{result['start']}:{result['end']}]")
# Answer: 2017
# Score: 0.9847
# Span: [123:127]
```
The model returns the answer text, a confidence score, and the exact character positions in the context. A score above 0.5 generally means the model is confident. Below 0.1 usually means the answer isn’t in the context.
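Because `start` and `end` are character offsets into the context string, you can recover or highlight the quoted span directly, without trusting the `answer` field. A quick sketch using a hand-written result dict shaped like the pipeline's output (`highlight_answer` is a hypothetical helper, not part of `transformers`):

```python
def highlight_answer(context: str, result: dict) -> str:
    """Wrap the answer span in ** markers using the character offsets."""
    s, e = result["start"], result["end"]
    assert context[s:e] == result["answer"]  # offsets index the context directly
    return context[:s] + "**" + context[s:e] + "**" + context[e:]

# A result dict shaped like the pipeline's output
ctx = "The Transformer was introduced in 2017."
res = {"answer": "2017", "score": 0.98, "start": 34, "end": 38}
print(highlight_answer(ctx, res))
# The Transformer was introduced in **2017**.
```

This is handy for building UIs that show answers in context, and the assertion doubles as a sanity check that the offsets line up.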
Handling Long Documents#
The model has a 512-token limit. For longer documents, split into overlapping chunks and run QA on each:
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

def answer_long_document(question: str, document: str, chunk_size: int = 384, stride: int = 128) -> dict:
    """Run QA across overlapping chunks of a long document."""
    # Tokenize with a sliding window; stride is the token overlap between chunks
    encodings = tokenizer(
        question,
        document,
        max_length=chunk_size,
        stride=stride,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        return_tensors="pt",
    )
    best_answer = {"answer": "", "score": 0.0, "start": 0, "end": 0}
    for i in range(encodings["input_ids"].shape[0]):
        input_ids = encodings["input_ids"][i].unsqueeze(0)
        attention_mask = encodings["attention_mask"][i].unsqueeze(0)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        start_logits = outputs.start_logits.squeeze(0)
        end_logits = outputs.end_logits.squeeze(0)
        # Mask question, special, and padding tokens so the span can only
        # come from the document itself
        sequence_ids = encodings.sequence_ids(i)
        mask = torch.tensor([sid != 1 for sid in sequence_ids])
        start_logits = start_logits.masked_fill(mask, float("-inf"))
        end_logits = end_logits.masked_fill(mask, float("-inf"))
        # Find the best start-end span
        start_idx = torch.argmax(start_logits).item()
        end_idx = torch.argmax(end_logits).item()
        if end_idx >= start_idx:
            score = (start_logits[start_idx] + end_logits[end_idx]).item()
            if score > best_answer["score"]:
                # Map token positions back to character positions
                offsets = encodings["offset_mapping"][i]
                start_char = offsets[start_idx][0].item()
                end_char = offsets[end_idx][1].item()
                answer_tokens = input_ids[0][start_idx:end_idx + 1]
                answer_text = tokenizer.decode(answer_tokens, skip_special_tokens=True)
                best_answer = {
                    "answer": answer_text,
                    "score": score,
                    "start": start_char,
                    "end": end_char,
                    "chunk": i,
                }
    return best_answer

# Works on documents of any length
long_doc = open("research_paper.txt").read()  # could be 10,000+ words
result = answer_long_document("What dataset was used for training?", long_doc)
print(f"Answer: {result['answer']} (score: {result['score']:.2f})")
```
The stride parameter controls overlap between chunks: 128 tokens of overlap ensures that an answer falling near a chunk boundary still appears intact in at least one chunk.
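To see how the window slides, here is a back-of-the-envelope sketch of where each chunk begins (`window_starts` is a hypothetical helper; stride here means the token overlap, as in the tokenizer call above):

```python
def window_starts(n_tokens: int, chunk_size: int = 384, stride: int = 128) -> list[int]:
    """Start offsets of sliding-window chunks; stride is the token overlap."""
    step = chunk_size - stride  # each window advances by this many tokens
    starts = [0]
    while starts[-1] + chunk_size < n_tokens:
        starts.append(starts[-1] + step)
    return starts

print(window_starts(1000))  # [0, 256, 512, 768]
```

With `chunk_size=384` and `stride=128`, each window advances 256 tokens, so consecutive chunks share 128 tokens and any answer shorter than that appears whole in at least one chunk.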
Multi-Document QA#
Search across multiple documents and rank answers by confidence:
```python
from transformers import pipeline

def search_and_answer(question: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    """Find answers across multiple documents, ranked by confidence."""
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    all_answers = []
    for doc in documents:
        try:
            result = qa(question=question, context=doc["text"][:2000])
            all_answers.append({
                "answer": result["answer"],
                "score": result["score"],
                "source": doc["title"],
                "start": result["start"],
                "end": result["end"],
            })
        except Exception:
            continue  # skip documents the pipeline can't process (e.g. empty text)
    # Sort by confidence score
    all_answers.sort(key=lambda x: x["score"], reverse=True)
    return all_answers[:top_k]

documents = [
    {"title": "Python Basics", "text": "Python was created by Guido van Rossum and first released in 1991..."},
    {"title": "Java History", "text": "Java was developed by James Gosling at Sun Microsystems in 1995..."},
    {"title": "Rust Language", "text": "Rust was originally designed by Graydon Hoare at Mozilla Research..."},
]

answers = search_and_answer("Who created Python?", documents)
for a in answers:
    print(f"[{a['score']:.3f}] {a['answer']} (from: {a['source']})")
```
Fine-Tuning on Custom Data#
Pre-trained models work well on general questions. For domain-specific QA (medical, legal, technical), fine-tuning on your data improves accuracy significantly.
```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
)
from datasets import load_dataset

# Load your QA dataset (SQuAD format)
dataset = load_dataset("json", data_files={"train": "qa_train.json", "test": "qa_test.json"})

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

def preprocess(examples):
    """Tokenize and find answer span positions."""
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        # With overflowing tokens, one example can produce several chunks;
        # map each chunk back to the example it came from
        sample_idx = tokenized["overflow_to_sample_mapping"][i]
        answer = examples["answers"][sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Find token positions that correspond to the answer; the default (0, 0)
        # points at the [CLS]/<s> token, the "no answer in this chunk" convention
        token_start = 0
        token_end = 0
        for idx, (offset_start, offset_end) in enumerate(offsets):
            if sequence_ids[idx] != 1:
                continue  # skip question and special tokens
            if offset_start <= start_char < offset_end:
                token_start = idx
            if offset_start < end_char <= offset_end:
                token_end = idx
        start_positions.append(token_start)
        end_positions.append(token_end)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./qa-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=DefaultDataCollator(),
)
trainer.train()
```
Your training data should follow the SQuAD format:
```json
{
  "question": "What is the maximum context length?",
  "context": "The model supports a maximum context length of 512 tokens.",
  "answers": {
    "text": ["512 tokens"],
    "answer_start": [47]
  }
}
```
With 500-1000 annotated examples, expect 5-15% accuracy improvement on domain-specific questions.
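Misaligned `answer_start` offsets are the most common annotation bug, and they silently poison training. A small sanity check worth running before fine-tuning (`misaligned_answers` is a hypothetical helper, assuming a list of records shaped like the example above):

```python
def misaligned_answers(examples: list[dict]) -> list[int]:
    """Indices of examples where answer_start doesn't point at the answer text."""
    bad = []
    for i, ex in enumerate(examples):
        for text, start in zip(ex["answers"]["text"], ex["answers"]["answer_start"]):
            # The slice at answer_start must reproduce the answer exactly
            if ex["context"][start:start + len(text)] != text:
                bad.append(i)
    return bad

examples = [
    {"question": "What is the maximum context length?",
     "context": "The model supports a maximum context length of 512 tokens.",
     "answers": {"text": ["512 tokens"], "answer_start": [47]}},
    {"question": "Who created Python?",
     "context": "Python was created by Guido van Rossum.",
     "answers": {"text": ["Guido van Rossum"], "answer_start": [10]}},  # off by 12 chars
]
print(misaligned_answers(examples))  # [1]
```

Any index this returns should be fixed (or dropped) before training; a handful of bad offsets can noticeably hurt span accuracy.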
Confidence Thresholds and “I Don’t Know”#
A good QA system admits when it doesn’t know the answer instead of guessing:
```python
from transformers import pipeline

def answer_with_confidence(question: str, context: str, threshold: float = 0.1) -> dict:
    """Return an answer only if the model is confident enough."""
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    result = qa(question=question, context=context)
    if result["score"] < threshold:
        return {
            "answer": None,
            "message": "The answer was not found in the provided context.",
            "best_guess": result["answer"],
            "confidence": result["score"],
        }
    return {
        "answer": result["answer"],
        "confidence": result["score"],
    }

# This should return "not found"
print(answer_with_confidence(
    "What is the weather today?",
    "The Transformer architecture was introduced in 2017.",
))
```
Common Errors and Fixes#
**Model always returns low confidence scores**
The context is too short or doesn’t contain the answer. Check that your context actually has the information. Also, some models return logits instead of probabilities — pass through softmax if comparing across questions.
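If you are working with raw `start_logits`/`end_logits` (as in the long-document loop above) rather than the pipeline, normalize before comparing scores across questions. A minimal pure-Python sketch of that softmax step, standing in for `torch.softmax` on hypothetical logits:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical start/end logits for a 4-token chunk
start_probs = softmax([1.0, 4.0, 0.5, -2.0])
end_probs = softmax([0.2, 1.0, 5.0, -1.0])

# Span probability = P(start) * P(end); this IS comparable across questions,
# unlike raw logit sums, whose scale varies from input to input
span_prob = max(start_probs) * max(end_probs)
print(f"P(span) = {span_prob:.3f}")
```

The pipeline already does this internally, which is why its `score` is a probability in [0, 1].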
**Wrong answer span — model picks nearby text**
The answer is ambiguous in context. If “2017” appears twice, the model might pick the wrong instance. Use handle_impossible_answer=True with SQuAD 2.0 models to let the model say “no answer” when confused.
**Slow inference on CPU**
Use a distilled model like distilbert-base-cased-distilled-squad — it’s 40% faster with ~95% of the accuracy. Or batch multiple questions together.
**Token limit exceeded on long contexts**
Use the sliding window approach from the “Handling Long Documents” section. Don’t just truncate — you’ll cut off the part that contains the answer.
**Fine-tuned model overfits on small datasets**
Reduce learning rate to 1e-5, add weight decay (0.01), and use early stopping. With fewer than 500 examples, consider few-shot approaches with generative LLMs instead.
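In `transformers`, early stopping is available as `EarlyStoppingCallback`; the underlying logic is just "quit when eval loss stops improving". As a standalone sketch (`should_stop` is a hypothetical helper):

```python
def should_stop(eval_losses: list[float], patience: int = 2) -> bool:
    """Stop once eval loss hasn't improved for `patience` evaluations."""
    best_idx = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return len(eval_losses) - 1 - best_idx >= patience

print(should_stop([1.20, 0.95, 0.90, 0.92]))  # best was one eval ago -> False
print(should_stop([1.20, 0.90, 0.93, 0.95]))  # two evals without improvement -> True
```

With small datasets, stopping a few hundred steps early routinely beats training all three epochs.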
When to Use Which#
Use extractive QA when you need verifiable answers with exact source attribution — compliance, legal, medical contexts where hallucination is unacceptable.
Use generative QA (RAG with LLMs) when you need synthesized answers from multiple sources, when the answer requires reasoning beyond text span extraction, or when natural language fluency matters more than exact quotes.