The Quick Version#
Extractive QA finds the exact text span in a document that answers a question. Unlike generative QA (where an LLM writes an answer), extractive QA points to specific words in the source, so it cannot hallucinate text that isn't there: the answer is always a direct quote from the context. The model can still pick the wrong span, but never an invented one.
```shell
pip install transformers torch
```
```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = """
The Transformer architecture was introduced in the paper "Attention Is All You Need"
by Vaswani et al. in 2017. It replaced recurrent neural networks with self-attention
mechanisms, enabling much faster training through parallelization. The original
Transformer had 65 million parameters and was trained on the WMT 2014 English-to-German
translation dataset. BERT, introduced by Google in 2018, used only the encoder part
of the Transformer for language understanding tasks.
"""

result = qa(question="When was the Transformer architecture introduced?", context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
print(f"Span: [{result['start']}:{result['end']}]")
# Answer: 2017
# Score: 0.9847
# Span: [123:127]
```
The model returns the answer text, a confidence score, and the exact character positions in the context. A score above 0.5 generally means the model is confident. Below 0.1 usually means the answer isn’t in the context.
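Because `start` and `end` are character offsets into the context string, you can recover or highlight the quoted span directly, without trusting the `answer` field. A quick sketch using a hand-written result dict shaped like the pipeline's output (`highlight_answer` is a hypothetical helper, not part of `transformers`):

```python
def highlight_answer(context: str, result: dict) -> str:
    """Wrap the answer span in ** markers using the character offsets."""
    s, e = result["start"], result["end"]
    assert context[s:e] == result["answer"]  # offsets index the context directly
    return context[:s] + "**" + context[s:e] + "**" + context[e:]

# A result dict shaped like the pipeline's output
ctx = "The Transformer was introduced in 2017."
res = {"answer": "2017", "score": 0.98, "start": 34, "end": 38}
print(highlight_answer(ctx, res))
# The Transformer was introduced in **2017**.
```

This is handy for building UIs that show answers in context, and the assertion doubles as a sanity check that the offsets line up.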
Handling Long Documents#
The model has a 512-token limit. For longer documents, split into overlapping chunks and run QA on each:
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

def answer_long_document(question: str, document: str, chunk_size: int = 384, stride: int = 128) -> dict:
    """Run QA across overlapping chunks of a long document."""
    # Tokenize with a sliding window; stride is the token overlap between chunks
    encodings = tokenizer(
        question,
        document,
        max_length=chunk_size,
        stride=stride,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        return_tensors="pt",
    )
    best_answer = {"answer": "", "score": 0.0, "start": 0, "end": 0}
    for i in range(encodings["input_ids"].shape[0]):
        input_ids = encodings["input_ids"][i].unsqueeze(0)
        attention_mask = encodings["attention_mask"][i].unsqueeze(0)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        start_logits = outputs.start_logits.squeeze(0)
        end_logits = outputs.end_logits.squeeze(0)
        # Mask question, special, and padding tokens so the span can only
        # come from the document itself
        sequence_ids = encodings.sequence_ids(i)
        mask = torch.tensor([sid != 1 for sid in sequence_ids])
        start_logits = start_logits.masked_fill(mask, float("-inf"))
        end_logits = end_logits.masked_fill(mask, float("-inf"))
        # Find the best start-end span
        start_idx = torch.argmax(start_logits).item()
        end_idx = torch.argmax(end_logits).item()
        if end_idx >= start_idx:
            score = (start_logits[start_idx] + end_logits[end_idx]).item()
            if score > best_answer["score"]:
                # Map token positions back to character positions
                offsets = encodings["offset_mapping"][i]
                start_char = offsets[start_idx][0].item()
                end_char = offsets[end_idx][1].item()
                answer_tokens = input_ids[0][start_idx:end_idx + 1]
                answer_text = tokenizer.decode(answer_tokens, skip_special_tokens=True)
                best_answer = {
                    "answer": answer_text,
                    "score": score,
                    "start": start_char,
                    "end": end_char,
                    "chunk": i,
                }
    return best_answer

# Works on documents of any length
long_doc = open("research_paper.txt").read()  # could be 10,000+ words
result = answer_long_document("What dataset was used for training?", long_doc)
print(f"Answer: {result['answer']} (score: {result['score']:.2f})")
```
The stride parameter controls overlap between chunks: 128 tokens of overlap ensures that an answer falling near a chunk boundary still appears intact in at least one chunk.
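To see how the window slides, here is a back-of-the-envelope sketch of where each chunk begins (`window_starts` is a hypothetical helper; stride here means the token overlap, as in the tokenizer call above):

```python
def window_starts(n_tokens: int, chunk_size: int = 384, stride: int = 128) -> list[int]:
    """Start offsets of sliding-window chunks; stride is the token overlap."""
    step = chunk_size - stride  # each window advances by this many tokens
    starts = [0]
    while starts[-1] + chunk_size < n_tokens:
        starts.append(starts[-1] + step)
    return starts

print(window_starts(1000))  # [0, 256, 512, 768]
```

With `chunk_size=384` and `stride=128`, each window advances 256 tokens, so consecutive chunks share 128 tokens and any answer shorter than that appears whole in at least one chunk.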
Multi-Document QA#
Search across multiple documents and rank answers by confidence:
```python
from transformers import pipeline

def search_and_answer(question: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    """Find answers across multiple documents, ranked by confidence."""
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    all_answers = []
    for doc in documents:
        try:
            result = qa(question=question, context=doc["text"][:2000])
            all_answers.append({
                "answer": result["answer"],
                "score": result["score"],
                "source": doc["title"],
                "start": result["start"],
                "end": result["end"],
            })
        except Exception:
            continue  # skip documents the pipeline can't process (e.g. empty text)
    # Sort by confidence score
    all_answers.sort(key=lambda x: x["score"], reverse=True)
    return all_answers[:top_k]

documents = [
    {"title": "Python Basics", "text": "Python was created by Guido van Rossum and first released in 1991..."},
    {"title": "Java History", "text": "Java was developed by James Gosling at Sun Microsystems in 1995..."},
    {"title": "Rust Language", "text": "Rust was originally designed by Graydon Hoare at Mozilla Research..."},
]

answers = search_and_answer("Who created Python?", documents)
for a in answers:
    print(f"[{a['score']:.3f}] {a['answer']} (from: {a['source']})")
```
Fine-Tuning on Custom Data#
Pre-trained models work well on general questions. For domain-specific QA (medical, legal, technical), fine-tuning on your data improves accuracy significantly.
```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
)
from datasets import load_dataset

# Load your QA dataset (SQuAD format)
dataset = load_dataset("json", data_files={"train": "qa_train.json", "test": "qa_test.json"})

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

def preprocess(examples):
    """Tokenize and find answer span positions."""
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        # With overflowing tokens, one example can produce several chunks;
        # map each chunk back to the example it came from
        sample_idx = tokenized["overflow_to_sample_mapping"][i]
        answer = examples["answers"][sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Find token positions that correspond to the answer; the default (0, 0)
        # points at the [CLS]/<s> token, the "no answer in this chunk" convention
        token_start = 0
        token_end = 0
        for idx, (offset_start, offset_end) in enumerate(offsets):
            if sequence_ids[idx] != 1:
                continue  # skip question and special tokens
            if offset_start <= start_char < offset_end:
                token_start = idx
            if offset_start < end_char <= offset_end:
                token_end = idx
        start_positions.append(token_start)
        end_positions.append(token_end)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./qa-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=DefaultDataCollator(),
)
trainer.train()
```
Your training data should follow the SQuAD format:
```json
{
  "question": "What is the maximum context length?",
  "context": "The model supports a maximum context length of 512 tokens.",
  "answers": {
    "text": ["512 tokens"],
    "answer_start": [47]
  }
}
```
With 500-1000 annotated examples, expect 5-15% accuracy improvement on domain-specific questions.
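Misaligned `answer_start` offsets are the most common annotation bug, and they silently poison training. A small sanity check worth running before fine-tuning (`misaligned_answers` is a hypothetical helper, assuming a list of records shaped like the example above):

```python
def misaligned_answers(examples: list[dict]) -> list[int]:
    """Indices of examples where answer_start doesn't point at the answer text."""
    bad = []
    for i, ex in enumerate(examples):
        for text, start in zip(ex["answers"]["text"], ex["answers"]["answer_start"]):
            # The slice at answer_start must reproduce the answer exactly
            if ex["context"][start:start + len(text)] != text:
                bad.append(i)
    return bad

examples = [
    {"question": "What is the maximum context length?",
     "context": "The model supports a maximum context length of 512 tokens.",
     "answers": {"text": ["512 tokens"], "answer_start": [47]}},
    {"question": "Who created Python?",
     "context": "Python was created by Guido van Rossum.",
     "answers": {"text": ["Guido van Rossum"], "answer_start": [10]}},  # off by 12 chars
]
print(misaligned_answers(examples))  # [1]
```

Any index this returns should be fixed (or dropped) before training; a handful of bad offsets can noticeably hurt span accuracy.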
Confidence Thresholds and “I Don’t Know”#
A good QA system admits when it doesn’t know the answer instead of guessing:
```python
from transformers import pipeline

def answer_with_confidence(question: str, context: str, threshold: float = 0.1) -> dict:
    """Return an answer only if the model is confident enough."""
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
    result = qa(question=question, context=context)
    if result["score"] < threshold:
        return {
            "answer": None,
            "message": "The answer was not found in the provided context.",
            "best_guess": result["answer"],
            "confidence": result["score"],
        }
    return {
        "answer": result["answer"],
        "confidence": result["score"],
    }

# This should return "not found"
print(answer_with_confidence(
    "What is the weather today?",
    "The Transformer architecture was introduced in 2017.",
))
```
Common Errors and Fixes#
**Model always returns low confidence scores**
The context is too short or doesn’t contain the answer. Check that your context actually has the information. Also, some models return logits instead of probabilities — pass through softmax if comparing across questions.
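If you are working with raw `start_logits`/`end_logits` (as in the long-document loop above) rather than the pipeline, normalize before comparing scores across questions. A minimal pure-Python sketch of that softmax step, standing in for `torch.softmax` on hypothetical logits:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical start/end logits for a 4-token chunk
start_probs = softmax([1.0, 4.0, 0.5, -2.0])
end_probs = softmax([0.2, 1.0, 5.0, -1.0])

# Span probability = P(start) * P(end); this IS comparable across questions,
# unlike raw logit sums, whose scale varies from input to input
span_prob = max(start_probs) * max(end_probs)
print(f"P(span) = {span_prob:.3f}")
```

The pipeline already does this internally, which is why its `score` is a probability in [0, 1].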
**Wrong answer span — model picks nearby text**
The answer is ambiguous in context. If “2017” appears twice, the model might pick the wrong instance. Use handle_impossible_answer=True with SQuAD 2.0 models to let the model say “no answer” when confused.
**Slow inference on CPU**
Use a distilled model like distilbert-base-cased-distilled-squad — it’s 40% faster with ~95% of the accuracy. Or batch multiple questions together.
**Token limit exceeded on long contexts**
Use the sliding window approach from the “Handling Long Documents” section. Don’t just truncate — you’ll cut off the part that contains the answer.
**Fine-tuned model overfits on small datasets**
Reduce learning rate to 1e-5, add weight decay (0.01), and use early stopping. With fewer than 500 examples, consider few-shot approaches with generative LLMs instead.
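In `transformers`, early stopping is available as `EarlyStoppingCallback`; the underlying logic is just "quit when eval loss stops improving". As a standalone sketch (`should_stop` is a hypothetical helper):

```python
def should_stop(eval_losses: list[float], patience: int = 2) -> bool:
    """Stop once eval loss hasn't improved for `patience` evaluations."""
    best_idx = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return len(eval_losses) - 1 - best_idx >= patience

print(should_stop([1.20, 0.95, 0.90, 0.92]))  # best was one eval ago -> False
print(should_stop([1.20, 0.90, 0.93, 0.95]))  # two evals without improvement -> True
```

With small datasets, stopping a few hundred steps early routinely beats training all three epochs.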
When to Use Which#
Use extractive QA when you need verifiable answers with exact source attribution — compliance, legal, medical contexts where hallucination is unacceptable.
Use generative QA (RAG with LLMs) when you need synthesized answers from multiple sources, when the answer requires reasoning beyond text span extraction, or when natural language fluency matters more than exact quotes.