Natural language inference (NLI) answers a simple question: given two sentences, does the second one follow from the first, contradict it, or say something unrelated? That three-way classification – entailment, contradiction, neutral – is the backbone of fact verification, document consistency checks, and zero-shot classification.
You give the model a premise like “The restaurant closes at 9 PM” and a hypothesis like “The restaurant is open at midnight.” The model outputs contradiction. Change the hypothesis to “The restaurant is not open all night” and you get entailment. This is surprisingly useful once you wire it into a pipeline.
## Zero-Shot NLI with a Pre-Trained Model
The fastest way to get NLI predictions is the cross-encoder/nli-deberta-v3-base model. It’s a cross-encoder trained on SNLI and MultiNLI, and it outputs logits for all three labels directly.
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "The company reported record profits in Q3 2025."
hypothesis = "The company lost money last quarter."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=1).squeeze()
labels = ["contradiction", "entailment", "neutral"]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")

predicted = labels[probs.argmax()]
print(f"\nPrediction: {predicted}")
```
Output:
```
contradiction: 0.9842
entailment: 0.0031
neutral: 0.0127

Prediction: contradiction
```
The label ordering for cross-encoder/nli-deberta-v3-base is [contradiction, entailment, neutral] at indices 0, 1, 2. This matters – get it wrong and you’ll silently swap entailment with contradiction. Always check the model’s config.json for id2label to confirm.
You can also use facebook/bart-large-mnli through the zero-shot classification pipeline, which wraps NLI into a more convenient API:
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The patient's blood pressure dropped significantly after the medication.",
    candidate_labels=[
        "the medication lowered blood pressure",
        "the medication had no effect",
        "the medication raised blood pressure",
    ],
)

for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.4f}")
```
This repurposes NLI for arbitrary classification – each candidate label is treated as a hypothesis tested against the input premise.
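Under the hood, the pipeline builds one hypothesis per candidate label from a template and keeps the entailment probability for each. Here is a minimal sketch of the same idea using the cross-encoder model directly; the template string and the `build_hypotheses`/`zero_shot_scores` helpers are illustrative, not part of either library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def build_hypotheses(labels: list[str], template: str = "This example is about {}.") -> list[str]:
    """Turn candidate labels into NLI hypotheses, as the zero-shot pipeline does internally."""
    return [template.format(label) for label in labels]

def zero_shot_scores(text: str, labels: list[str],
                     model_name: str = "cross-encoder/nli-deberta-v3-base") -> dict[str, float]:
    """Score each label by the entailment probability of its templated hypothesis."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
    hypotheses = build_hypotheses(labels)
    # One premise-hypothesis pair per label, scored in a single forward pass
    inputs = tokenizer([text] * len(labels), hypotheses,
                       return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=1)
    # Index 1 = entailment for this model (confirm via config.id2label)
    return {label: probs[i][1].item() for i, label in enumerate(labels)}

if __name__ == "__main__":
    print(zero_shot_scores("The stock market rallied after the Fed's announcement.",
                           ["finance", "sports", "cooking"]))
```

Normalizing the entailment scores across labels (as the pipeline does with softmax) is a further refinement if you need the scores to sum to one.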
## Building a Batch NLI Pipeline
Processing one pair at a time is fine for prototyping, but real workloads need batch inference. Here’s a pipeline that processes many premise-hypothesis pairs and returns structured results:
```python
from dataclasses import dataclass

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


@dataclass
class NLIResult:
    premise: str
    hypothesis: str
    label: str
    scores: dict[str, float]


model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]


def predict_nli_batch(pairs: list[tuple[str, str]], batch_size: int = 16) -> list[NLIResult]:
    results = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i : i + batch_size]
        premises = [p for p, _ in batch]
        hypotheses = [h for _, h in batch]
        inputs = tokenizer(
            premises,
            hypotheses,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=512,
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=1)
        for j, (premise, hypothesis) in enumerate(batch):
            scores = {label: probs[j][k].item() for k, label in enumerate(LABELS)}
            predicted = LABELS[probs[j].argmax()]
            results.append(NLIResult(premise=premise, hypothesis=hypothesis, label=predicted, scores=scores))
    return results


# Test with multiple pairs
pairs = [
    ("All employees must attend the Monday meeting.", "The Monday meeting is optional."),
    ("The bridge was built in 1934.", "The bridge is less than 100 years old."),
    ("She finished the marathon in under three hours.", "She is a fast runner."),
    ("Python 3.12 added the type statement.", "Python 3.12 removed type hints."),
    ("The server runs on port 8080.", "The database uses PostgreSQL."),
]

results = predict_nli_batch(pairs)
for r in results:
    print(f"[{r.label.upper():>13}] {r.premise}")
    print(f"                {r.hypothesis}")
    print(f"                scores: { {k: round(v, 3) for k, v in r.scores.items()} }")
    print()
```
The batch_size parameter controls how many pairs get tokenized and forwarded together. On a GPU, 32-64 pairs per batch works well. On CPU, keep it at 8-16 to avoid memory spikes. The padding handles variable-length inputs within each batch.
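When a GPU is available, both the model and each tokenized batch have to live on the same device, or the forward pass raises a device mismatch error. A sketch of the device handling, assuming the same cross-encoder model; the `pick_batch_size` helper just encodes the rule of thumb above and is not a library function:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def pick_batch_size(device: str) -> int:
    # Rule of thumb from the text: 32-64 pairs per batch on GPU, 8-16 on CPU
    return 32 if device == "cuda" else 8

def predict_probs(pairs: list[tuple[str, str]],
                  model_name: str = "cross-encoder/nli-deberta-v3-base") -> torch.Tensor:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()
    batch_size = pick_batch_size(device)
    all_probs = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i : i + batch_size]
        inputs = tokenizer([p for p, _ in batch], [h for _, h in batch],
                           return_tensors="pt", truncation=True, padding=True, max_length=512)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # tensors must be on the model's device
        with torch.no_grad():
            logits = model(**inputs).logits
        all_probs.append(torch.softmax(logits, dim=1).cpu())  # move results back to CPU
    return torch.cat(all_probs)
```

Moving results back to the CPU after each batch keeps GPU memory flat regardless of how many pairs you process in total.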
## Using NLI for Fact Verification
Here’s where NLI gets practical. Say you have a claim from a user or an LLM, and you have evidence documents. You want to know: does the evidence support or contradict the claim?
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]


def verify_claim(claim: str, evidence_passages: list[str], threshold: float = 0.7) -> dict:
    best_support = {"score": 0.0, "passage": "", "label": "neutral"}
    best_contradiction = {"score": 0.0, "passage": "", "label": "neutral"}
    for passage in evidence_passages:
        # Premise = evidence passage, hypothesis = claim
        inputs = tokenizer(passage, claim, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=1).squeeze()
        entailment_score = probs[1].item()
        contradiction_score = probs[0].item()
        if entailment_score > best_support["score"]:
            best_support = {"score": entailment_score, "passage": passage, "label": "entailment"}
        if contradiction_score > best_contradiction["score"]:
            best_contradiction = {"score": contradiction_score, "passage": passage, "label": "contradiction"}
    if best_support["score"] >= threshold:
        verdict = "SUPPORTED"
        evidence = best_support
    elif best_contradiction["score"] >= threshold:
        verdict = "CONTRADICTED"
        evidence = best_contradiction
    else:
        verdict = "NOT ENOUGH EVIDENCE"
        evidence = best_support if best_support["score"] > best_contradiction["score"] else best_contradiction
    return {"claim": claim, "verdict": verdict, "confidence": evidence["score"], "evidence": evidence["passage"]}


# Example: verify claims against a knowledge base
evidence = [
    "Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning.",
    "Elon Musk joined Tesla's board in 2004 as chairman after leading the Series A funding round.",
    "Tesla's first production vehicle, the Roadster, was delivered starting in 2008.",
]
claims = [
    "Elon Musk founded Tesla.",
    "Tesla's first car was the Roadster.",
    "Tesla was founded in 2010.",
]

for claim in claims:
    result = verify_claim(claim, evidence)
    print(f"Claim: {result['claim']}")
    print(f"Verdict: {result['verdict']} (confidence: {result['confidence']:.3f})")
    print(f"Evidence: {result['evidence'][:80]}...")
    print()
```
The premise is the evidence passage, and the hypothesis is the claim. This ordering matters because NLI models are trained with premise first, hypothesis second. Swapping them gives different (and usually worse) results.
The threshold of 0.7 is a reasonable starting point. Lower it if you want more aggressive detection at the cost of more false positives. Raise it if precision matters more than recall.
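If you have even a small set of human-labeled claims, you can pick the threshold empirically: sweep candidate values and measure precision and recall of the SUPPORTED verdict at each. A self-contained sketch; the `labeled` scores below are made up for illustration:

```python
def precision_recall(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """scored holds (entailment_score, human_says_supported) pairs."""
    tp = sum(1 for s, truth in scored if s >= threshold and truth)
    fp = sum(1 for s, truth in scored if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored if s < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores paired with human judgments
labeled = [(0.95, True), (0.82, True), (0.74, False), (0.65, True), (0.40, False), (0.15, False)]
for t in (0.5, 0.7, 0.9):
    p, r = precision_recall(labeled, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On real data you would compute the scores by running `verify_claim` over the labeled claims, then choose the threshold whose precision/recall trade-off fits your application.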
## Fine-Tuning an NLI Model on Custom Data
Pre-trained NLI models work well for general text, but domain-specific language (legal, medical, financial) often needs fine-tuning. Here’s how to fine-tune microsoft/deberta-v3-base on custom NLI pairs:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np

# Custom training data: (premise, hypothesis, label)
# Labels: 0 = contradiction, 1 = entailment, 2 = neutral
train_examples = [
    ("The server response time is under 200ms.", "The API is fast.", 1),
    ("The server response time is under 200ms.", "The server is slow and unresponsive.", 0),
    ("The server response time is under 200ms.", "The server uses PostgreSQL.", 2),
    ("The deployment failed due to a missing dependency.", "The deployment was successful.", 0),
    ("The deployment failed due to a missing dependency.", "There was a package that wasn't installed.", 1),
    ("The deployment failed due to a missing dependency.", "The team uses Kubernetes.", 2),
    ("All unit tests pass on the main branch.", "The main branch has no test failures.", 1),
    ("All unit tests pass on the main branch.", "Several tests are failing in CI.", 0),
    ("All unit tests pass on the main branch.", "The project uses pytest.", 2),
    ("GPU utilization peaked at 95% during training.", "The GPU was heavily loaded.", 1),
    ("GPU utilization peaked at 95% during training.", "The GPU was idle during the run.", 0),
    ("GPU utilization peaked at 95% during training.", "The model has 7 billion parameters.", 2),
    ("The model accuracy on the test set is 94.2%.", "The model performs well on evaluation.", 1),
    ("The model accuracy on the test set is 94.2%.", "The model failed to beat the baseline.", 0),
    ("The model accuracy on the test set is 94.2%.", "Training took six hours.", 2),
]
# In production, you'd have hundreds or thousands of examples.
# This small set demonstrates the data format.

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_ds = Dataset.from_dict({
    "premise": [ex[0] for ex in train_examples],
    "hypothesis": [ex[1] for ex in train_examples],
    "label": [ex[2] for ex in train_examples],
})


def tokenize_fn(batch):
    return tokenizer(
        batch["premise"],
        batch["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )


train_ds = train_ds.map(tokenize_fn, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
    id2label={0: "contradiction", 1: "entailment", 2: "neutral"},
    label2id={"contradiction": 0, "entailment": 1, "neutral": 2},
)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()
    return {"accuracy": accuracy}


training_args = TrainingArguments(
    output_dir="./nli-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    compute_metrics=compute_metrics,
)

trainer.train()
model.save_pretrained("./nli-finetuned")
tokenizer.save_pretrained("./nli-finetuned")
print("Model saved to ./nli-finetuned")
```
A few things to watch for: deberta-v3-base is a solid choice for fine-tuning because it’s smaller than the large variants but still punches well above its weight on NLI benchmarks. With 500+ training examples you’ll see meaningful gains on domain-specific text. With fewer than that, consider starting from an already NLI-trained checkpoint like cross-encoder/nli-deberta-v3-base instead of the base model.
The id2label and label2id mappings in from_pretrained are critical. If your training labels don’t match the mapping, the model will learn the wrong associations and predictions will be garbage.
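A cheap way to catch such a mismatch before wasting a training run is to assert that the model config's mapping matches the convention your dataset was encoded with. A minimal sketch; `assert_label_mapping` is a hypothetical helper, not a transformers API:

```python
EXPECTED = {0: "contradiction", 1: "entailment", 2: "neutral"}

def assert_label_mapping(id2label: dict, expected: dict = EXPECTED) -> None:
    """Fail fast if the model's label mapping differs from the dataset's encoding."""
    normalized = {int(k): str(v).lower() for k, v in id2label.items()}
    if normalized != expected:
        raise ValueError(f"Label mapping mismatch: model has {normalized}, dataset assumes {expected}")

# Call right after loading the model, before trainer.train():
# assert_label_mapping(model.config.id2label)
```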
## Common Errors and Fixes
### Token length exceeded: inputs silently truncated
DeBERTa models have a 512-token limit. With truncation=True (the default "longest_first" strategy for sentence pairs), the tokenizer trims tokens from whichever sequence is currently longer, so a long premise and a long hypothesis can both silently lose content. To guarantee the hypothesis stays intact, set truncation="only_first" so only the premise gets truncated, or split long documents into chunks and score each chunk separately:
```python
# Truncate premise (first argument) instead of hypothesis
inputs = tokenizer(
    premise,
    hypothesis,
    return_tensors="pt",
    truncation="only_first",
    max_length=512,
)
```
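For the chunking route, one simple approach is overlapping word windows: score each window against the hypothesis and keep the strongest entailment and contradiction signals. A sketch under the assumption that ~200-word windows stay safely under the 512-token limit; `chunk_words` and `max_chunk_scores` are illustrative helpers, not library functions:

```python
import torch

def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split a long document into overlapping word windows."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

def max_chunk_scores(document: str, hypothesis: str, tokenizer, model) -> dict[str, float]:
    """Best entailment/contradiction scores across all chunks of the document."""
    best = {"contradiction": 0.0, "entailment": 0.0}
    for chunk in chunk_words(document):
        inputs = tokenizer(chunk, hypothesis, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=1).squeeze()
        best["contradiction"] = max(best["contradiction"], probs[0].item())  # index 0 = contradiction
        best["entailment"] = max(best["entailment"], probs[1].item())  # index 1 = entailment
    return best
```

Taking the max over chunks matches the fact-verification logic above: one strongly entailing (or contradicting) passage is enough to settle the claim, even if the other chunks are neutral.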
### Label mapping mismatch
Different NLI models use different label orderings. facebook/bart-large-mnli uses [contradiction, neutral, entailment] while cross-encoder/nli-deberta-v3-base uses [contradiction, entailment, neutral]. Always check:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("cross-encoder/nli-deberta-v3-base")
print(config.id2label)
# {0: 'contradiction', 1: 'entailment', 2: 'neutral'}

config2 = AutoConfig.from_pretrained("facebook/bart-large-mnli")
print(config2.id2label)
# {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
```
Hardcoding label indices without checking id2label is the most common source of wrong predictions in NLI pipelines.
### CUDA out of memory on large batches
If you hit OOM during batch inference, reduce batch_size or move to CPU for that batch. You can also use half-precision inference to cut memory in half:
```python
model = model.half().cuda()  # FP16 inference, ~50% less VRAM
```
For fine-tuning, add fp16=True to TrainingArguments if you have a GPU with mixed-precision support (anything Volta or newer). This typically lets you double your batch size.
### Premise-hypothesis order matters
NLI models are not symmetric. Swapping the premise and hypothesis changes the prediction. “A entails B” does not mean “B entails A.” Always pass the evidence/context as the premise and the claim/query as the hypothesis.