You have three annotators, two vendor datasets, and a legacy CSV from last year. They all overlap, they all disagree, and somebody needs to merge them into one clean training set. That somebody is you.
Here is the problem in its simplest form – two DataFrames with conflicting labels for the same text:
```python
import pandas as pd
import numpy as np

# Two annotation sources with overlapping samples
df_a = pd.DataFrame({
    "text": ["the food was great", "terrible service", "okay experience", "loved the ambiance"],
    "label": ["positive", "negative", "neutral", "positive"],
    "source": "annotator_a",
    "confidence": [0.95, 0.88, 0.60, 0.92]
})

df_b = pd.DataFrame({
    "text": ["the food was great", "terrible service", "okay experience", "price was fair"],
    "label": ["positive", "negative", "positive", "neutral"],
    "source": "annotator_b",
    "confidence": [0.91, 0.93, 0.72, 0.85]
})

# Merge on text to see conflicts
merged = pd.merge(df_a, df_b, on="text", suffixes=("_a", "_b"))
conflicts = merged[merged["label_a"] != merged["label_b"]]
print(conflicts[["text", "label_a", "label_b"]])
```
That third row – “okay experience” – is neutral in one source and positive in the other. You need a systematic way to handle this across thousands of rows.
## Detecting Overlapping Samples

### Exact Match Detection
The simplest overlap check is an exact string match on the join key. This catches identical rows across sources.
```python
def find_exact_overlaps(df_a: pd.DataFrame, df_b: pd.DataFrame, key: str = "text") -> pd.DataFrame:
    """Find rows that appear in both DataFrames based on exact key match."""
    overlap = pd.merge(df_a, df_b, on=key, suffixes=("_a", "_b"), how="inner")
    print(f"Found {len(overlap)} exact overlaps out of {len(df_a)} + {len(df_b)} total rows")
    return overlap

overlaps = find_exact_overlaps(df_a, df_b)
```
### Fuzzy Match Detection
Real data is messy. “The food was great” and “the food was great!” are the same sample but won’t match exactly. Use rapidfuzz for fast fuzzy matching:
```python
from rapidfuzz import fuzz, process

def find_fuzzy_overlaps(
    df_a: pd.DataFrame,
    df_b: pd.DataFrame,
    key: str = "text",
    threshold: int = 90
) -> list[dict]:
    """Find near-duplicate rows across two DataFrames using fuzzy string matching."""
    matches = []
    texts_b = df_b[key].tolist()
    for idx_a, row_a in df_a.iterrows():
        # extractOne returns (match, score, index into texts_b)
        result = process.extractOne(row_a[key], texts_b, scorer=fuzz.token_sort_ratio)
        if result and result[1] >= threshold:
            matches.append({
                "text_a": row_a[key],
                "text_b": result[0],
                "score": result[1],
                # Use the returned positional index -- robust even if texts repeat in df_b
                "idx_b": df_b.index[result[2]],
                "idx_a": idx_a
            })
    print(f"Found {len(matches)} fuzzy matches at threshold {threshold}")
    return matches

fuzzy_matches = find_fuzzy_overlaps(df_a, df_b, threshold=85)
```
Set the threshold based on your domain. For short text like tweets, 85 works well. For longer documents, push it to 92 or higher to avoid false positives.
## Conflict Resolution Strategies
Once you know which rows overlap and disagree, you need a policy for picking the winner.
### Majority Vote
When you have three or more sources, majority vote is the simplest strategy. The label that appears most often wins.
```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most common label. Ties go to the first label encountered."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Example with three annotators
labels_for_sample = ["positive", "neutral", "positive"]
winner = majority_vote(labels_for_sample)
print(f"Resolved label: {winner}")  # positive
```
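One caveat: `Counter.most_common` breaks ties by insertion order, so the winner can depend on which source you happened to read first. A variant that surfaces ties instead of silently resolving them — the `NEEDS_REVIEW` sentinel is my own convention, not a standard:

```python
from collections import Counter

def majority_vote_strict(labels: list[str], tie_label: str = "NEEDS_REVIEW") -> str:
    """Majority vote that flags ties instead of silently picking one side."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label  # no clear winner -- route to manual review
    return counts[0][0]

print(majority_vote_strict(["positive", "neutral", "positive"]))  # positive
print(majority_vote_strict(["positive", "neutral"]))              # NEEDS_REVIEW
```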
### Confidence-Weighted Resolution
If each annotation comes with a confidence score, weight by that score instead of treating every vote equally. This is my preferred approach when confidence metadata is available.
```python
def confidence_weighted_resolve(
    labels: list[str],
    confidences: list[float]
) -> str:
    """Pick the label with the highest total confidence weight."""
    score_map: dict[str, float] = {}
    for label, conf in zip(labels, confidences):
        score_map[label] = score_map.get(label, 0.0) + conf
    return max(score_map, key=score_map.get)

# "okay experience" from our earlier conflict
resolved = confidence_weighted_resolve(
    labels=["neutral", "positive"],
    confidences=[0.60, 0.72]
)
print(f"Resolved: {resolved}")  # positive (0.72 > 0.60)
```
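The payoff of weighting shows up when it disagrees with the raw count. Restating the same rule on a three-annotator example, two low-confidence votes lose to one high-confidence vote:

```python
def confidence_weighted_resolve(labels: list[str], confidences: list[float]) -> str:
    """Pick the label with the highest total confidence weight."""
    scores: dict[str, float] = {}
    for label, conf in zip(labels, confidences):
        scores[label] = scores.get(label, 0.0) + conf
    return max(scores, key=scores.get)

# Majority vote would say "positive"; the weights say otherwise
resolved = confidence_weighted_resolve(
    labels=["positive", "positive", "negative"],
    confidences=[0.35, 0.40, 0.90],
)
print(resolved)  # negative (0.90 > 0.35 + 0.40)
```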
### Annotator Agreement with Cohen’s Kappa
Before blindly merging, measure how much your annotators agree. Cohen’s kappa tells you if agreement is better than chance. If kappa is low, you probably need re-annotation rather than automated resolution.
```python
from sklearn.metrics import cohen_kappa_score

# Only compare labels for overlapping samples
overlap = pd.merge(df_a, df_b, on="text", suffixes=("_a", "_b"), how="inner")
kappa = cohen_kappa_score(overlap["label_a"], overlap["label_b"])
print(f"Cohen's kappa: {kappa:.3f}")

# Interpretation thresholds
if kappa < 0.20:
    print("Poor agreement -- re-annotate or revise guidelines")
elif kappa < 0.40:
    print("Fair agreement -- review conflicting samples manually")
elif kappa < 0.60:
    print("Moderate agreement -- automated resolution is reasonable")
else:
    print("Good agreement -- safe to merge with confidence weighting")
```
Use kappa as a gate. If it is below 0.40, stop and fix your annotation guidelines before spending compute on merging.
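Cohen's kappa is strictly pairwise. With three or more sources, one workable gate — sketched here with illustrative annotator data — is the minimum kappa over all pairs; if you want a single multi-rater statistic instead, Fleiss' kappa is the standard generalization.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Aligned labels for the same samples from each source (illustrative data)
annotations = {
    "annotator_a": ["positive", "negative", "neutral", "positive", "negative"],
    "annotator_b": ["positive", "negative", "positive", "positive", "negative"],
    "annotator_c": ["positive", "negative", "neutral", "negative", "negative"],
}

# Kappa for every pair of annotators
pairwise = {
    (a, b): cohen_kappa_score(annotations[a], annotations[b])
    for a, b in combinations(annotations, 2)
}
min_kappa = min(pairwise.values())  # gate on the weakest pair
print(pairwise, min_kappa)
```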
## Deduplication
After resolving conflicts, you still need to remove duplicate rows that slipped through from different sources.
### Hash-Based Deduplication
Fast and deterministic. Hash the text content and drop duplicates.
```python
import hashlib

def hash_dedup(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Remove exact duplicates based on content hash."""
    df = df.copy()
    df["content_hash"] = df[text_col].apply(
        lambda x: hashlib.sha256(x.strip().lower().encode()).hexdigest()
    )
    before = len(df)
    df = df.drop_duplicates(subset="content_hash", keep="first")
    print(f"Dedup: {before} -> {len(df)} rows ({before - len(df)} removed)")
    return df.drop(columns=["content_hash"])
```
### Embedding-Based Deduplication
Hash-based dedup misses semantic duplicates like “great food” and “the food was excellent.” Use vector similarity to catch these — here with TF-IDF cosine similarity as a lightweight stand-in for full sentence embeddings:
```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def embedding_dedup(
    df: pd.DataFrame,
    text_col: str = "text",
    threshold: float = 0.92
) -> pd.DataFrame:
    """Remove semantic duplicates using TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(df[text_col])
    sim_matrix = cosine_similarity(tfidf_matrix)

    # Find pairs above threshold (excluding self-similarity)
    np.fill_diagonal(sim_matrix, 0)
    to_drop = set()
    for i in range(len(sim_matrix)):
        if i in to_drop:
            continue
        for j in range(i + 1, len(sim_matrix)):
            if sim_matrix[i, j] >= threshold:
                to_drop.add(j)

    before = len(df)
    df_clean = df.drop(index=df.index[list(to_drop)]).reset_index(drop=True)
    print(f"Semantic dedup: {before} -> {len(df_clean)} rows ({len(to_drop)} removed)")
    return df_clean
```
For production workloads with millions of rows, swap TF-IDF for a proper sentence transformer model and use FAISS for approximate nearest neighbor search instead of brute-force cosine similarity.
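Short of moving to ANN, you can also keep exact brute-force scoring but bound memory by computing the similarity matrix one block of rows at a time. A sketch of that pattern — `block_size` is a tunable assumption, and this is still O(n²) time, just not O(n²) memory:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def blocked_duplicate_pairs(texts, threshold=0.92, block_size=1024):
    """Yield (i, j) index pairs with cosine similarity >= threshold,
    materializing only block_size rows of the similarity matrix at a time."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    n = tfidf.shape[0]
    for start in range(0, n, block_size):
        # Similarity of this row block against the full matrix
        block = cosine_similarity(tfidf[start:start + block_size], tfidf)
        for bi, row in enumerate(block):
            i = start + bi
            for j in np.nonzero(row >= threshold)[0]:
                if j > i:  # upper triangle only, skips self-matches
                    yield i, int(j)

texts = ["the food was great", "the food was great", "terrible service"]
print(list(blocked_duplicate_pairs(texts)))  # [(0, 1)]
```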
## Building the Full Pipeline
Tie everything together into a single class that chains detection, resolution, and deduplication.
```python
import hashlib

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics.pairwise import cosine_similarity


class DatasetMergePipeline:
    def __init__(
        self,
        text_col: str = "text",
        label_col: str = "label",
        confidence_col: str = "confidence",
        kappa_threshold: float = 0.40,
        dedup_similarity: float = 0.92
    ):
        self.text_col = text_col
        self.label_col = label_col
        self.confidence_col = confidence_col
        self.kappa_threshold = kappa_threshold
        self.dedup_similarity = dedup_similarity
        self.stats: dict = {}

    def check_agreement(self, df_a: pd.DataFrame, df_b: pd.DataFrame) -> float:
        overlap = pd.merge(df_a, df_b, on=self.text_col, suffixes=("_a", "_b"), how="inner")
        if len(overlap) < 2:
            print("Warning: fewer than 2 overlapping samples, skipping kappa check")
            return 1.0
        kappa = cohen_kappa_score(
            overlap[f"{self.label_col}_a"],
            overlap[f"{self.label_col}_b"]
        )
        self.stats["kappa"] = kappa
        return kappa

    def resolve_conflicts(self, merged: pd.DataFrame) -> pd.DataFrame:
        """Resolve label conflicts using confidence-weighted voting."""
        rows = []
        for _, row in merged.iterrows():
            label_a = row[f"{self.label_col}_a"]
            label_b = row[f"{self.label_col}_b"]
            conf_a = row.get(f"{self.confidence_col}_a", 0.5)
            conf_b = row.get(f"{self.confidence_col}_b", 0.5)
            if label_a == label_b:
                final_label = label_a
                final_conf = max(conf_a, conf_b)
            elif conf_a >= conf_b:
                final_label = label_a
                final_conf = conf_a
            else:
                final_label = label_b
                final_conf = conf_b
            rows.append({
                self.text_col: row[self.text_col],
                self.label_col: final_label,
                self.confidence_col: final_conf
            })
        return pd.DataFrame(rows)

    def deduplicate(self, df: pd.DataFrame) -> pd.DataFrame:
        # Hash-based first pass
        df = df.copy()
        df["_hash"] = df[self.text_col].apply(
            lambda x: hashlib.sha256(x.strip().lower().encode()).hexdigest()
        )
        df = df.drop_duplicates(subset="_hash", keep="first").drop(columns=["_hash"])

        # Embedding-based second pass
        if len(df) < 2:
            return df.reset_index(drop=True)
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(df[self.text_col])
        sim = cosine_similarity(tfidf)
        np.fill_diagonal(sim, 0)
        to_drop = set()
        for i in range(len(sim)):
            if i in to_drop:
                continue
            for j in range(i + 1, len(sim)):
                if sim[i, j] >= self.dedup_similarity:
                    to_drop.add(j)
        return df.drop(index=df.index[list(to_drop)]).reset_index(drop=True)

    def run(self, datasets: list[pd.DataFrame]) -> pd.DataFrame:
        """Merge a list of DataFrames into one deduplicated dataset."""
        if len(datasets) < 2:
            raise ValueError("Need at least 2 datasets to merge")

        # Check agreement between the first two sources
        kappa = self.check_agreement(datasets[0], datasets[1])
        print(f"Inter-annotator agreement (kappa): {kappa:.3f}")
        if kappa < self.kappa_threshold:
            print(f"Warning: kappa {kappa:.3f} below threshold {self.kappa_threshold}")

        # Merge overlapping samples with conflict resolution
        merged = pd.merge(
            datasets[0], datasets[1],
            on=self.text_col, suffixes=("_a", "_b"), how="inner"
        )
        resolved = self.resolve_conflicts(merged)

        # Collect non-overlapping samples from each source
        overlap_texts = set(resolved[self.text_col])
        unique_parts = []
        for ds in datasets:
            unique = ds[~ds[self.text_col].isin(overlap_texts)][[
                self.text_col, self.label_col, self.confidence_col
            ]]
            unique_parts.append(unique)
        combined = pd.concat([resolved] + unique_parts, ignore_index=True)

        # Deduplicate
        result = self.deduplicate(combined)
        self.stats["input_total"] = sum(len(d) for d in datasets)
        self.stats["output_total"] = len(result)
        print(f"Pipeline complete: {self.stats['input_total']} input -> {len(result)} output rows")
        return result


# Run the pipeline
pipeline = DatasetMergePipeline(kappa_threshold=0.20, dedup_similarity=0.90)
result = pipeline.run([df_a, df_b])
print(result)
```
That gives you a single clean DataFrame with conflicts resolved and duplicates removed. Adjust `kappa_threshold` and `dedup_similarity` for your domain.
## Common Errors and Fixes

### KeyError on merge columns

You will hit a `KeyError` if the column names differ between sources. Rename columns before merging:
```python
# Standardize column names before merge
df_vendor = df_vendor.rename(columns={"sentence": "text", "sentiment": "label"})
```
### Empty overlap after merge

If `pd.merge(..., how="inner")` returns zero rows, your text columns probably have whitespace or encoding differences. Normalize first:
```python
df["text"] = df["text"].str.strip().str.lower()
```
### Cohen’s kappa returning negative values
A negative kappa means agreement is worse than random chance. This usually signals a systematic labeling mismatch – one annotator’s “neutral” is another’s “positive.” Align label definitions and re-check.
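You can reproduce this with two annotators whose labels are systematic inversions of each other:

```python
from sklearn.metrics import cohen_kappa_score

# Annotator B's labels are a systematic inversion of annotator A's
labels_a = ["positive", "negative", "positive", "negative"]
labels_b = ["negative", "positive", "negative", "positive"]
print(cohen_kappa_score(labels_a, labels_b))  # -1.0, agreement worse than chance
```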
### Memory errors on large cosine similarity matrices

Computing a full n × n similarity matrix blows up memory fast. For datasets over 100k rows, use approximate methods:
```python
# Use MinHash LSH for large-scale dedup instead of brute-force cosine
# pip install datasketch
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)

# Index each document's token-level MinHash, then query for near-duplicates
m = MinHash(num_perm=128)
for token in "the food was great".split():
    m.update(token.encode("utf8"))
lsh.insert("doc_0", m)
```
### Duplicate rows surviving deduplication
If you are still seeing dupes, check that you are normalizing text before hashing. Case differences, trailing whitespace, and unicode variants (curly quotes vs straight quotes) will produce different hashes for the same content.
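One normalizer that covers those cases — note that NFKC alone does not fold curly quotes to straight ones, so the explicit mapping below is needed; the exact policy here is a choice, not a standard:

```python
import unicodedata

# NFKC handles compatibility forms; curly quotes need an explicit mapping
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def normalize_for_hash(text: str) -> str:
    """Lowercase, trim, NFKC-normalize, and straighten curly quotes before hashing."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(QUOTE_MAP).strip().lower()

print(normalize_for_hash("\u201cGreat food!\u201d "))  # "great food!"
```

Run this on every source before `hash_dedup` so the same content always produces the same hash.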