You have three annotators, two vendor datasets, and a legacy CSV from last year. They all overlap, they all disagree, and somebody needs to merge them into one clean training set. That somebody is you.

Here is the problem in its simplest form – two DataFrames with conflicting labels for the same text:

import pandas as pd
import numpy as np

# Two annotation sources with overlapping samples
df_a = pd.DataFrame({
    "text": ["the food was great", "terrible service", "okay experience", "loved the ambiance"],
    "label": ["positive", "negative", "neutral", "positive"],
    "source": "annotator_a",
    "confidence": [0.95, 0.88, 0.60, 0.92]
})

df_b = pd.DataFrame({
    "text": ["the food was great", "terrible service", "okay experience", "price was fair"],
    "label": ["positive", "negative", "positive", "neutral"],
    "source": "annotator_b",
    "confidence": [0.91, 0.93, 0.72, 0.85]
})

# Merge on text to see conflicts
merged = pd.merge(df_a, df_b, on="text", suffixes=("_a", "_b"))
conflicts = merged[merged["label_a"] != merged["label_b"]]
print(conflicts[["text", "label_a", "label_b"]])

That third row – “okay experience” – is neutral in one source and positive in the other. You need a systematic way to handle this across thousands of rows.

Detecting Overlapping Samples

Exact Match Detection

The simplest overlap check is an exact string match on the join key. This catches identical rows across sources.

def find_exact_overlaps(df_a: pd.DataFrame, df_b: pd.DataFrame, key: str = "text") -> pd.DataFrame:
    """Find rows that appear in both DataFrames based on exact key match."""
    overlap = pd.merge(df_a, df_b, on=key, suffixes=("_a", "_b"), how="inner")
    print(f"Found {len(overlap)} exact overlaps out of {len(df_a)} + {len(df_b)} total rows")
    return overlap

overlaps = find_exact_overlaps(df_a, df_b)

Fuzzy Match Detection

Real data is messy. “The food was great” and “the food was great!” are the same sample but won’t match exactly. Use rapidfuzz for fast fuzzy matching:

from rapidfuzz import fuzz, process

def find_fuzzy_overlaps(
    df_a: pd.DataFrame,
    df_b: pd.DataFrame,
    key: str = "text",
    threshold: int = 90
) -> list[dict]:
    """Find near-duplicate rows across two DataFrames using fuzzy string matching."""
    matches = []
    texts_b = df_b[key].tolist()

    for idx_a, row_a in df_a.iterrows():
        result = process.extractOne(row_a[key], texts_b, scorer=fuzz.token_sort_ratio)
        if result and result[1] >= threshold:
            matched_text, score, pos_b = result  # extractOne returns (match, score, position)
            matches.append({
                "text_a": row_a[key],
                "text_b": matched_text,
                "score": score,
                "idx_a": idx_a,
                "idx_b": df_b.index[pos_b]
            })

    print(f"Found {len(matches)} fuzzy matches at threshold {threshold}")
    return matches

fuzzy_matches = find_fuzzy_overlaps(df_a, df_b, threshold=85)

Set the threshold based on your domain. For short text like tweets, 85 works well. For longer documents, push it to 92 or higher to avoid false positives.

Conflict Resolution Strategies

Once you know which rows overlap and disagree, you need a policy for picking the winner.

Majority Vote

When you have three or more sources, majority vote is the simplest strategy. The label that appears most often wins.

from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most common label. Ties go to the first label encountered."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Example with three annotators
labels_for_sample = ["positive", "neutral", "positive"]
winner = majority_vote(labels_for_sample)
print(f"Resolved label: {winner}")  # positive
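
The same vote scales across thousands of rows when the annotations sit in one long-format DataFrame – one row per (text, annotation) pair – using a groupby. A sketch, assuming the column names from our running example:

```python
import pandas as pd
from collections import Counter

# One row per annotation; each text was labeled by three annotators
long_df = pd.DataFrame({
    "text": ["okay experience"] * 3 + ["terrible service"] * 3,
    "label": ["neutral", "positive", "positive",
              "negative", "negative", "neutral"],
})

# most_common(1) breaks ties in first-encountered order, matching majority_vote
resolved = (
    long_df.groupby("text", sort=False)["label"]
    .agg(lambda s: Counter(s).most_common(1)[0][0])
    .reset_index()
)
print(resolved)  # okay experience -> positive, terrible service -> negative
```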

Confidence-Weighted Resolution

If each annotation comes with a confidence score, weight by that score instead of treating every vote equally. This is my preferred approach when confidence metadata is available.

def confidence_weighted_resolve(
    labels: list[str],
    confidences: list[float]
) -> str:
    """Pick the label with the highest total confidence weight."""
    score_map: dict[str, float] = {}
    for label, conf in zip(labels, confidences):
        score_map[label] = score_map.get(label, 0.0) + conf

    return max(score_map, key=score_map.get)

# "okay experience" from our earlier conflict
resolved = confidence_weighted_resolve(
    labels=["neutral", "positive"],
    confidences=[0.60, 0.72]
)
print(f"Resolved: {resolved}")  # positive (0.72 > 0.60)

Annotator Agreement with Cohen’s Kappa

Before blindly merging, measure how much your annotators agree. Cohen’s kappa tells you if agreement is better than chance. If kappa is low, you probably need re-annotation rather than automated resolution.

from sklearn.metrics import cohen_kappa_score

# Only compare labels for overlapping samples
overlap = pd.merge(df_a, df_b, on="text", suffixes=("_a", "_b"), how="inner")
kappa = cohen_kappa_score(overlap["label_a"], overlap["label_b"])
print(f"Cohen's kappa: {kappa:.3f}")

# Interpretation thresholds
if kappa < 0.20:
    print("Poor agreement -- re-annotate or revise guidelines")
elif kappa < 0.40:
    print("Fair agreement -- review conflicting samples manually")
elif kappa < 0.60:
    print("Moderate agreement -- automated resolution is reasonable")
else:
    print("Good agreement -- safe to merge with confidence weighting")

Use kappa as a gate. If it is below 0.40, stop and fix your annotation guidelines before spending compute on merging.
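
For intuition about what the gate measures: kappa is observed agreement p_o corrected for chance agreement p_e, i.e. (p_o - p_e) / (1 - p_e), where p_e comes from each annotator's marginal label rates. A minimal pure-Python version, for illustration rather than as a replacement for sklearn's:

```python
from collections import Counter

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa from scratch: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # chance agreement: product of each annotator's marginal label rates
    p_e = sum((counts_a[lbl] / n) * (counts_b[lbl] / n)
              for lbl in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

print(cohen_kappa(["pos", "pos", "neg", "neg"],
                  ["pos", "pos", "neg", "pos"]))  # 0.5
```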

Deduplication

After resolving conflicts, you still need to remove duplicate rows that slipped through from different sources.

Hash-Based Deduplication

Fast and deterministic. Hash the text content and drop duplicates.

import hashlib

def hash_dedup(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Remove exact duplicates based on content hash."""
    df = df.copy()
    df["content_hash"] = df[text_col].apply(
        lambda x: hashlib.sha256(x.strip().lower().encode()).hexdigest()
    )
    before = len(df)
    df = df.drop_duplicates(subset="content_hash", keep="first")
    print(f"Dedup: {before} -> {len(df)} rows ({before - len(df)} removed)")
    return df.drop(columns=["content_hash"])

Embedding-Based Deduplication

Hash-based dedup misses semantic duplicates like “great food” and “the food was excellent.” Vectorize the text and compare with cosine similarity to catch these; the example below uses TF-IDF as a lightweight stand-in for sentence embeddings.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def embedding_dedup(
    df: pd.DataFrame,
    text_col: str = "text",
    threshold: float = 0.92
) -> pd.DataFrame:
    """Remove semantic duplicates using TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(df[text_col])
    sim_matrix = cosine_similarity(tfidf_matrix)

    # Find pairs above threshold (excluding self-similarity)
    np.fill_diagonal(sim_matrix, 0)
    to_drop = set()

    for i in range(len(sim_matrix)):
        if i in to_drop:
            continue
        for j in range(i + 1, len(sim_matrix)):
            if sim_matrix[i][j] >= threshold:
                to_drop.add(j)

    before = len(df)
    df_clean = df.drop(index=df.index[list(to_drop)]).reset_index(drop=True)
    print(f"Semantic dedup: {before} -> {len(df_clean)} rows ({len(to_drop)} removed)")
    return df_clean

For production workloads with millions of rows, swap TF-IDF for a proper sentence transformer model and use FAISS for approximate nearest neighbor search instead of brute-force cosine similarity.
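
Even before switching libraries, you can keep the comparison exact and still bound memory by scoring the similarity matrix one row-chunk at a time instead of materializing all n x n entries at once. A sketch (function name and chunk size are illustrative), assuming rows are already L2-normalized so a dot product equals cosine similarity:

```python
import numpy as np

def chunked_dedup_mask(vectors: np.ndarray, threshold: float = 0.92,
                       chunk: int = 1024) -> np.ndarray:
    """Greedy keep-mask over rows, computing cosine similarity in chunks.

    Assumes rows of `vectors` are L2-normalized, so dot product == cosine.
    Peak memory is O(chunk * n) instead of O(n * n)."""
    n = len(vectors)
    keep = np.ones(n, dtype=bool)
    for start in range(0, n, chunk):
        block = vectors[start:start + chunk] @ vectors.T  # (chunk, n) slice
        for i, row in enumerate(block):
            gi = start + i
            if not keep[gi]:       # already dropped as someone's duplicate
                continue
            dup = row >= threshold
            dup[:gi + 1] = False   # only drop rows that come later
            keep[dup] = False
    return keep
```

This is the same greedy policy as embedding_dedup above, just without the full matrix in memory.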

Building the Full Pipeline

Tie everything together into a single class that chains detection, resolution, and deduplication.

import hashlib
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics.pairwise import cosine_similarity


class DatasetMergePipeline:
    def __init__(
        self,
        text_col: str = "text",
        label_col: str = "label",
        confidence_col: str = "confidence",
        kappa_threshold: float = 0.40,
        dedup_similarity: float = 0.92
    ):
        self.text_col = text_col
        self.label_col = label_col
        self.confidence_col = confidence_col
        self.kappa_threshold = kappa_threshold
        self.dedup_similarity = dedup_similarity
        self.stats: dict = {}

    def check_agreement(self, df_a: pd.DataFrame, df_b: pd.DataFrame) -> float:
        overlap = pd.merge(df_a, df_b, on=self.text_col, suffixes=("_a", "_b"), how="inner")
        if len(overlap) < 2:
            print("Warning: fewer than 2 overlapping samples, skipping kappa check")
            return 1.0
        kappa = cohen_kappa_score(
            overlap[f"{self.label_col}_a"],
            overlap[f"{self.label_col}_b"]
        )
        self.stats["kappa"] = kappa
        return kappa

    def resolve_conflicts(self, merged: pd.DataFrame) -> pd.DataFrame:
        """Resolve label conflicts using confidence-weighted voting."""
        rows = []
        for _, row in merged.iterrows():
            label_a = row[f"{self.label_col}_a"]
            label_b = row[f"{self.label_col}_b"]
            conf_a = row.get(f"{self.confidence_col}_a", 0.5)
            conf_b = row.get(f"{self.confidence_col}_b", 0.5)

            if label_a == label_b:
                final_label = label_a
                final_conf = max(conf_a, conf_b)
            elif conf_a >= conf_b:
                final_label = label_a
                final_conf = conf_a
            else:
                final_label = label_b
                final_conf = conf_b

            rows.append({
                self.text_col: row[self.text_col],
                self.label_col: final_label,
                self.confidence_col: final_conf
            })

        return pd.DataFrame(rows)

    def deduplicate(self, df: pd.DataFrame) -> pd.DataFrame:
        # Hash-based first pass
        df = df.copy()
        df["_hash"] = df[self.text_col].apply(
            lambda x: hashlib.sha256(x.strip().lower().encode()).hexdigest()
        )
        df = df.drop_duplicates(subset="_hash", keep="first").drop(columns=["_hash"])

        # Embedding-based second pass
        if len(df) < 2:
            return df.reset_index(drop=True)

        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(df[self.text_col])
        sim = cosine_similarity(tfidf)
        np.fill_diagonal(sim, 0)

        to_drop = set()
        for i in range(len(sim)):
            if i in to_drop:
                continue
            for j in range(i + 1, len(sim)):
                if sim[i][j] >= self.dedup_similarity:
                    to_drop.add(j)

        return df.drop(index=df.index[list(to_drop)]).reset_index(drop=True)

    def run(self, datasets: list[pd.DataFrame]) -> pd.DataFrame:
        """Merge a list of DataFrames into one deduplicated dataset."""
        if len(datasets) < 2:
            raise ValueError("Need at least 2 datasets to merge")

        # Check pairwise agreement
        kappa = self.check_agreement(datasets[0], datasets[1])
        print(f"Inter-annotator agreement (kappa): {kappa:.3f}")
        if kappa < self.kappa_threshold:
            print(f"Warning: kappa {kappa:.3f} below threshold {self.kappa_threshold}")

        # Merge overlapping samples with conflict resolution
        merged = pd.merge(
            datasets[0], datasets[1],
            on=self.text_col, suffixes=("_a", "_b"), how="inner"
        )
        resolved = self.resolve_conflicts(merged)

        # Collect non-overlapping samples from each source
        overlap_texts = set(resolved[self.text_col])
        unique_parts = []
        for ds in datasets:
            unique = ds[~ds[self.text_col].isin(overlap_texts)][[
                self.text_col, self.label_col, self.confidence_col
            ]]
            unique_parts.append(unique)

        combined = pd.concat([resolved] + unique_parts, ignore_index=True)

        # Deduplicate
        result = self.deduplicate(combined)
        self.stats["input_total"] = sum(len(d) for d in datasets)
        self.stats["output_total"] = len(result)
        print(f"Pipeline complete: {self.stats['input_total']} input -> {len(result)} output rows")
        return result


# Run the pipeline
pipeline = DatasetMergePipeline(kappa_threshold=0.20, dedup_similarity=0.90)
result = pipeline.run([df_a, df_b])
print(result)

That gives you a single clean DataFrame with conflicts resolved and duplicates removed. Adjust kappa_threshold and dedup_similarity for your domain.

Common Errors and Fixes

KeyError on merge columns

You will hit KeyError if the column names differ between sources. Rename columns before merging:

# Standardize column names before merge
df_vendor = df_vendor.rename(columns={"sentence": "text", "sentiment": "label"})

Empty overlap after merge

If pd.merge(..., how="inner") returns zero rows, your text columns probably have whitespace or encoding differences. Normalize first:

df["text"] = df["text"].str.strip().str.lower()

Cohen’s kappa returning negative values

A negative kappa means agreement is worse than random chance. This usually signals a systematic labeling mismatch – one annotator’s “neutral” is another’s “positive.” Align label definitions and re-check.
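
You can reproduce this with a toy example: two annotators whose labels are systematically swapped score well below zero.

```python
from sklearn.metrics import cohen_kappa_score

# Annotator B's "positive" is A's "neutral" and vice versa -- a label-definition swap
labels_a = ["neutral", "positive", "neutral", "positive"]
labels_b = ["positive", "neutral", "positive", "neutral"]

kappa = cohen_kappa_score(labels_a, labels_b)
print(f"{kappa:.2f}")  # -1.00: perfectly systematic disagreement
```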

Memory errors on large cosine similarity matrices

Computing a full n x n similarity matrix blows up memory fast. For datasets over 100k rows, use approximate methods:

# Use MinHash LSH for large-scale dedup instead of brute-force cosine
# pip install datasketch
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for idx, text in enumerate(df["text"]):
    m = MinHash(num_perm=128)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    if not lsh.query(m):       # no near-duplicate indexed yet
        lsh.insert(str(idx), m)

Duplicate rows surviving deduplication

If you are still seeing dupes, check that you are normalizing text before hashing. Case differences, trailing whitespace, and unicode variants (curly quotes vs straight quotes) will produce different hashes for the same content.
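
A normalizer that folds case, whitespace, compatibility forms, and curly quotes before hashing closes most of these gaps. A sketch (the function name is ours; note NFKC alone does not straighten curly quotes, so they are mapped explicitly):

```python
import unicodedata

# NFKC folds compatibility characters (ligatures, full-width forms) but
# leaves curly quotes alone, so map those to straight quotes explicitly
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                         "\u201c": '"', "\u201d": '"'})

def normalize_for_hash(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_QUOTES).strip().lower()

print(normalize_for_hash("\u201cGreat Food\u201d "))  # prints "great food" with straight quotes
```

Apply this before hash_dedup so that case, whitespace, and quote variants collapse to a single hash.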