Most resume parsers you find online are either paid SaaS products or half-baked regex scripts that break on the first unusual format. You can build something far better with spaCy for entity extraction and a Hugging Face zero-shot classifier to label resume sections — no training data required.

Here’s the quick version. Extract text from a PDF, pull out entities with spaCy, and return clean JSON; the zero-shot classifier is loaded up front but only comes into play in the full parser below:

import spacy
import fitz  # PyMuPDF
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def parse_resume(pdf_path: str) -> dict:
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()

    spacy_doc = nlp(text)

    # Take the first PERSON entity -- resumes put the candidate's name at the top.
    # (A {label: text} dict comprehension would silently keep only the *last*
    # entity of each label, which is usually a false positive further down.)
    person_names = [ent.text for ent in spacy_doc.ents if ent.label_ == "PERSON"]

    return {
        "person": person_names[0] if person_names else "",
        "organizations": [e.text for e in spacy_doc.ents if e.label_ == "ORG"],
        "dates": [e.text for e in spacy_doc.ents if e.label_ == "DATE"],
        "raw_text": text,
    }

result = parse_resume("resume.pdf")
print(result)

That gets you started. Now let’s build a proper parser that handles contact info, skills, work experience, and education.

Extracting Text from PDF Resumes

PyMuPDF (fitz) is fast and handles most resume PDFs well, though multi-column layouts can come out in a scrambled reading order. Install it alongside the other dependencies:

pip install pymupdf spacy transformers torch
python -m spacy download en_core_web_sm

The text extraction function needs to handle page ordering and strip out extra whitespace:

import fitz

def extract_text_from_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n".join(pages).strip()

resume_text = extract_text_from_pdf("resume.pdf")
print(resume_text[:500])

The "text" argument to get_text() gives you plain text in the page’s natural reading order. If you’re dealing with scanned PDFs (images instead of text), you’ll need OCR; that’s a different pipeline entirely. This approach works for the vast majority of resumes that are created in Word or Google Docs and exported to PDF.
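One cheap way to catch scanned PDFs before the parser silently returns nothing is a character-count check on the extracted pages. This is a sketch, not part of the parser above; the function name and the 25-characters-per-page threshold are assumptions you’d tune on your own corpus:

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 25) -> bool:
    """Heuristic: if extraction yields almost no text per page,
    the PDF is probably image-based and needs OCR instead."""
    if not page_texts:
        return True
    avg_chars = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg_chars < min_chars_per_page
```

Call it with `[page.get_text("text") for page in doc]` and route low-text files to an OCR pipeline instead of returning an empty parse.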

Extracting Contact Information with spaCy and Regex

spaCy’s NER will catch names and organizations, but it won’t find emails or phone numbers. For those, regex is the right tool — no need to overcomplicate it.

import re
import spacy

nlp = spacy.load("en_core_web_sm")

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
PHONE_PATTERN = re.compile(
    r"(?:\+?\d{1,3}[-.\s]?)?\(?\d{2,4}\)?[-.\s]?\d{3,4}[-.\s]?\d{3,4}"
)
LINKEDIN_PATTERN = re.compile(r"linkedin\.com/in/[\w-]+", re.IGNORECASE)

def extract_contact_info(text: str) -> dict:
    doc = nlp(text)

    # spaCy grabs the first PERSON entity as the candidate name
    person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    name = person_names[0] if person_names else ""

    emails = EMAIL_PATTERN.findall(text)
    phones = PHONE_PATTERN.findall(text)
    linkedin = LINKEDIN_PATTERN.findall(text)

    return {
        "name": name,
        "email": emails[0] if emails else "",
        "phone": phones[0] if phones else "",
        "linkedin": linkedin[0] if linkedin else "",
    }

sample_text = """
John Martinez
[email protected] | (555) 123-4567
linkedin.com/in/john-martinez-dev

Senior Software Engineer with 8 years of experience building distributed systems.
"""

contact = extract_contact_info(sample_text)
print(contact)
# {'name': 'John Martinez', 'email': '[email protected]',
#  'phone': '(555) 123-4567', 'linkedin': 'linkedin.com/in/john-martinez-dev'}

The name detection isn’t perfect. spaCy sometimes tags company names or section headers as PERSON. A useful heuristic: take the first PERSON entity that appears in the top 5 lines of the resume. Names almost always appear at the very top.
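That heuristic is easy to sketch. The helper below takes `(text, label)` pairs (what you’d get from `[(ent.text, ent.label_) for ent in doc.ents]`) so it’s testable without loading a model; the function name and the 5-line window are illustrative assumptions:

```python
def pick_candidate_name(entities: list[tuple[str, str]], text: str,
                        top_n_lines: int = 5) -> str:
    """Prefer a PERSON entity that appears within the first few lines,
    where the candidate's name almost always lives."""
    header = "\n".join(text.split("\n")[:top_n_lines])
    for ent_text, label in entities:
        if label == "PERSON" and ent_text in header:
            return ent_text
    # Fall back to the first PERSON found anywhere in the document
    people = [t for t, lbl in entities if lbl == "PERSON"]
    return people[0] if people else ""
```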

Classifying Resume Sections with Zero-Shot

Resumes have common sections — experience, education, skills, projects — but they’re labeled inconsistently. One resume says “Work Experience”, another says “Professional Background”, a third just says “Experience”. Zero-shot classification handles this without any training:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

SECTION_LABELS = ["work experience", "education", "skills", "projects", "summary", "certifications"]

def classify_sections(text: str) -> list[dict]:
    # Split on lines that look like section headers (all caps or title case, short)
    lines = text.split("\n")
    sections = []
    current_header = ""
    current_content = []

    for line in lines:
        stripped = line.strip()
        # Heuristic: headers are short, often uppercase or title-cased
        if stripped and len(stripped.split()) <= 5 and (stripped.isupper() or stripped.istitle()):
            if current_header:
                sections.append({"header": current_header, "content": "\n".join(current_content)})
            current_header = stripped
            current_content = []
        else:
            current_content.append(line)

    if current_header:
        sections.append({"header": current_header, "content": "\n".join(current_content)})

    # Classify each section header
    for section in sections:
        result = classifier(section["header"], candidate_labels=SECTION_LABELS)
        section["label"] = result["labels"][0]
        section["confidence"] = result["scores"][0]

    return sections

sample_resume = """PROFESSIONAL EXPERIENCE
Software Engineer at Acme Corp, 2020-2024
Built microservices handling 10k requests per second.

EDUCATION
B.S. Computer Science, MIT, 2020

TECHNICAL SKILLS
Python, Go, Kubernetes, PostgreSQL, AWS
"""

sections = classify_sections(sample_resume)
for s in sections:
    print(f"{s['header']} -> {s['label']} ({s['confidence']:.2f})")
# PROFESSIONAL EXPERIENCE -> work experience (0.92)
# EDUCATION -> education (0.97)
# TECHNICAL SKILLS -> skills (0.95)

The BART-MNLI model does well on this task out of the box. Confidence above 0.8 is typically reliable. Below that, you might want to fall back to keyword matching on the header text.
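A keyword fallback can be as simple as a substring table. The mapping below is a sketch (the table entries and the `keyword_fallback` name are assumptions; extend the table as you see misses):

```python
KEYWORD_MAP = {
    "experience": "work experience",
    "employment": "work experience",
    "education": "education",
    "skill": "skills",
    "project": "projects",
    "summary": "summary",
    "certification": "certifications",
}

def keyword_fallback(header: str, default: str = "unknown") -> str:
    """Cheap backup when the zero-shot score is low: substring-match
    known keywords against the lowercased header."""
    lowered = header.lower()
    for keyword, label in KEYWORD_MAP.items():
        if keyword in lowered:
            return label
    return default
```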

Putting It All Together

Now combine everything into a single function that takes a PDF path and returns structured JSON:

import json
import re
import fitz
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
PHONE_PATTERN = re.compile(
    r"(?:\+?\d{1,3}[-.\s]?)?\(?\d{2,4}\)?[-.\s]?\d{3,4}[-.\s]?\d{3,4}"
)
SECTION_LABELS = ["work experience", "education", "skills", "projects", "summary", "certifications"]


def extract_skills_from_section(text: str) -> list[str]:
    """Pull individual skills from a skills section."""
    # Skills are usually comma-separated, pipe-separated, or one per line
    skills = re.split(r"[,|;\n]", text)
    return [s.strip() for s in skills if s.strip() and len(s.strip()) < 50]


def parse_resume_full(pdf_path: str) -> dict:
    # 1. Extract raw text
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text("text") for page in doc)
    doc.close()

    # 2. Contact info
    spacy_doc = nlp(text)
    person_names = [ent.text for ent in spacy_doc.ents if ent.label_ == "PERSON"]
    emails = EMAIL_PATTERN.findall(text)
    phones = PHONE_PATTERN.findall(text)

    contact = {
        "name": person_names[0] if person_names else "",
        "email": emails[0] if emails else "",
        "phone": phones[0] if phones else "",
    }

    # 3. Section classification
    lines = text.split("\n")
    sections = []
    current_header = ""
    current_content = []

    for line in lines:
        stripped = line.strip()
        if stripped and len(stripped.split()) <= 5 and (stripped.isupper() or stripped.istitle()):
            if current_header:
                sections.append({"header": current_header, "content": "\n".join(current_content)})
            current_header = stripped
            current_content = []
        else:
            current_content.append(line)

    if current_header:
        sections.append({"header": current_header, "content": "\n".join(current_content)})

    for section in sections:
        result = classifier(section["header"], candidate_labels=SECTION_LABELS)
        section["label"] = result["labels"][0]
        section["confidence"] = result["scores"][0]

    # 4. Build structured output
    parsed = {
        "contact": contact,
        "skills": [],
        "experience": [],
        "education": [],
    }

    for section in sections:
        if section["label"] == "skills" and section["confidence"] > 0.7:
            parsed["skills"] = extract_skills_from_section(section["content"])
        elif section["label"] == "work experience" and section["confidence"] > 0.7:
            parsed["experience"].append(section["content"].strip())
        elif section["label"] == "education" and section["confidence"] > 0.7:
            parsed["education"].append(section["content"].strip())

    return parsed


result = parse_resume_full("resume.pdf")
print(json.dumps(result, indent=2))

This gives you output like:

{
  "contact": {
    "name": "John Martinez",
    "email": "[email protected]",
    "phone": "(555) 123-4567"
  },
  "skills": ["Python", "Go", "Kubernetes", "PostgreSQL", "AWS"],
  "experience": ["Software Engineer at Acme Corp, 2020-2024\nBuilt microservices handling 10k requests per second."],
  "education": ["B.S. Computer Science, MIT, 2020"]
}

Common Errors and Fixes

OSError: [E050] Can't find model 'en_core_web_sm'

You need to download the model separately after installing spaCy:

python -m spacy download en_core_web_sm

ModuleNotFoundError: No module named 'torch'

The facebook/bart-large-mnli model needs PyTorch. Install it:

pip install torch

If you’re on a machine without a GPU, the zero-shot classifier still works — it just runs on CPU. Expect about 1-2 seconds per classification call. For batch processing many resumes, consider caching the classifier instance (which the code above already does by creating it at module level).
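If you process many resumes, the same handful of headers recurs constantly, so memoizing the classification result is an easy win on top of the cached instance. This wrapper is a sketch (`make_cached_classifier` is an assumed name; it requires the wrapped callable to be deterministic, which the pipeline is at inference time):

```python
from functools import lru_cache

def make_cached_classifier(classify_fn):
    """Wrap a header -> label callable so repeated headers
    ("EDUCATION" appears on nearly every resume) are scored once."""
    @lru_cache(maxsize=1024)
    def cached(header: str):
        return classify_fn(header)
    return cached
```

You’d wrap something like `lambda h: classifier(h, candidate_labels=SECTION_LABELS)["labels"][0]`.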

Phone regex matches random numbers

The phone pattern can be greedy. If you’re getting false positives, restrict it to only match numbers that appear near the top of the resume (first 10 lines). Most resumes put contact info in a header block:

header_text = "\n".join(text.split("\n")[:10])
phones = PHONE_PATTERN.findall(header_text)

spaCy tags a company name as PERSON

This happens with names like “Chase” or “Wells Fargo.” Filter by position — the candidate name should appear in the first few lines. You can also cross-reference against the ORG entities and remove any overlap:

orgs = {ent.text for ent in spacy_doc.ents if ent.label_ == "ORG"}
person_names = [
    ent.text for ent in spacy_doc.ents
    if ent.label_ == "PERSON" and ent.text not in orgs
]

Section headers not detected

The heuristic for header detection (short + uppercase/title case) won’t catch every format. Some resumes use bold text with no casing difference, or embed headers in tables. For those edge cases, you could classify every paragraph instead of just detected headers, but that’s slower and overkill for most resumes.
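If you do go the classify-everything route, the paragraph split itself is simple. A sketch (the function name and the 20-character floor are assumptions):

```python
def split_paragraphs(text: str, min_chars: int = 20) -> list[str]:
    """Split on blank lines and drop fragments too short
    to classify meaningfully."""
    parts = [p.strip() for p in text.split("\n\n")]
    return [p for p in parts if len(p) >= min_chars]
```

Each surviving paragraph would then go straight to `classifier(paragraph, candidate_labels=SECTION_LABELS)` instead of relying on header detection.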