Code reviews are one of the best use cases for LLM agents. The input is structured (diffs), the output is structured (comments with severity), and the task is well-defined. You can build a working code review agent in under 200 lines of Python.

Here’s the minimal version that grabs staged changes and sends them to an LLM:

pip install openai
import subprocess
import json
from openai import OpenAI

client = OpenAI()

def get_staged_diff() -> str:
    """Get the staged Git diff."""
    result = subprocess.run(
        ["git", "diff", "--staged"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def review_diff(diff: str) -> str:
    """Send a diff to the LLM for review."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior software engineer doing a code review. "
                    "Analyze the Git diff and return feedback as JSON. "
                    "Each item should have: file, line, severity (error|warning|suggestion), message. "
                    "Only flag real issues. Be specific about what's wrong and how to fix it."
                ),
            },
            {"role": "user", "content": f"Review this diff:\n\n{diff}"},
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content

diff = get_staged_diff()
if diff:
    feedback = review_diff(diff)
    print(feedback)
else:
    print("No staged changes to review.")

That works, but it’s raw. The diff goes in as one blob and the feedback comes back as unstructured JSON. Let’s make it production-grade.

Parsing Git Diffs into Structured Hunks

Raw diffs are noisy. Sending the entire diff as one prompt wastes tokens and confuses the model when files are unrelated. Parse the diff into per-file hunks so you can review each file independently.

from dataclasses import dataclass, field

@dataclass
class DiffHunk:
    header: str
    old_start: int
    new_start: int
    lines: list[str] = field(default_factory=list)

@dataclass
class FileDiff:
    filename: str
    hunks: list[DiffHunk] = field(default_factory=list)

def parse_diff(raw_diff: str) -> list[FileDiff]:
    """Parse a unified diff into per-file structured hunks."""
    files: list[FileDiff] = []
    current_file = None
    current_hunk = None

    for line in raw_diff.splitlines():
        if line.startswith("diff --git"):
            # Extract filename from "diff --git a/path b/path"
            parts = line.split(" b/")
            filename = parts[-1] if len(parts) > 1 else "unknown"
            current_file = FileDiff(filename=filename)
            files.append(current_file)
            current_hunk = None

        elif line.startswith("@@") and current_file is not None:
            # Parse hunk header like "@@ -10,5 +10,7 @@"
            header = line
            # Extract line numbers from the hunk header
            parts = line.split(" ")
            old_info = parts[1]  # e.g., "-10,5"
            new_info = parts[2]  # e.g., "+10,7"
            old_start = int(old_info.split(",")[0].lstrip("-"))
            new_start = int(new_info.split(",")[0].lstrip("+"))
            current_hunk = DiffHunk(
                header=header, old_start=old_start, new_start=new_start
            )
            current_file.hunks.append(current_hunk)

        elif current_hunk is not None:
            current_hunk.lines.append(line)

    return files

Now you can iterate over files and send each one to the LLM separately. This keeps the context focused and lets you parallelize reviews across files.

diff = get_staged_diff()
file_diffs = parse_diff(diff)

for fd in file_diffs:
    print(f"{fd.filename}: {len(fd.hunks)} hunk(s)")
    for hunk in fd.hunks:
        print(f"  {hunk.header} ({len(hunk.lines)} lines)")

Structured Review with Function Calling

Asking the model for JSON in a system prompt is fragile. You get better results using OpenAI’s tools parameter to enforce the output schema, which keeps the response tightly aligned with your data structure.

review_tool = {
    "type": "function",
    "function": {
        "name": "submit_review",
        "description": "Submit code review feedback for a file diff",
        "parameters": {
            "type": "object",
            "properties": {
                "comments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "line": {
                                "type": "integer",
                                "description": "Line number in the new file",
                            },
                            "severity": {
                                "type": "string",
                                "enum": ["error", "warning", "suggestion"],
                                "description": "error=must fix, warning=should fix, suggestion=nice to have",
                            },
                            "message": {
                                "type": "string",
                                "description": "What's wrong and how to fix it",
                            },
                        },
                        "required": ["line", "severity", "message"],
                    },
                },
                "summary": {
                    "type": "string",
                    "description": "One-line summary of the overall file quality",
                },
            },
            "required": ["comments", "summary"],
        },
    },
}


def review_file(file_diff: FileDiff) -> dict:
    """Review a single file's diff using function calling."""
    diff_text = ""
    for hunk in file_diff.hunks:
        diff_text += hunk.header + "\n"
        diff_text += "\n".join(hunk.lines) + "\n"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior engineer reviewing code changes. "
                    "Focus on bugs, security issues, performance problems, and readability. "
                    "Ignore style nitpicks unless they hurt readability. "
                    "Call the submit_review function with your findings."
                ),
            },
            {
                "role": "user",
                "content": f"Review changes to `{file_diff.filename}`:\n\n```diff\n{diff_text}\n```",
            },
        ],
        tools=[review_tool],
        tool_choice={"type": "function", "function": {"name": "submit_review"}},
        temperature=0,
    )

    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

The tool_choice parameter forces the model to call submit_review every time, so you always get structured output. No need to handle text responses or parse JSON from markdown blocks.
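Because each file review is independent, the calls can also be fanned out with a thread pool (the API requests are I/O-bound, so threads are enough). A minimal sketch: the review function is passed in as a parameter so the helper stays generic, but in the pipeline above it would be review_file.

```python
from concurrent.futures import ThreadPoolExecutor


def review_files_parallel(file_diffs, review_fn, max_workers=4):
    """Review files concurrently. pool.map preserves input order,
    so results line up with file_diffs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(review_fn, file_diffs))
```

A max_workers of 4 is an arbitrary starting point; tune it against your API rate limits.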

Running the Full Review Pipeline

Wire everything together into a single function that goes from Git diff to formatted output:

def run_code_review(
    diff_source: str = "staged", base_ref: str = "HEAD~1"
) -> list[dict]:
    """Run a full code review on Git changes.

    Args:
        diff_source: "staged" for staged changes, "commits" for commit range
        base_ref: Base reference for commit comparison (only used if diff_source="commits")
    """
    if diff_source == "staged":
        cmd = ["git", "diff", "--staged"]
    else:
        cmd = ["git", "diff", base_ref, "HEAD"]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    raw_diff = result.stdout

    if not raw_diff.strip():
        print("No changes to review.")
        return []

    file_diffs = parse_diff(raw_diff)
    all_reviews = []

    for fd in file_diffs:
        # Skip binary files and lock files
        if fd.filename.endswith((".lock", ".png", ".jpg", ".woff2")):
            continue

        print(f"Reviewing {fd.filename}...")
        review = review_file(fd)
        review["file"] = fd.filename
        all_reviews.append(review)

        # Print results
        for comment in review["comments"]:
            severity_marker = {
                "error": "[ERROR]",
                "warning": "[WARN] ",
                "suggestion": "[SUGG] ",
            }.get(comment["severity"], "[????] ")
            print(
                f"  {severity_marker} L{comment['line']}: {comment['message']}"
            )
        print(f"  Summary: {review['summary']}\n")

    return all_reviews


# Review staged changes
reviews = run_code_review(diff_source="staged")

# Or review the last commit against its parent
# reviews = run_code_review(diff_source="commits", base_ref="HEAD~1")

The output looks like this:

Reviewing src/auth.py...
  [ERROR] L42: SQL query uses string formatting instead of parameterized queries. This is vulnerable to SQL injection. Use cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,)) instead.
  [WARN]  L15: Exception handler catches bare Exception. Catch specific exceptions like ValueError or KeyError to avoid masking bugs.
  [SUGG]  L28: This function is 45 lines long. Consider extracting the validation logic into a separate validate_token() function.
  Summary: Security issue in database query needs immediate fix. Error handling could be tighter.

Building the Review Comment Data Structure

When you want to post review comments back to a PR or store them for later processing, structure them as a list of comment objects. This format works whether you’re feeding it into a GitHub API call, a Slack notification, or a database.

@dataclass
class ReviewComment:
    file: str
    line: int
    severity: str  # "error", "warning", "suggestion"
    message: str
    hunk_context: str = ""

def reviews_to_comments(reviews: list[dict]) -> list[ReviewComment]:
    """Convert raw review dicts to structured comment objects."""
    comments = []
    for review in reviews:
        for c in review.get("comments", []):
            comments.append(
                ReviewComment(
                    file=review["file"],
                    line=c["line"],
                    severity=c["severity"],
                    message=c["message"],
                )
            )
    return comments

def comments_to_payload(comments: list[ReviewComment]) -> list[dict]:
    """Convert comments to a JSON-serializable payload for any downstream API."""
    return [
        {
            "path": c.file,
            "line": c.line,
            "body": f"**[{c.severity.upper()}]** {c.message}",
        }
        for c in comments
    ]

# Usage
reviews = run_code_review(diff_source="staged")
comments = reviews_to_comments(reviews)
payload = comments_to_payload(comments)

print(json.dumps(payload, indent=2))
# [
#   {
#     "path": "src/auth.py",
#     "line": 42,
#     "body": "**[ERROR]** SQL query uses string formatting instead of parameterized queries..."
#   },
#   ...
# ]

That payload shape matches what GitHub’s PR review API expects for comments in a review submission. Adapt the body format for whatever system you’re posting to.
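For GitHub specifically, that payload slots into the body of the “create a review” endpoint (POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews). A sketch of the wiring, assuming a token with repo access; the actual POST is left commented out because it needs real credentials:

```python
def build_review_request(payload: list[dict], event: str = "COMMENT") -> dict:
    """Wrap per-line comments in the body shape GitHub's review endpoint expects.
    event can be COMMENT, APPROVE, or REQUEST_CHANGES."""
    return {"event": event, "comments": payload}


# Posting it (not executed here; requires a real token and PR):
# import requests
# requests.post(
#     f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews",
#     headers={
#         "Authorization": f"Bearer {token}",
#         "Accept": "application/vnd.github+json",
#     },
#     json=build_review_request(payload),
# )
```

The owner, repo, pr_number, and token names are placeholders you’d supply from your CI environment.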

Common Errors and Fixes

subprocess.CalledProcessError: Command '['git', 'diff', '--staged']' returned non-zero exit status 128

You’re not inside a Git repository. Either cd into one before running the script, or pass cwd to subprocess.run:

result = subprocess.run(
    ["git", "diff", "--staged"],
    capture_output=True,
    text=True,
    check=True,
    cwd="/path/to/your/repo",
)

Empty diff when you expect changes

git diff --staged only shows staged changes. If you modified files but didn’t git add them, the diff is empty. Use git diff (without --staged) for unstaged changes, or git diff HEAD for both.
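If you want one script to cover all three cases, a small dispatch helper (hypothetical, not part of run_code_review above) makes the choice explicit:

```python
def diff_command(mode: str = "staged") -> list[str]:
    """Map a review mode to the matching git diff invocation."""
    commands = {
        "staged": ["git", "diff", "--staged"],  # staged changes only
        "unstaged": ["git", "diff"],            # modified but not git-added
        "all": ["git", "diff", "HEAD"],         # staged + unstaged together
    }
    return commands[mode]
```

Pass the result straight to subprocess.run in place of the hardcoded command list.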

Token limit exceeded on large diffs

If a single file diff is huge (generated files, migrations, vendored code), skip it or truncate:

MAX_DIFF_TOKENS = 3000  # rough estimate: 1 token ~ 4 chars

def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_TOKENS * 4) -> str:
    """Cap a diff's size so one huge file can't blow the context window."""
    if len(diff_text) > max_chars:
        return diff_text[:max_chars] + "\n... (truncated, diff too large)"
    return diff_text

Model returns empty comments array for obviously bad code

Your system prompt might be too conservative. Add explicit instructions like “You must flag at least security issues, unhandled errors, and potential null references.” Also check that temperature=0 is set – higher temperatures make the model more likely to skip issues.
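For example, a stricter variant of the system prompt (an illustrative sketch, not a guaranteed fix) might read:

```python
STRICT_REVIEW_PROMPT = (
    "You are a senior engineer reviewing code changes. "
    "You must flag at least: security issues, unhandled errors, "
    "and potential null references. "
    "If the diff genuinely has no issues, return an empty comments array "
    "and explain why in the summary."
)
```

Swap this in for the system message content in review_file and compare results on a known-bad diff.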

TypeError when accessing tool_calls[0]

If the model doesn’t call the tool, tool_calls is None, and indexing None raises TypeError: 'NoneType' object is not subscriptable (rare with tool_choice set, but possible on API errors). Add a guard:

message = response.choices[0].message
if message.tool_calls:
    return json.loads(message.tool_calls[0].function.arguments)
else:
    return {"comments": [], "summary": message.content or "No review generated"}