Text Diffing: Comparing Code for Review & Debugging

What a Diff Shows You

A diff is the output of comparing two versions of text and showing exactly what changed. Lines present in both versions appear as plain context. Lines in the original but not the modified version get a minus prefix and typically render in red. Lines in the modified but not the original get a plus prefix and render in green. This format, the unified diff, is the common language of code review: GitHub pull requests, git diff and git show output, and most patch files use this convention.

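The original Unix diff command was written by Douglas McIlroy and James Hunt at Bell Labs in the early 1970s. The unified format came later: Wayne Davison developed it in 1990, and it became the standard once GNU diff adopted it as diff -u. (Larry Wall, the creator of Perl, wrote the companion patch program that applies diffs.) The "unified" name comes from showing the old and new versions in a single block with shared context lines, rather than as two separate blocks as the older context format does.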

  This line is identical in both versions — it provides context
- This line was removed from the original document
+ This line was added in the modified document
  This line is also unchanged — more context around the change
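
The same kind of output can be produced programmatically. The following is a minimal sketch using Python's standard difflib module; the file names original.txt and modified.txt are placeholders chosen for illustration, not files from this article.

```python
import difflib

# Two small "files", each represented as a list of lines.
old = [
    "This line is identical in both versions",
    "This line was removed from the original document",
    "This line is also unchanged",
]
new = [
    "This line is identical in both versions",
    "This line was added in the modified document",
    "This line is also unchanged",
]


def make_unified_diff(old_lines, new_lines):
    """Return the unified diff of two lists of lines as a list of strings."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="original.txt",  # placeholder name shown in the header
            tofile="modified.txt",    # placeholder name shown in the header
            lineterm="",              # input lines carry no trailing newline
        )
    )


for line in make_unified_diff(old, new):
    print(line)
```

The output starts with two file header lines (--- and +++), followed by a hunk header and the prefixed lines. Real diff output puts the prefix character directly against the line content, without the extra space used in the illustration above.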

How Diff Algorithms Work

The algorithm that powers most diff tools is based on the longest common subsequence (LCS) problem: given two sequences, find the longest sequence of elements that appears in both in the same order, though not necessarily contiguously. The differences are everything not in the LCS. The algorithm most diff tools build on is Eugene Myers' diff algorithm (1986), which runs in O(ND) time, where N is the combined length of the two inputs and D is the size of the shortest edit script (the number of inserted and deleted lines). When the two inputs are similar, as with most code changes, D is small and the algorithm is fast. When they're completely different, D approaches N and the running time degrades to O(N²).
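
To make the LCS idea concrete, here is a small sketch of a line diff built on the textbook dynamic-programming LCS table. It is not Myers' algorithm: it always takes time and memory proportional to the product of the two lengths, which is fine for illustration but not how production tools work. The function names lcs_table and line_diff are invented for this example.

```python
def lcs_table(a, b):
    """Build a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(a, b):
    """Return (tag, line) pairs: ' ' for common, '-' for removed, '+' for added."""
    table = lcs_table(a, b)
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The line is part of the common subsequence: emit it as context.
            result.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] loses nothing from the LCS, so it is a removal.
            result.append(("-", a[i]))
            i += 1
        else:
            result.append(("+", b[j]))
            j += 1
    # Whatever remains on either side is pure removal or pure addition.
    result.extend(("-", line) for line in a[i:])
    result.extend(("+", line) for line in b[j:])
    return result


for tag, line in line_diff(["a", "b", "c", "d"], ["a", "c", "d", "e"]):
    print(tag, line)
```

Everything tagged with a space is the LCS; the minus and plus entries are the edit script. Myers' contribution was finding the same shortest edit script without filling in the whole table.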

Modern diff tools use variations and optimizations of Myers' algorithm. Git uses Myers by default and offers alternatives through --diff-algorithm (or the diff.algorithm configuration setting): minimal, patience, and histogram. The "patience diff" algorithm (git diff --patience) first matches lines that appear exactly once in each file, such as function signatures, and then diffs the regions between those anchors. The histogram algorithm (git diff --histogram) extends patience to handle lines that occur only a few times. Both often produce more readable diffs when code has been reordered or when blocks share boilerplate lines such as closing braces.

Reading Diffs: The Hunk Header

Every diff "hunk" (a block of changes with surrounding context) starts with a header showing the line ranges in both files:

@@ -10,7 +10,8 @@ function processData(items) {

This means: in the old file, the hunk covers 7 lines starting at line 10. In the new file, it covers 8 lines starting at line 10, so this hunk adds one line on balance. The text after the second @@ is the nearest preceding line that looks like a function or section heading. Git tries to show you what function you're in, which is invaluable when scrolling through a large diff.
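
The header is regular enough to parse with a single regular expression. The sketch below assumes the standard unified format, in which an omitted line count means 1; the function name parse_hunk_header and the dictionary keys are illustrative choices.

```python
import re

# @@ -<old_start>[,<old_count>] +<new_start>[,<new_count>] @@ [section text]
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")


def parse_hunk_header(line):
    """Parse a unified diff hunk header into its line ranges and section text."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count, section = match.groups()
    return {
        "old_start": int(old_start),
        # A missing count means the range covers exactly one line.
        "old_count": int(old_count) if old_count is not None else 1,
        "new_start": int(new_start),
        "new_count": int(new_count) if new_count is not None else 1,
        "section": section,
    }


header = parse_hunk_header("@@ -10,7 +10,8 @@ function processData(items) {")
print(header["new_count"] - header["old_count"])  # net lines added by the hunk
```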

Line-by-Line vs. Character-Level Diffs

Line-by-line comparison treats each line as atomic. It's fast, it's git's default output, and it's what most code review tools display. The downside: a one-character change on a long line shows the entire line as modified, even though 99% of the characters are unchanged. Word-level and character-level diffs go deeper, highlighting exactly which characters or words within a line are different. GitHub's diff viewer, for example, highlights the changed portion inside a modified line in addition to marking the whole line. For catching a single-character typo in a long configuration line, this is the difference between spotting the bug instantly and searching for it for minutes. Our Diff Checker uses line-by-line comparison, which covers most text comparison needs.
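
As a rough illustration of character-level comparison, the sketch below uses difflib.SequenceMatcher from Python's standard library to report only the spans that differ between two versions of a line. The function name inline_changes and the sample configuration lines are made up for this example, and difflib's matching is its own heuristic rather than what any particular review tool uses.

```python
import difflib


def inline_changes(old_line, new_line):
    """Return (tag, old_text, new_text) for each differing span of two lines."""
    matcher = difflib.SequenceMatcher(a=old_line, b=new_line, autojunk=False)
    return [
        (tag, old_line[i1:i2], new_line[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # skip the spans both lines share
    ]


old_line = 'db_host = "prod-db-1.internal"'
new_line = 'db_host = "prod-db-7.internal"'
print(inline_changes(old_line, new_line))
```

A line-level diff would flag the whole line; here the result narrows the change down to the single character that differs.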

Common Diff Workflows

Code review. The diff is the unit of code review. Before merging a PR, scan the diff for: logic changes that should be accompanied by test changes (are they?), new dependencies or imports (are they approved?), hardcoded values that should be configurable, debug logging that was accidentally committed, and changes to files you didn't expect to be modified. The size of the diff is itself a signal — a 500-line diff across 20 files warrants different scrutiny than a 3-line fix.

Configuration debugging. "The app works on staging but not production." Export both configurations, diff them. The culprit is often a single line — a database URL pointing to the wrong environment, a feature flag that's off, an API key that's missing, a memory limit that's lower. A diff makes the difference instantly visible, while manually comparing two config files side by side is an exercise in missing the one line that matters.

Test output comparison. Automated tests that fail with "expected X, got Y" are vastly improved by showing a diff of X vs Y. Most modern test frameworks do this for assertion failures. For custom test scripts that don't, piping output through diff is a one-line addition that saves hours of debugging.

Document versioning. Track changes between versions of contracts, policies, or specifications. Lawyers and compliance officers routinely use redline comparisons — a diff by another name. Understanding what changed between version 2.3 and 2.4 of a dependency's terms of service can be a legal requirement.

Log file analysis. Compare log output from a working run against a failing run. The diff immediately surfaces what's different — an extra error message, a missing initialization line, a different configuration value being loaded.
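
For the configuration-debugging workflow described above, a plain text diff can be noisy when the two files list the same settings in a different order. One possible approach is to compare by key instead. The sketch below assumes a simple KEY=VALUE file format with # comments, which may not match your configuration files; parse_config and config_differences are hypothetical helper names.

```python
def parse_config(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        settings[key.strip()] = value.strip()
    return settings


def config_differences(staging_text, production_text):
    """Return sorted (key, staging_value, production_value) for differing keys.

    A key that is absent on one side is reported with None on that side.
    """
    staging = parse_config(staging_text)
    production = parse_config(production_text)
    differences = []
    for key in sorted(staging.keys() | production.keys()):
        left, right = staging.get(key), production.get(key)
        if left != right:
            differences.append((key, left, right))
    return differences


staging = "DATABASE_URL=postgres://staging-db/app\nFEATURE_X=on\nMEMORY_LIMIT=512m\n"
production = "# production\nMEMORY_LIMIT=256m\nDATABASE_URL=postgres://staging-db/app\n"
for key, left, right in config_differences(staging, production):
    print(f"{key}: staging={left!r} production={right!r}")
```

Because the comparison is by key, reordered lines and comments produce no output, and only the missing feature flag and the lower memory limit are reported.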

Tools Beyond git diff

Git's built-in diff is the standard, but several specialized tools exist for different workflows: git diff --word-diff highlights changed words within lines using inline markers. git difftool opens an external visual diff tool. vimdiff shows files side by side in Vim with color-coded differences — steep learning curve but extremely fast for Vim users. meld provides a GUI for directory-level comparison, useful for comparing entire project trees. GitHub, GitLab, and Bitbucket all have web-based diff viewers with syntax highlighting and inline comment threading. Our Diff Checker provides a no-setup, browser-based way to compare two blocks of text in seconds.

Quick Reference

Command	Purpose
git diff	Unstaged working tree changes
git diff --staged	Staged changes before commit
git diff main...feature	Changes on feature since it diverged from main

When Diffs Are the Wrong Tool

Diffs work well for source code and structured text. They work poorly for binary files, where the output is meaningless. They are semantically unaware: moving a function shows as a deletion and an addition, and the diff format itself carries no indication that the code is the same. Git can color moved lines with --color-moved and detect renamed files with --find-renames, but both are heuristics based on textual similarity. For semantically meaningful comparisons (did this function's behavior change?) you need tests, not diffs.

Diffing Non-Code Content

Diffs are useful for more than code. Compare two versions of a CSV export to find what changed between daily data dumps. Diff two log files from different servers to identify why one is behaving differently. Compare configuration files between environments to find the one setting that is different. Diff two SQL dump files to verify a migration produced identical schemas. In each case, the diff shows exactly what changed — and what did not change — in a format that is scannable and precise.

For binary and structured formats, specialized diff tools exist. Image diff tools highlight visual differences between two screenshots (useful for visual regression testing). PDF diff tools compare text content between document versions. JSON diff tools compare structured data semantically, ignoring key order and whitespace differences. Use the right diff tool for the content type: a text diff of two binary files produces meaningless output.
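
To show what "semantic" comparison means for JSON, here is a minimal sketch that parses both documents and walks them in parallel, so key order and whitespace never register as changes. It is an illustration under simple assumptions, not a replacement for a dedicated JSON diff tool: lists are compared position by position, and the names json_diff and MISSING are invented for this example.

```python
import json

# Sentinel for "key absent", so it cannot be confused with a JSON null.
MISSING = object()


def json_diff(old, new, path="$"):
    """Return a list of (path, old_value, new_value) for values that differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in old:
                changes.append((child, MISSING, new[key]))
            elif key not in new:
                changes.append((child, old[key], MISSING))
            else:
                changes.extend(json_diff(old[key], new[key], child))
        return changes
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        changes = []
        for index, (left, right) in enumerate(zip(old, new)):
            changes.extend(json_diff(left, right, f"{path}[{index}]"))
        return changes
    # Scalars, mismatched types, or lists of different length: compare whole.
    return [] if old == new else [(path, old, new)]


a = json.loads('{"name": "api", "ports": [80, 443], "debug": false}')
b = json.loads('{\n  "debug": false,\n  "ports": [80, 8443],\n  "name": "api"\n}')
print(json_diff(a, b))
```

A text diff of these two documents would mark nearly every line as changed because of the reordering and reformatting; the structural comparison reports only the one port number that differs.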