How Text Diff Algorithms Work: From Myers to Patience

Every time you run git diff, review a pull request, or resolve a merge conflict, a diff algorithm is working behind the scenes to determine the smallest set of changes that transforms one text into another. Diff algorithms are one of those foundational pieces of software infrastructure that most developers use daily without thinking about how they work. Understanding the algorithms behind diff not only satisfies intellectual curiosity — it helps you interpret confusing diffs, choose better Git settings, and build more effective comparison tools.

This article traces the evolution of text diff algorithms from the foundational Longest Common Subsequence problem through the Myers diff algorithm (the default in Git) to patience diff and beyond, with practical examples showing how each approach handles real code changes.

The Foundation: Longest Common Subsequence (LCS)

At the heart of every diff algorithm is the Longest Common Subsequence (LCS) problem. Given two sequences, the LCS is the longest sequence of elements that appear in the same order in both, but not necessarily consecutively. Once you have the LCS, everything that is not in the LCS represents either an addition or a deletion.

Consider two simple sequences:

A: [a, b, c, d, f]
B: [a, c, d, e, f]

LCS: [a, c, d, f]

Diff:
  a       (unchanged)
- b       (deleted from A)
  c       (unchanged)
  d       (unchanged)
+ e       (added in B)
  f       (unchanged)

The classic dynamic programming solution for LCS runs in O(n × m) time and space, where n and m are the lengths of the two sequences. For two files of 10,000 lines each, that is a 100-million-cell table — functional but not ideal for interactive tools that need to produce results in milliseconds.

The Edit Graph Model

A more useful way to think about the diff problem is as a shortest path through an edit graph. Imagine a grid where the x-axis represents lines of file A and the y-axis represents lines of file B. Moving right means deleting a line from A, moving down means inserting a line from B, and moving diagonally (when lines match) means keeping a line unchanged. The diff problem becomes: find the path from the top-left corner to the bottom-right corner that uses the fewest horizontal and vertical moves (edits).

Myers Diff Algorithm

Published by Eugene Myers in 1986, the Myers diff algorithm is the default algorithm used by Git and most Unix diff tools. Its key insight is to search the edit graph using a breadth-first strategy that explores paths with the fewest edits first, guaranteeing a minimal edit script — the smallest possible set of additions and deletions.

How It Works

Myers diff uses a concept of D-paths, where D is the number of non-diagonal (edit) moves. The algorithm iteratively explores all possible paths with 0 edits, then 1 edit, then 2, and so on, greedily extending each path along diagonals (matching lines) as far as possible before making the next edit move.

The algorithm can be summarized as:

  1. Start at position (0, 0) in the edit graph
  2. For D = 0, 1, 2, ... up to the maximum possible edits:
  3. For each possible diagonal k = -D, -D+2, ..., D-2, D:
  4. Determine the furthest-reaching point on diagonal k using D edit moves
  5. Extend the path along the diagonal (matching lines) as far as possible
  6. If the endpoint (n, m) is reached, trace back to reconstruct the edit script

The time complexity is O(n + m + D²) where D is the size of the minimum edit script. When the two files are similar (small D), this is much faster than the naive O(n × m) LCS approach. For identical files (D = 0), it runs in O(n) — just a linear scan confirming every line matches.

A Concrete Example

File A:           File B:
1: alpha          1: alpha
2: beta           2: gamma
3: gamma          3: delta
4: delta          4: epsilon

Myers diff output:
  alpha
- beta
  gamma
  delta
+ epsilon

Myers correctly identifies that beta was removed and epsilon was added, with the minimum total edit count of 2. This seems straightforward, but the choice of which lines to match becomes critical when there are multiple valid minimal diffs.

The Problem with Myers: Ambiguous Diffs

When there are multiple minimum edit scripts of the same length, Myers diff picks one based on its search order, which does not always produce the most human-readable output. Consider this common scenario when a function is added between two existing functions:

// Myers might produce:
  function foo() {
+   return 1;
+ }
+
+ function bar() {
    return 2;
  }

// But a human would prefer:
  function foo() {
    return 1;
  }

+ function bar() {
+   return 2;
+ }

Both diffs represent the same change with the same number of edits, but the second one is far easier to understand because it groups the new function as a cohesive block. This is where patience diff and other heuristic approaches come in.

Patience Diff Algorithm

Patience diff, developed by Bram Cohen (the creator of BitTorrent), takes a fundamentally different approach to producing human-readable diffs. Instead of optimizing purely for the minimum number of edits, it prioritizes matching unique lines — lines that appear exactly once in each file — and uses them as anchors to align the diff.

The Algorithm

  1. Find unique matching lines: Identify all lines that appear exactly once in file A and exactly once in file B, and that match between the two files
  2. Compute LCS of unique lines: Find the longest common subsequence of these unique lines — this becomes the skeleton of the alignment
  3. Recursively diff the gaps: For each section between anchored unique lines, recursively apply the patience diff algorithm
  4. Fall back to Myers: When no unique lines remain in a section, use Myers diff as a fallback

The genius of this approach is that unique lines — function signatures, class declarations, section headers — tend to be the most semantically meaningful lines in code. By anchoring the diff around these lines, patience diff produces output that aligns with how humans think about code structure.

You can use patience diff in Git with:

git diff --patience
git config diff.algorithm patience

Other Diff Strategies in Git

Git supports several diff algorithms, each with different trade-offs:

  • Myers (default) — Fast, minimal edit count. Best general-purpose algorithm. Use when speed matters and diffs are typically small.
  • Patience — Better grouping of changes, especially for code. Slightly slower than Myers. Excellent for code review where readability matters.
  • Histogram — An optimization of patience diff that handles low-occurrence (not just unique) lines. Often produces similar results to patience but faster. This is the default in JGit (used by Eclipse and Gerrit).
  • Minimal — Like Myers but spends extra time to guarantee the absolute minimum edit distance. Slower but produces the most compact diff possible.

To configure your preferred algorithm globally:

git config --global diff.algorithm histogram

Beyond Line-Level Diff: Semantic and Structural Diffing

Traditional diff algorithms operate on lines of text — they have no understanding of the content's meaning. This leads to limitations that semantic and structural diff tools address:

Word-Level and Character-Level Diff

When a single line has a small change, a line-level diff shows the entire line as deleted and re-added. Word-level or character-level diff highlights exactly which words or characters changed within the line. Git supports this with git diff --word-diff, and most modern code review tools display inline character-level highlighting within changed lines.

AST-Based (Structural) Diff

Tools like GumTree, difftastic, and Semantic parse source code into an Abstract Syntax Tree (AST) and diff at the tree level rather than the text level. This means they can understand that renaming a variable or moving a function is a single semantic operation, rather than displaying it as dozens of unrelated line changes. AST-based diffing is particularly valuable for:

  • Detecting moved code blocks (refactoring)
  • Ignoring whitespace and formatting changes that do not affect behavior
  • Understanding changes across languages with different syntactic conventions

Building a Diff Tool: Practical Considerations

If you are implementing a diff feature in your own application, here are the key decisions and trade-offs:

  • Preprocessing: Trim trailing whitespace and normalize line endings before diffing. Most unexpected diff noise comes from invisible whitespace differences.
  • Tokenization: For word-level or semantic diff, split lines into tokens (words, operators, strings) before running the diff algorithm.
  • Context lines: Show 3 lines of unchanged context around each change (the Unix diff convention). This provides enough surrounding code for a reviewer to understand the change without showing the entire file.
  • Performance: For very large files, consider using the linear-space refinement of Myers diff (described in the original paper) which reduces memory usage from O(D²) to O(D) at the cost of running the algorithm twice.
  • Unified vs. side-by-side: Unified diff (the git diff default) is compact and works well in terminals. Side-by-side diff is easier to scan visually in GUI tools. Offer both when possible.

Compare text instantly — paste two versions and see the differences highlighted side by side.

Open Diff Checker →

Conclusion

Diff algorithms are a beautiful example of a seemingly simple problem — "what changed between these two files?" — that has surprising depth. The naive O(n × m) LCS approach works but does not scale. Myers diff solved the performance problem and became the industry standard. Patience diff and histogram diff improved readability for code review. And modern AST-based tools are pushing the boundary toward truly semantic understanding of changes.

The next time a git diff shows a confusing hunk, try switching to --patience or --histogram and see if the output makes more sense. And if you need to quickly compare two text blocks without firing up a terminal, our Diff Checker is always a click away.