Google's diff-match-patch: The Algorithm That Powered Google Docs

What it does

diff-match-patch is a library that diffs plain text, fuzzy-matches strings, and applies patches even when the underlying text has changed. Available in C++, C#, Dart, Java, JavaScript, Lua, Objective-C, and Python — all from one repo, all sharing the same API.

Why I starred it

Most diff libraries give you diff. Maybe patch. This one adds fuzzy matching as a first-class operation and combines all three into a single coherent API. It was built by Neil Fraser at Google in 2006 to power the collaborative editing in Google Docs. That origin story matters — this isn't a weekend project. It was designed for a system where patches had to apply reliably against text that was being modified by multiple people simultaneously.

The repo has 8,000+ stars but the git history is a single commit from 2019. This is finished software. No churn, no breaking changes, no dependency treadmill. Each language port is a single file with zero external dependencies.

How it works

The architecture is dead simple: one class, three groups of methods (diff, match, patch), and a handful of configurable thresholds. The JavaScript implementation in javascript/diff_match_patch_uncompressed.js is 2,236 lines. That's the entire thing — diff engine, fuzzy matcher, and patch applicator.

The diff algorithm is a textbook implementation of Myers' diff with a stack of speedups layered on top. The entry point diff_main (line 105) strips common prefixes and suffixes before doing any real work. Then diff_compute_ (line 174) runs through a cascade of fast paths: equal strings, one string containing the other, single-character strings. Only when none of those match does it fall through to the actual Myers bisect at diff_bisect_ (line 316).

The bisect walks the edit graph from both ends simultaneously, checking a deadline on every iteration:

for (var d = 0; d < max_d; d++) {
  if ((new Date()).getTime() > deadline) {
    break;
  }
  // Walk the front path one step...
  // Walk the reverse path one step...
}

That Diff_Timeout defaults to 1.0 seconds. If the diff can't complete in time, it bails and returns a rough approximation. This is the kind of practical tradeoff you'd expect from production Google code — correctness when possible, degraded results over hanging.

The half-match optimization (diff_halfMatch_ at line 668) is particularly clever. Before running the full bisect, it checks whether a substring of at least half the longer text's length exists in the shorter text. If so, it splits both strings around that shared block and diffs each half independently. This turns O(n^2) into two smaller problems when the texts share a large common section — exactly what happens when you're editing a document and most of it hasn't changed.

The line-level pre-pass (diff_lineMode_ at line 246) is where things get interesting. For texts over 100 characters, it first maps each line to a single Unicode character via diff_linesToChars_:

function diff_linesToCharsMunge_(text) {
  var chars = '';
  var lineStart = 0;
  var lineEnd = -1;
  while (lineEnd < text.length - 1) {
    lineEnd = text.indexOf('\n', lineStart);
    // ...
    chars += String.fromCharCode(lineArrayLength);
    lineHash[line] = lineArrayLength;
    lineArray[lineArrayLength++] = line;
  }
  return chars;
}

It collapses each unique line into a single char, diffs the compressed version (which is now a character-level diff on what are effectively line tokens), then expands back and re-diffs the changed regions character-by-character. The limit is 65,535 unique lines — the range of String.fromCharCode. A comment on line 503 notes: "Bail out at 65535 because String.fromCharCode(65536) == String.fromCharCode(0)."

The Bitap fuzzy matcher (match_bitap_ at line 1470) is the part most people overlook. It uses bitwise operations to find approximate string matches weighted by both accuracy and proximity to an expected location. The scoring function balances edit distance against positional distance:

function match_bitapScore_(e, x) {
  var accuracy = e / pattern.length;
  var proximity = Math.abs(loc - x);
  return accuracy + (proximity / dmp.Match_Distance);
}

The match feeds directly into patch_apply (line 1819), which is the real payoff. When applying a patch, if the text has shifted, the fuzzy matcher locates where the patch target moved to and applies the changes there. It even handles imperfect matches by running a diff between the expected and actual text at the match location and remapping the patch operations through the alignment.

Using it

const dmp = new diff_match_patch();

// Diff
const diffs = dmp.diff_main("Hello World", "Hello Brave World");
// [[0,"Hello "],[1,"Brave "],[0,"World"]]

// Fuzzy match — find "World" near position 0
dmp.match_main("Hello Brave World", "World", 0);
// Returns 12

// Patch — create and apply even when text changed
const patches = dmp.patch_make("Hello World", "Hello Brave World");
const [result, flags] = dmp.patch_apply(patches, "Greetings World");
// result: "Greetings Brave World", flags: [true]

That last example is the key. The patch was created against "Hello World" but applied successfully to "Greetings World". The fuzzy matcher found where "World" ended up and inserted "Brave " in the right place.

Rough edges

The repo's git history is literally one commit. The code was clearly developed elsewhere (Neil Fraser's personal site hosts the original) and dumped onto GitHub as a snapshot. There are no issues enabled, no CI, no package.json, no npm publish. If you want to use it in a Node project, you're copying the file or relying on community-maintained wrappers like diff-match-patch on npm.

The C++ port requires Qt (QString, QList, QMap) — a heavyweight dependency for what should be a self-contained algorithm. The header at cpp/diff_match_patch.h is only 625 lines but it pulls in all of QtCore.

The Match_MaxBits limit of 32 means fuzzy matching is capped at 32-character patterns in the default configuration. The Bitap algorithm uses bitwise operations on integers, so the pattern length is bounded by the word size. The code throws an error rather than falling back: "Pattern too long for this browser." A bit abrupt.

Documentation lives on the GitHub wiki and Neil Fraser's personal site. No inline API docs beyond JSDoc comments. If those sites go down, you're reading source.

Bottom line

If you need to diff, fuzzy-match, or patch plain text and you want something battle-tested with zero dependencies, this is it. Twenty years of production use at Google scale, ported to eight languages, and the entire implementation fits in a single file per language.