A text watermark biases the model's token sampling toward a hidden pattern. The pattern is invisible to readers but detectable statistically by anyone with the secret key.
The basic idea
At each generation step, the model picks the next token from a probability distribution. A watermarking scheme uses a pseudo-random function, seeded by the secret key and the most recent tokens, to split the vocabulary into "green" and "red" lists for that step. Sampling is then nudged toward green tokens, for example by adding a small bias to their scores. Over a long enough text, the excess of green tokens becomes statistically detectable.
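The partition-and-nudge step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any vendor's actual scheme: the names `SECRET_KEY`, `GREEN_FRACTION`, and `BIAS`, the choice to seed from only the single previous token, and the flat-logit stand-in for a real model are all hypothetical.

```python
import hashlib
import math
import random

SECRET_KEY = b"demo-key"   # hypothetical key; a real deployment would keep this private
VOCAB_SIZE = 1000          # toy vocabulary of integer token ids
GREEN_FRACTION = 0.5       # assumed share of the vocabulary marked green at each step
BIAS = 2.0                 # assumed amount added to green-token logits


def green_list(prev_token: int) -> set[int]:
    """Derive this step's green list from the secret key and the previous token."""
    digest = hashlib.sha256(SECRET_KEY + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    return set(ids[: int(GREEN_FRACTION * VOCAB_SIZE)])


def sample_next(logits: list[float], prev_token: int, rng: random.Random) -> int:
    """Add BIAS to green-token logits, then sample from the resulting softmax."""
    green = green_list(prev_token)
    biased = [x + BIAS if i in green else x for i, x in enumerate(logits)]
    peak = max(biased)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in biased]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length: int, seed: int = 0) -> list[int]:
    """Generate tokens from a stand-in 'model' whose logits are all equal."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(length):
        logits = [0.0] * VOCAB_SIZE  # placeholder for real model logits
        tokens.append(sample_next(logits, tokens[-1], rng))
    return tokens


def green_rate(tokens: list[int]) -> float:
    """Fraction of tokens that fall in the green list seeded by their predecessor."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)
```

Under these assumptions, `green_rate(generate(200))` should land well above one half, while a list of uniformly random token ids should score close to `GREEN_FRACTION`. Only someone who holds the key can recompute the green lists, which is why the pattern stays invisible to ordinary readers.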
Why detection is hard
- Short text. Statistical signals need length to emerge. A tweet can't carry a robust watermark.
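- Editing. Replacing 20% of the tokens dilutes the green-token excess, which is often enough to push a short or medium-length text below the detection threshold.
- Translation. Round-tripping through another language regenerates nearly every token, which typically erases the watermark.
- Mixed authorship. Human-edited AI text falls between the score distributions of purely human and purely machine text, so detectors return ambiguous scores.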
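The length and editing effects above can be made concrete with the standard one-proportion z-test on the green-token count. The sketch below is illustrative only: the 70% green rate, the function names, and the simple edit model are assumptions. The edit model supposes that each token is replaced independently and that green lists are seeded by a single previous token, so a scored token keeps its watermark only if both it and its predecessor survive.

```python
import math


def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: how far the green count sits above chance.

    gamma is the fraction of the vocabulary that is green at each step.
    """
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def expected_z(total: int, green_rate: float, edited: float = 0.0,
               gamma: float = 0.5) -> float:
    """Expected z-score after a fraction `edited` of tokens is replaced.

    Assumption: a token still counts as watermarked only if it and the
    token before it were both left untouched, hence (1 - edited) ** 2.
    Every other token is green at the chance rate gamma.
    """
    intact = (1.0 - edited) ** 2
    rate = intact * green_rate + (1.0 - intact) * gamma
    return (rate - gamma) * math.sqrt(total) / math.sqrt(gamma * (1.0 - gamma))


# Same 70% green rate, different lengths (hypothetical numbers).
short_z = expected_z(25, 0.7)
long_z = expected_z(400, 0.7)

# A 100-token text before and after replacing 20% of its tokens.
clean_z = expected_z(100, 0.7)
edited_z = expected_z(100, 0.7, edited=0.2)
```

With these hypothetical numbers, the 25-token text scores about 2.0 and the 400-token text about 8.0 at the same green rate, and the 100-token text drops from about 4.0 to about 2.6 after editing. A detector that demands a high z-score to keep false positives rare would therefore miss both the short text and the edited one.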
What this means practically
Don't trust AI-text detectors for adversarial use cases (academic fraud, deepfake provenance). They have real false-positive rates on honest human writers, especially non-native English speakers and people who write in genre patterns the model also produces.
Watermarks may help in cooperative contexts: a platform that voluntarily marks its own outputs so downstream systems can detect them.
What to read next
Model collapse is the recursive-training problem watermarking partly addresses. Synthetic data is the related curation question.