Embeddings — AIght

What if every word had an address? Not a dictionary entry — a location in a map of meaning, where proximity meant similarity and distance meant difference. Where "joyful" and "elated" lived a few blocks apart, "cold" and "winter" were neighbors, and "photosynthesis" was somewhere far across town from "heartbreak."

That's what embeddings are. An embedding is a list of numbers, hundreds or thousands of them, that represents the meaning of a piece of text. An embedding model learned, from reading an enormous amount of text, to place similar things near each other in this space and different things farther apart. The numbers themselves are arbitrary. The distances between them are what matter.

This sounds abstract. The useful thing is that it works.

The geometry of meaning

When researchers first noticed that these vectors had spatial structure (not just proximity between synonyms, but meaningful arithmetic), it was startling enough to become a famous example.

Take the vector for "king." Subtract "man." Add "woman." The result lands very close to "queen." The geometry of gender, encoded in the space, applied consistently across concept pairs.

Nobody hand-coded that. The model learned it from patterns in text. And it holds across many kinds of relationships — capitals and countries, verb tenses, comparatives. The structure of meaning shows up as structure in space.

Relationships show up as consistent directions. The step from “man” to “woman” roughly matches the step from “king” to “queen”, so king − man + woman lands near queen. The coordinates are arbitrary; the directions carry the meaning.
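To make the arithmetic concrete, here is a minimal sketch using hand-made two-dimensional vectors. The numbers and the names `toy_vectors` and `nearest` are invented for illustration and do not come from any real model; real embeddings have hundreds of dimensions, and the analogy only holds approximately there.

```python
import numpy as np

# Hand-made toy vectors, NOT output from a real embedding model.
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target: np.ndarray, vectors: dict, exclude=()) -> str:
    """Return the word whose vector is most cosine-similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# king - man + woman, then look up the closest remaining word.
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
answer = nearest(target, toy_vectors, exclude={"king", "man", "woman"})
print(answer)  # queen

# The "man -> woman" step and the "king -> queen" step point the same way.
same_direction = np.allclose(
    toy_vectors["woman"] - toy_vectors["man"],
    toy_vectors["queen"] - toy_vectors["king"],
)
```

The input words are excluded from the lookup because, with real embeddings, the closest vector to the result is often one of the words you started with.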

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed a collection of texts
documents = [
    "the cat sat on the mat",
    "a feline rested on the rug",
    "machine learning optimizes parameters",
]
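# `embed` is a placeholder for whichever embedding model call you use;
# it is assumed to take a string and return a list of floats.
doc_embeddings = [embed(doc) for doc in documents]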

# Find which document is closest to a query
query_vec = embed("a cat lying down")
scores = [cosine_similarity(query_vec, e) for e in doc_embeddings]
# Illustrative values, e.g. scores ≈ [0.91, 0.87, 0.22]; exact numbers depend on the model
# The first two documents are semantically close to the query; the third is unrelated

A search for "rental car" can return results about "vehicle hire" — the query and the document live near each other even when they share no words.
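For contrast, a small sketch of what plain keyword matching sees for that same pair. The helper `shared_words` is a hypothetical name used only for this illustration, and it uses naive whitespace tokenization.

```python
def shared_words(a: str, b: str) -> set[str]:
    """Return the lowercase words that appear in both strings."""
    return set(a.lower().split()) & set(b.lower().split())

overlap = shared_words("rental car", "vehicle hire")
print(overlap)  # set()

# A keyword-only search has nothing to match on here.
keyword_hit = len(overlap) > 0
```

With zero words in common, a purely lexical search cannot connect the two phrases; an embedding-based search compares their vectors instead.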

Words with similar meanings end up neighbours. This is why AI can search by meaning, not just keywords.

Why this matters for AI tools

Embeddings are the foundation of most of what feels "smart" about modern AI search and retrieval. When you upload a document and an AI answers questions about it accurately, something like this happened: your document was chunked and embedded, your question was embedded, and the chunks whose embeddings landed closest to your question's embedding got handed to the model.

This is why retrieval-augmented generation (RAG) works. The retrieval step, finding the right chunks, is a search through embedding space.

Modern sentence embedding models can place not just words but whole paragraphs into the same space, which is how "talk to your docs" became a one-weekend project.
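The chunk, embed, and retrieve flow described above can be sketched roughly as follows. This is a simplified illustration, not how any particular product does it: `chunk_text` and `top_k` are hypothetical helpers, the fixed word-count chunking is an assumption, and the three-dimensional vectors are made up to stand in for real embeddings.

```python
import numpy as np

def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def top_k(query_vec, chunk_vecs, k: int = 2) -> list[int]:
    """Return indices of the k chunks most cosine-similar to the query, best first."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return [int(i) for i in order]

# Made-up 3-D vectors standing in for real chunk embeddings.
chunks = ["refund policy", "shipping times", "office dog photos"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.6, 0.7, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "how do I get my money back?"

best = top_k(query_vec, chunk_vecs, k=2)
retrieved = [chunks[i] for i in best]
print(retrieved)  # ['refund policy', 'shipping times']
```

The retrieved chunks are what would be handed to the language model alongside the question. Production systems typically replace the brute-force comparison with an approximate nearest-neighbor index, but the idea is the same.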

What embeddings can and can't do

They capture semantic similarity well. General-purpose embedding models are good at measuring whether two pieces of text are about the same thing. Specialized models exist for code, multilingual content, and long documents — each trained on the kinds of text they're meant to handle.

They're not magic. An embedding model has no special knowledge about your domain. If your documents use internal jargon that doesn't appear in its training data, the embeddings may not capture what you actually mean. Fine-tuned embedding models exist for exactly this reason.

Dimensionality matters. Larger embedding vectors (more numbers) can capture more nuance, but take more memory and compute to search. Most production systems balance somewhere between a few hundred and a few thousand dimensions, depending on the latency-vs-accuracy tradeoff they're willing to make.
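A back-of-the-envelope sketch of the memory side of that tradeoff. It assumes each value is stored as a 4-byte float and ignores any index overhead or compression; the function name `index_size_bytes` and the dimension counts are illustrative choices, not recommendations.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors embeddings of the given dimensionality.

    Assumes uncompressed values (4 bytes each by default) and no index overhead.
    """
    return num_vectors * dims * bytes_per_value

# Raw size of one million vectors at a few illustrative dimensionalities.
for dims in (384, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB")
```

Storage grows linearly with the number of dimensions, and so does the cost of a brute-force similarity comparison, which is why doubling the vector size is not a free accuracy upgrade.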

Semantic ≠ what you meant. Two sentences can be "semantically similar" in a way that doesn't help you. "This product is amazing" and "This product is terrible" are about the same thing, and their embeddings often land close together. If you need polarity, you need more than embeddings.

The coordinates are arbitrary. The distances are not. That gap — between the representation and the meaning it encodes — is where most of the interesting questions about what AI "understands" actually live.

For now, the practical outcome is this: text can be placed in a space where proximity means something, and that space can be searched. It's a small, strange, quietly powerful idea.