
Concept

Embeddings

The coordinates that give language a sense of direction

Mankaran Singh·Updated May 17, 2026

Where this idea lives

Prerequisites and related concepts: Tokenization, Retrieval-Augmented Generation, Transformers, Multimodal Models
Tools that show it: Perplexity, NotebookLM, Consensus, AlphaSense
Shows up in: Biology & Life Sciences, Chemistry & Materials Science, History & Humanities, Pharmacy & Drug Discovery
You might think (common misconceptions): "Embeddings are just word2vec." "Closer in the embedding space means closer in meaning." "Embeddings from different providers are interchangeable."

What if every word had an address? Not a dictionary entry — a location in a map of meaning, where proximity meant similarity and distance meant difference. Where "joyful" and "elated" lived a few blocks apart, "cold" and "winter" were neighbors, and "photosynthesis" was somewhere far across town from "heartbreak."

That's what embeddings are. A list of numbers, hundreds or thousands of them, that represents the meaning of a piece of text. An embedding model learned, from reading an enormous amount of text, to place similar things near each other in this space and different things farther apart. The numbers themselves are arbitrary. The distances between them are what matter.

If you've ever wondered how a search engine knows that "cheap flights to Rome" and "affordable airfare to Italy" mean roughly the same thing, you're wondering about embeddings.

This sounds abstract. The useful thing is that it works.


The geometry of meaning

When researchers first noticed that these vectors had spatial structure (not just proximity between synonyms, but meaningful arithmetic), it was startling enough to become a famous example.

Take the vector for "king." Subtract "man." Add "woman." The result lands very close to "queen." The geometry of gender is encoded in the space and applies consistently across concept pairs.

Nobody hand-coded that. The model learned it from patterns in text. And it holds across many kinds of relationships — capitals and countries, verb tenses, comparatives. The structure of meaning shows up as structure in space.
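The arithmetic is easier to believe once you run it. Here is a minimal sketch using toy vectors written by hand for illustration. They do not come from a real model, and the `vectors` table and the `analogy` helper are assumptions made up for this example. Real embeddings have hundreds of dimensions, but the steps are the same: subtract, add, then look for the nearest remaining word.

```python
import math

# Toy 3-number vectors, hand-written for illustration only (not model output).
# Roughly: first number ~ "royalty", second ~ "gender", third ~ "everything else".
vectors = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.7, 0.2],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.8, 0.1],
    "apple": [-0.5, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

nearest = analogy("king", "man", "woman")
print(nearest)  # queen
```

Leaving the three input words out of the candidate list is a common convention for this kind of demo. Without it, the nearest vector is often one of the inputs.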

import numpy as np

# NOTE: embed() is a placeholder for whatever embedding model you use.
# It is assumed to take a string and return a list of floats.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed a collection of texts
documents = [
    "the cat sat on the mat",
    "a feline rested on the rug",
    "machine learning optimizes parameters",
]
doc_embeddings = [embed(doc) for doc in documents]

# Score every document against a query, then pick the closest one
query_vec = embed("a cat lying down")
scores = [cosine_similarity(query_vec, e) for e in doc_embeddings]
best_match = documents[int(np.argmax(scores))]
# Illustrative values, not real model output: scores ≈ [0.91, 0.87, 0.22]
# The first two are semantically close to the query; the third is unrelated

“Keyword search finds words. Semantic search finds meaning. Embeddings are why those two things are different now.”

A search for "rental car" can return results about "vehicle hire" — the query and the document live near each other even when they share no words.


[Interactive word map. Word groups shown: king, queen, royal, crown; cat, dog, pet, animal; python, code, function, algorithm; joy, love, warmth, hope.]

Words with similar meanings end up neighbours. This is why AI can search by meaning, not just keywords.

Why this matters for AI tools

Embeddings are the foundation of most of what feels "smart" about modern AI search and retrieval. When you upload a document and an AI answers questions about it accurately, something like this happened: your document was chunked and embedded, your question was embedded, and the chunks whose embeddings landed closest to your question's embedding got handed to the model.

This is why retrieval-augmented generation (RAG) works. The retrieval step, finding the right chunks, is a search through embedding space.
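The chunk, embed, and rank flow can be sketched in a few lines. Everything named here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in so the block runs without a model, and `chunk` and `retrieve` are hypothetical helpers. A real system would swap in an actual embedding model, which is what lets it match on meaning rather than on shared words.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk(text: str, size: int = 2) -> list[str]:
    """Split text into chunks of `size` sentences each."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the question's vector."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]

document = (
    "Refunds are issued within 14 days of purchase. "
    "Contact support to start a refund. "
    "Shipping takes three to five business days. "
    "International orders may take longer."
)
chunks = chunk(document)
top = retrieve("How long does shipping take?", chunks, k=1)
print(top[0])  # the chunk about shipping, which would be handed to the model
```

The word-count stand-in only matches on shared words, so it shows the plumbing, not the semantics. The ranking step stays the same when a real model produces the vectors.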

It's also why semantic search feels different from old-school keyword matching: you search for "how to deal with a difficult manager" and surface a note about "navigating a tough workplace relationship" — same neighborhood, totally different words.

Modern sentence embedding models can place not just words but whole paragraphs into the same space, which is how "talk to your docs" became a one-weekend project.


What embeddings can and can't do

They capture semantic similarity well. General-purpose embedding models are good at measuring whether two pieces of text are about the same thing. Specialized models exist for code, multilingual content, and long documents — each trained on the kinds of text they're meant to handle.

They're not magic. An embedding model has no special knowledge about your domain. If your documents use internal jargon that doesn't appear in its training data, the embeddings may not capture what you actually mean. Fine-tuned embedding models exist for exactly this reason.

"Cosine similarity" sounds intimidating. It's literally: how much do these two arrows point in the same direction? That's the whole metric.
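To make the arrow picture concrete, here is a small sketch with two-number vectors you can draw on paper. The vectors are made up for illustration. Note that the length of an arrow does not change the score, only its direction does.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

same = cosine_similarity([1.0, 0.0], [3.0, 0.0])           # 1.0: same direction, different length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 2.0])  # 0.0: at right angles
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])      # -1.0: pointing opposite ways

print(same, perpendicular, opposite)
```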

Dimensionality matters. Larger embedding vectors (more numbers) can capture more nuance, but take more memory and compute to search. Most production systems balance somewhere between a few hundred and a few thousand dimensions, depending on the latency-vs-accuracy tradeoff they're willing to make.
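The memory side of that tradeoff is simple arithmetic. Here is a rough sketch, assuming each number is stored as a 32-bit float (4 bytes) and ignoring any index overhead or compression. The dimension counts and the `raw_index_bytes` helper are hypothetical examples, not recommendations.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored uncompressed as float32

def raw_index_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for the vectors alone, with no index structures on top."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32

# Hypothetical collection of one million chunks at several vector sizes
for dims in (384, 768, 1536, 3072):
    gigabytes = raw_index_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {gigabytes:.2f} GB")
```

At one million vectors, that works out to roughly 3 GB at 768 dimensions and roughly 6 GB at 1536. Doubling the dimensions doubles the storage, and a brute-force search does about twice the arithmetic per comparison.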

Semantic ≠ what you meant. Two sentences can be "semantically similar" in a way that doesn't help you. "This product is amazing" and "This product is terrible" are about the same thing — and embedding-similar. If you need polarity, you need more than embeddings.


The coordinates are arbitrary. The distances are not. That gap — between the representation and the meaning it encodes — is where most of the interesting questions about what AI "understands" actually live.

For now, the practical outcome is this: text can be placed in a space where proximity means something, and that space can be searched. It's a small, strange, quietly powerful idea.
