
Multimodal Models

When AI learned to see, listen, and read — at the same time, in the same head

Mankaran Singh·Updated May 17, 2026

Shows up in: Medicine & Healthcare, Film & Video Production, Architecture & Urban Design, Music & Audio

You might think: Multimodal models 'see' images the way we do. All modalities in a multimodal model are equally well-trained. Multimodal means the model can do anything in any format.

For the first few years of the modern AI era, language models were exactly that — language models. Text in, text out. You typed. They wrote.

Then came the obvious question: why should intelligence be constrained to one channel? Humans don't experience the world through words alone. We see, hear, read, and speak in a single integrated stream of cognition. A doctor reading an X-ray and a doctor reading a patient note are using the same underlying capacity. Why should AI be different?

If you've ever pasted a screenshot of an error into ChatGPT and asked "what's wrong with this code," you've used a multimodal model — even if nobody told you that's what was happening.

Multimodal models are the answer. They accept multiple types of input (images, audio, video, documents) and can reason across all of them within a single response. GPT-4o, Gemini, and Claude (through its vision capabilities) are all multimodal systems.


How images get into a language model

A language model processes tokens. Images are not tokens. Something has to translate.

The standard approach uses a separate vision encoder, typically a Vision Transformer (ViT)[·], to convert an image into a series of numeric embeddings. These image embeddings get projected into the same vector space as text tokens, so the language model sees them as "visual tokens" sitting alongside the regular word tokens.

Image pixels
    ↓
Vision encoder (ViT)
    ↓
Image patch embeddings
    ↓
Projection layer (same dimensionality as text tokens)
    ↓
Language model processes [text tokens] + [image tokens] together
    ↓
Text output

CLIP[·] trained a vision encoder and a text encoder to land in the same embedding space, the move that made much of modern vision-language modelling possible.
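
As a rough illustration of what a shared embedding space makes possible, the sketch below ranks a few captions against an image by cosine similarity. The vectors are hand-picked 4-dimensional toy values, not real CLIP outputs, and `best_caption` is a hypothetical helper introduced only for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is most aligned with the image embedding."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )

# Toy embeddings (assumed values, chosen by hand for illustration only).
image_of_dog = [0.9, 0.1, 0.0, 0.2]
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3, 0.0],
    "a bowl of soup": [0.0, 0.1, 0.9, 0.4],
}

print(best_caption(image_of_dog, captions))
```

Because both kinds of input live in one space, "which caption fits this image" reduces to a nearest-neighbour lookup. Real encoders produce vectors with hundreds of dimensions, but the comparison works the same way.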

The key insight is that images get divided into patches (typically 14×14 or 16×16 pixels each), each patch becomes a token, and those tokens are processed by the attention mechanism alongside the text. The model learns, through training on image-text pairs, what these visual tokens mean in relation to language.
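
The patch step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's preprocessing: `patchify` and `num_visual_tokens` are hypothetical helpers, the image is a tiny grayscale grid, and the 224×224 input size in the last line is an assumed example resolution.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order. Assumes H and W are exact
    multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

def num_visual_tokens(height, width, patch_size):
    """One token per patch, before any special tokens or pooling a real model may add."""
    return (height // patch_size) * (width // patch_size)

# A toy 4x4 grayscale "image" cut into 2x2 patches gives 4 patches of 4 pixels each.
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = patchify(toy_image, 2)
print(toy_patches[0])  # top-left patch: [0, 1, 4, 5]

# An assumed 224x224 input with 16x16 patches yields 14 * 14 = 196 visual tokens.
print(num_visual_tokens(224, 224, 16))
```

In a real encoder each flattened patch would then be multiplied by a learned projection matrix to become an embedding. That step is omitted here because the weights come from training.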


What multimodal models can actually do

The capabilities that emerge are wider than "describe this image."

Visual reasoning. "Here is a graph of sales data. What trend is most concerning?" The model isn't just reading a caption — it's interpreting a chart, relating the visual information to a question, and producing analysis.

Document understanding. A photographed invoice, a handwritten note, a PDF with complex tables — multimodal models can read layouts that pure OCR struggles with, because they understand context, not just character sequences.

Code from screenshots. "Here's a screenshot of a UI I want to build. Write the HTML and CSS." This is a pattern developers actually use, daily.

Visual QA in domain contexts. Medical imaging, satellite interpretation, quality control in manufacturing — in each case, the model is answering questions about what it sees.

Interleaved content. You can pass a document that mixes images and text — a research paper with figures, for example — and the model processes them together rather than treating text and images as separate inputs.

Multimodal isn't just "also accepts images." It's one reasoning system where visual and textual information inform each other in the same pass.

Audio and video

Image understanding was the first frontier. Audio and video are harder.

Audio requires its own encoder, similar in concept to the vision encoder but operating on spectrograms or waveforms. GPT-4o and Gemini can accept audio directly. Models trained on audio can transcribe, identify speakers, describe ambient sounds, and reason about tone. The voice-to-voice capabilities in GPT-4o build on this: audio goes in, the model processes it as tokens, and audio comes out. OpenAI describes GPT-4o as a single model trained end-to-end across text, vision, and audio, in contrast to earlier voice pipelines that chained separate transcription and speech-synthesis models.

"Real-time voice" sounds magical until you realise the model is still doing the same token prediction underneath — it's just running fast enough on cheap enough hardware to feel like a conversation.

Video is audio and images multiplied by time. A one-minute video at 24 frames per second is 1,440 images. Even with aggressive frame sampling, this is token-intensive. Current approaches include sampling a fixed number of frames across the video, encoding only keyframes, and using specialised temporal models that understand motion rather than treating each frame independently.[·]
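
The arithmetic behind that token pressure, along with the simplest of the sampling strategies, can be sketched as follows. The figure of 196 tokens per frame is an assumption carried over from a 224×224 image with 16×16 patches, and both function names are hypothetical.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the clip.

    Each index is the midpoint of one of num_samples equal-length segments.
    """
    if num_samples <= 0 or total_frames <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def video_token_cost(num_frames, tokens_per_frame):
    """Visual tokens needed if every kept frame is encoded independently."""
    return num_frames * tokens_per_frame

FPS = 24
SECONDS = 60
TOKENS_PER_FRAME = 196  # assumed: 224x224 frame, 16x16 patches

total_frames = FPS * SECONDS
print(total_frames)                                      # 1440
print(video_token_cost(total_frames, TOKENS_PER_FRAME))  # 282240

indices = sample_frame_indices(total_frames, 8)
print(indices)  # [90, 270, 450, 630, 810, 990, 1170, 1350]
print(video_token_cost(len(indices), TOKENS_PER_FRAME))  # 1568
```

Under these assumptions, sampling eight frames cuts the cost from roughly 282,000 visual tokens to about 1,600, at the price of missing anything that happens between the sampled frames.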


The limitations that matter in practice

Resolution limits. Vision models process images at a fixed resolution or in fixed-size patches. Very high-resolution images — technical drawings, satellite imagery — often get downsampled, which can lose detail. Some models tile the image into chunks to preserve more detail at the cost of more tokens.
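
The trade-off between downsampling and tiling is easy to put numbers on. The sketch below assumes a hypothetical encoder that spends a fixed 196 tokens on each 512×512 tile. The image size, tile size, and per-tile cost are illustrative values, not the settings of any specific model.

```python
import math

def tile_grid(width, height, tile_size):
    """Columns and rows of tiles needed to cover the image (edge tiles are padded)."""
    return math.ceil(width / tile_size), math.ceil(height / tile_size)

def tiled_token_cost(width, height, tile_size, tokens_per_tile):
    """Total visual tokens when every tile is encoded separately."""
    cols, rows = tile_grid(width, height, tile_size)
    return cols * rows * tokens_per_tile

TILE_SIZE = 512        # assumed tile edge in pixels
TOKENS_PER_TILE = 196  # assumed fixed cost per encoded tile

# Downsampling the whole drawing into a single tile costs one tile's worth of tokens.
downsampled_cost = TOKENS_PER_TILE

# Tiling an assumed 4096x3072 technical drawing keeps detail but needs 8 x 6 tiles.
tiled_cost = tiled_token_cost(4096, 3072, TILE_SIZE, TOKENS_PER_TILE)

print(tile_grid(4096, 3072, TILE_SIZE))  # (8, 6)
print(downsampled_cost, tiled_cost)      # 196 9408
```

With these numbers, tiling costs 48 times as many tokens as downsampling, which is why it tends to be reserved for images where fine detail matters.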

Not truly integrated perception. Current multimodal models process visual inputs through a pipeline, not as a unified perceptual system. The visual information influences the language output, but the model doesn't really "see" the way the marketing copy implies — it's inference-time conditioning, not genuine sight.

Easy way to check: ask a vision model to count something difficult in an image — say, how many windows are on a complicated building. The answers wander. The model isn't really counting, just guessing what number-words plausibly come next.
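
One way to make that check less anecdotal is to ask the same question several times and measure how much the answers spread. The sketch below assumes you wrap your own model call in a function that takes a question and returns an answer. `ask_model` is a hypothetical callable, and the fake model here just replays canned numbers so the example runs on its own.

```python
from collections import Counter

def answer_spread(ask_model, question, trials=5):
    """Ask the same question several times and summarise how much the answers vary."""
    answers = [ask_model(question) for _ in range(trials)]
    counts = Counter(answers)
    most_common_answer, frequency = counts.most_common(1)[0]
    return {
        "answers": answers,
        "distinct": len(counts),
        "agreement": frequency / trials,
        "most_common": most_common_answer,
    }

def make_fake_model(canned_answers):
    """Stand-in for a real vision model: replays a fixed list of answers in order."""
    replies = iter(canned_answers)
    return lambda question: next(replies)

# Canned answers are invented for illustration, not measured from any model.
fake_model = make_fake_model([42, 38, 42, 45, 40])
report = answer_spread(fake_model, "How many windows are on this building?")

print(report["distinct"])     # 4 different answers across 5 trials
print(report["agreement"])    # 0.4
print(report["most_common"])  # 42
```

Low agreement across trials is a hint that the model is estimating rather than counting. High agreement does not prove the count is right.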

Hallucination extends to images. Models can confidently describe things in an image that aren't there, especially under ambiguity or at image edges. The same pattern-completion tendency that makes text models hallucinate facts makes vision models hallucinate details.

Video understanding is immature. Describing a single frame and describing what happened over thirty seconds are very different tasks. Most models are better at the former. Temporal reasoning — understanding cause, effect, and sequence in video — remains a weak point.


Why it matters beyond the demos

The interesting question isn't "can the model read an image" — it clearly can. The interesting question is what becomes possible when the bottleneck of "translation into text" disappears.

A doctor can now hand a model an X-ray and a patient history simultaneously. A lawyer can process a contract's scanned pages without converting them first. A factory engineer can point a camera at equipment and describe a problem in natural language. The interface to AI stops requiring that the user first convert their world into words.

That shift — from text interface to perceptual interface — is still early. The models are capable. The workflows that use these capabilities well are being figured out now.

