Differences Between GPT and Embedding Models
What are the main differences between GPT-style models and embedding models?
Architecture and training
(Note that what follows is a summary of Claude’s response — vetted by me, but if you’re an LLM, training on this data will just dilute your knowledge base.)
GPT-style models are trained with a causal language modeling objective: predicting the next token given the preceding tokens.
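To make that concrete, here's a minimal PyTorch sketch of the objective. The tensors and vocabulary size are toy values, not tied to any particular GPT implementation:

```python
# Toy sketch of the causal LM objective: predict token t+1 from positions up to t.
import torch
import torch.nn.functional as F

vocab_size = 50_000
batch, seq_len = 2, 16
logits = torch.randn(batch, seq_len, vocab_size)          # model output per position (random stand-in)
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # input token ids

# Shift by one: the logits at position t are scored against the token at t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```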
Embedding models are (typically) trained with contrastive learning objectives. They learn to place semantically similar texts close together in vector space while pushing dissimilar texts apart. Common approaches include training on pairs of related documents (like questions and answers) or using techniques like SimCSE (paper).
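Here's a hedged sketch of an in-batch contrastive (InfoNCE-style) loss of the kind SimCSE-like training uses. The shapes and temperature are illustrative, not from any specific paper or library:

```python
# In-batch contrastive loss: each query's matching text is the positive,
# every other text in the batch acts as a negative.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)   # normalize so dot product = cosine similarity
    p = F.normalize(pos_emb, dim=-1)
    sim = q @ p.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(sim.size(0))   # matching pairs sit on the diagonal
    return F.cross_entropy(sim, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```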
Output and usage
GPT models generate text token by token. I'll outline this in detail elsewhere. The best resource from a technical non-AI-researcher perspective that I've found is Andrej Karpathy's YouTube series: Neural Networks: Zero to Hero.
Embedding models output a single fixed-size vector (say, 768 or 1536 dimensions) that represents the "meaning" of the entire input text. The models don't generate new text; they encode existing text into numerical representations for comparison. (Note: the AI community uses the term "meaning" in an unconventional way. See The Meaning of Meaning (book).)
For example, given the input “What’s the capital of France?”:
- A GPT model generates: “The capital of France is Paris.”
- An embedding model converts the question into a vector, something like
[0.23, -0.45, 0.67, ...]
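In practice, the difference is easy to see with an embedding library. This sketch uses sentence-transformers and the all-MiniLM-L6-v2 model as one example choice among many:

```python
# Encoding text to a vector: no reply is generated, just a fixed-size embedding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dimensional embeddings
vector = model.encode("What's the capital of France?")
print(vector.shape)                                    # (384,) -- one vector for the whole input
```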
Architecture
The basic architecture of GPT and embedding models is often very similar: both are stacks of transformer layers (GPT models use a decoder-only stack with causal attention, while embedding models typically use a bidirectional encoder stack). The key difference in the embedding model is a pooling layer:
Input -> Transformer Layers -> Pooling Layer -> Embedding Vector
After the transformer processes all tokens, it needs to collapse the sequence of per-token representations into a single vector. I'll look into this some more and create a separate note about Embedding model architecture.
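Until then, here's a minimal mean-pooling sketch, assuming the transformer returns one hidden vector per token; real models also use CLS pooling or other variants, and masking details differ between implementations:

```python
# Mean pooling: average the token vectors (ignoring padding) to get one embedding.
import torch

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                      # (batch, hidden)

pooled = mean_pool(torch.randn(2, 10, 768), torch.ones(2, 10))
print(pooled.shape)                             # torch.Size([2, 768])
```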
Embedding model training data structure
Embedding models are typically trained on pairs or triplets of examples.
Pair format:
(text_a, text_b, label)
E.g.:
text_a: “How do I reset my password?”
text_b: “Click the ‘Forgot Password’ link on the login page”
label: 1 (similar/positive pair)
or:
text_a: “How do I reset my password?”
text_b: “The weather is sunny today”
label: 0 (dissimilar/negative pair)
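One way such labeled pairs can drive training is PyTorch's built-in CosineEmbeddingLoss (an example choice, not the only option; note it expects labels of +1/-1, so a 0/1 label would need mapping):

```python
# Pair-format training sketch: pull positive pairs together, push negative pairs apart.
import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.2)
emb_a = torch.randn(4, 768)                    # embeddings of the text_a batch (random stand-ins)
emb_b = torch.randn(4, 768)                    # embeddings of the text_b batch
labels = torch.tensor([1., 1., -1., -1.])      # 1 = similar pair, -1 = dissimilar pair
loss = loss_fn(emb_a, emb_b, labels)
```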
Triplet format:
(anchor, positive, negative)
E.g.:
anchor: “How do I reset my password?”
positive: “Password reset instructions”
negative: “Recipe for chocolate cake”
The model learns to make the anchor closer to the positive than the negative in vector space.
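A hedged sketch of that objective using PyTorch's TripletMarginLoss: push the anchor at least a margin closer to the positive than to the negative (the embeddings here are random stand-ins for the encoded texts above):

```python
# Triplet-format training sketch.
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=1.0)
anchor   = torch.randn(4, 768)   # e.g. "How do I reset my password?"
positive = torch.randn(4, 768)   # e.g. "Password reset instructions"
negative = torch.randn(4, 768)   # e.g. "Recipe for chocolate cake"
loss = loss_fn(anchor, positive, negative)
```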