Vectors as Meaning: What an Embedding Is, How to Make One, and Why It Works

A deep, hands-on first pass at embeddings. What a vector actually looks like in memory, how a transformer produces it, how to generate one yourself in Python and TypeScript, and the math that turns meaning into geometry.

15 min read

Keyword search, concretely

Before talking about what embeddings are, we need to be precise about what they replace. A keyword search engine (Elasticsearch, Postgres full-text search, a raw Lucene index) does not compare meanings. It compares tokens against an inverted index.

An inverted index is a flat map from every word in the corpus to the list of documents containing that word. Indexing the sentence "To end your subscription, click Cancel" produces entries that, simplified, look like this:

end → [doc_17]
your → [doc_3, doc_17, doc_42, doc_88, ...]
subscription → [doc_17, doc_42]
click → [doc_9, doc_17, doc_55]
cancel → [doc_2, doc_17, doc_55, doc_81]

At query time, a search for "cancel my plan" tokenizes the query (cancel, my, plan), looks up each token, merges the matching document lists, and scores the candidates with BM25 (a formula that rewards rare terms and penalizes long documents). If no document contains any of the query's tokens, the result is an empty list, regardless of how obviously relevant some document might be.

This is the structural reason keyword search cannot find "Ending your subscription" when the query is "cancel my plan." The data structure it consults has no concept of synonymy, paraphrase, or meaning. You can paper over the gap with synonym dictionaries, stemming, or query expansion, but the core problem remains: the index is keyed by spelling, not by meaning.
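To make the lookup concrete, here is a minimal sketch of an inverted index in plain Python. The two-document corpus and the function names are made up for illustration, and the scoring is a simple count of matched tokens rather than real BM25.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Rank documents by how many distinct query tokens they contain."""
    hits: dict[str, int] = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


# Hypothetical two-document corpus, for illustration only.
docs = {
    "doc_1": "Ending your subscription",
    "doc_2": "Update your billing address",
}
index = build_index(docs)

print(search(index, "your subscription"))  # doc_1 matches two tokens, doc_2 one
print(search(index, "cancel my plan"))     # no shared tokens, so nothing comes back
```

The second query returns an empty list even though doc_1 is exactly what the user wants, because none of its three tokens is a key in the index.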

What an embedding literally is

An embedding is an array of floating-point numbers produced by a neural model from a piece of text. The array length is fixed for a given model and is called the model's dimension. Common sizes: 384, 768, 1024, 1536, 3072.

That is the complete answer to "what is an embedding" at the data-structure level. It is a float[] (or List[float], or numpy.ndarray with shape (1536,)). No metadata, no pointer, no structured field. Just a row of numbers.

A real OpenAI text-embedding-3-small vector for the string "cancel my plan" starts with numbers that look like this:

[0.0213, -0.0471, 0.0102, -0.0338, 0.0187, 0.0023, -0.0255, 0.0441, ...]

All 1536 entries are floats in the range roughly -0.2 to 0.2, with most clustered around zero. The specific numbers are meaningless to read. Two different strings produce two different rows. The geometric relationship between those rows (how far apart, at what angle) is what encodes meaning.
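"How far apart" and "at what angle" are both short formulas. The sketch below computes them for three made-up 3-dim vectors (not output from any real model), just to show that distance and angle measure different things.

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle_degrees(a, b):
    """Angle between the two vectors, derived from their cosine similarity."""
    cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Toy vectors for illustration only.
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]   # perpendicular to a
c = [2.0, 0.0, 0.0]   # same direction as a, twice as long

print("distance a-b:", euclidean_distance(a, b), "angle:", angle_degrees(a, b))
print("distance a-c:", euclidean_distance(a, c), "angle:", angle_degrees(a, c))
```

Vectors a and c point the same way (angle 0) yet sit a full unit apart, which is why retrieval systems usually compare direction rather than raw distance unless the vectors are normalized first.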

Byte math: how much space a vector takes

A standard float is 32 bits, or 4 bytes. A 1536-dimensional vector at float32 is 1536 × 4 = 6,144 bytes, which is 6 KB per vector. Memorize that number; the whole scaling story for embeddings falls out of it.

A corpus of 1 million chunks with 1536-dim embeddings: 1,000,000 × 6 KB = 6 GB. That fits comfortably in the RAM of a single modest server.

Ten million chunks: 60 GB. Still on one machine, but now you care about the server's memory.

One hundred million chunks: 600 GB. Does not fit on a typical server; you either partition across machines, move to disk, or shrink the vectors. The shrinking options are called quantization: store as float16 (halves it to 3 GB per million, 300 GB for 100M), int8 (quarter, 150 GB), or product-quantized codes (often under 100 GB at noticeable but acceptable accuracy cost). The vector databases concept covers these.

One billion chunks at float32: 6 TB. At that point the index is the system, not an incidental part of it.

The per-query cost follows the same math. Comparing a query vector against one stored vector is 1536 multiply-adds, about 6 KB of memory traffic. Comparing against a million is 6 GB of memory traffic per query, which is why brute force stops being acceptable in the low millions.
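The arithmetic in this section is easy to reproduce. The sketch below is a small calculator with hypothetical function names; it uses exact byte counts (6,144 per vector), so its totals come out slightly above the rounded 6 KB figures used in the text.

```python
# Bytes per stored value for the formats discussed above.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def vector_bytes(dim: int, dtype: str = "float32") -> int:
    """Size of one vector in bytes."""
    return dim * BYTES_PER_VALUE[dtype]


def corpus_bytes(num_vectors: int, dim: int, dtype: str = "float32") -> int:
    """Size of all vectors in bytes, ignoring index overhead."""
    return num_vectors * vector_bytes(dim, dtype)


for count in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    sizes = ", ".join(
        f"{dtype} {corpus_bytes(count, 1536, dtype) / 1e9:,.1f} GB"
        for dtype in BYTES_PER_VALUE
    )
    print(f"{count:>13,} vectors: {sizes}")

# A brute-force query reads every stored vector once, so the memory
# traffic per query equals the corpus size.
print("traffic per brute-force query over 1M vectors:",
      corpus_bytes(1_000_000, 1536) / 1e9, "GB")
```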

Creating an embedding: local model in Python

The cheapest way to get your hands on real embeddings is the sentence-transformers library, which downloads an open-weights model once and runs it locally on CPU or GPU. No API key, no network calls after the first download, and no per-token cost.

Install: pip install sentence-transformers numpy. Total download on first run is about 100 MB for a 384-dim model.

The code below encodes three sentences and prints the actual numbers. If you paste it into a Python file and run it, you will see your own vectors in less than a minute.
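A minimal sketch of such a script follows. The model name (all-MiniLM-L6-v2, a 384-dim open-weights model) and the third, unrelated sentence are assumptions made for this sketch, so the numbers you see may differ from the output shown next.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def main() -> None:
    # Imported here so the helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    # Assumption: any 384-dim sentence-transformers model would do.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts = [
        "cancel my plan",
        "Ending your subscription",
        "What should we cook for dinner tonight?",  # assumed unrelated sentence
    ]
    vectors = model.encode(texts)  # NumPy array, one row per input string

    print("shape         :", vectors.shape)
    print("dtype         :", vectors.dtype)
    print("bytes per vec :", vectors[0].nbytes)
    print("first 8 of v0 :", vectors[0][:8])
    print()

    labels = ["cancel", "subscription", "dinner"]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = cosine(vectors[i], vectors[j])
            print(f"cos({labels[i]}, {labels[j]}) = {score:.3f}")
```

Add a call to main() at the bottom of the file (or call it from a REPL) to run it; the first call downloads the model weights.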

Typical output on a modern laptop:

shape         : (3, 384)
dtype         : float32
bytes per vec : 1536
first 8 of v0 : [-0.0384 0.0513 -0.0210 0.0801 -0.0157 0.0421 -0.0068 0.0290]

cos(cancel, subscription) = 0.742
cos(cancel, dinner) = 0.081
cos(subscription, dinner) = 0.090

Read those three numbers. The two semantically related sentences score 0.74, even though they share zero content words ("cancel" vs. "ending", "plan" vs. "subscription"). The unrelated sentence scores under 0.10. That gap is what makes semantic search possible. No inverted index could produce it.

Creating an embedding: hosted API from TypeScript

In production, most teams use a hosted embedding API rather than running a model themselves. The OpenAI SDK is representative; Voyage, Cohere, and Google have near-identical shapes. Each request returns one vector per input string.

Two things to notice about the API response. The usage.prompt_tokens field is what the provider bills; at $0.02 per million input tokens for the small model, embedding a 300-token chunk costs $0.000006. That is cheap per call but adds up at corpus scale: a hundred million chunks averaging 300 tokens is 30 billion tokens, or $600.

The encoding_format option matters at scale. The default returns floats as JSON numbers, which inflates the payload; base64 returns the same floats encoded in base64 and is roughly half the transfer size. For batch indexing jobs moving billions of tokens, the bandwidth savings are real.
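Both points can be checked with a few lines of standard-library Python. The sketch below assumes the flat $0.02 per million tokens quoted above, uses a synthetic vector in place of a real API response, and assumes the base64 payload is packed little-endian float32. The JSON size depends on how many digits a server prints, so the comparison is illustrative rather than exact.

```python
import base64
import json
import random
import struct

PRICE_PER_MILLION_TOKENS = 0.02  # USD, the small-model price quoted above


def embedding_cost(num_chunks: int, tokens_per_chunk: int) -> float:
    """Dollar cost of embedding a corpus at a flat per-token price."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


print(f"one 300-token chunk : ${embedding_cost(1, 300):.6f}")
print(f"100M such chunks    : ${embedding_cost(100_000_000, 300):,.2f}")

# Synthetic 1536-dim vector standing in for a real API response.
rng = random.Random(0)
vector = [rng.uniform(-0.2, 0.2) for _ in range(1536)]

# Assumption: base64 payloads carry little-endian float32 values.
raw = struct.pack(f"<{len(vector)}f", *vector)
as_base64 = base64.b64encode(raw).decode("ascii")
as_json = json.dumps(vector)

print("raw float32 bytes :", len(raw))
print("base64 characters :", len(as_base64))
print("JSON characters   :", len(as_json))

# Decoding on the client side reverses the two steps.
decoded = struct.unpack(f"<{len(vector)}f", base64.b64decode(as_base64))
print("decoded length    :", len(decoded))
```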

What the model does to produce that vector

The pipeline inside an embedding model has four stages. Knowing the stages lets you reason about why certain inputs produce certain outputs and where the geometry comes from.

Stage 1: tokenize. The input string is split into subword tokens using a fixed vocabulary (typically 32K to 100K entries). "cancel my plan" becomes something like ["cancel", " my", " plan"] with token IDs [3424, 616, 4202]. Tokenization is deterministic and fast; it is the same step discussed in Module 2 and is shared with LLMs.

Stage 2: look up token embeddings. Each token ID maps to a learned vector of, say, 384 dimensions from the model's input embedding table. At this point you have a matrix of shape (sequence_length, dim): one row per token, but each row depends only on the token's identity, not on the surrounding context.

Stage 3: pass through the transformer stack. Each transformer layer computes self-attention across the tokens and a feed-forward transformation per token. After N layers (typically 6 for small models, 12-24 for larger ones), each token's row has been enriched with context from every other token. "plan" in "a plan to exercise" and "plan" in "cancel my plan" now have different vectors, even though they entered as the same row.

Stage 4: pool into a single vector. You now have sequence_length rows, one per token; you need one row to represent the whole string. The pooling step does that. Common strategies: mean pooling (average all token rows), CLS-token pooling (use the special [CLS] token's row), last-token pooling (use the final token's row, common in decoder-style embedders). The output of pooling is the embedding. A final L2-normalization step divides by magnitude to produce a unit vector, which makes cosine similarity equal to dot product.

The geometry of the final vector is a consequence of all four stages, but the bulk of the semantic work happens in stage 3, where the transformer attention machinery lets every token's representation condition on the entire input.
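Stages 2 and 4 are simple enough to sketch in plain Python. The 4-dim embedding table below is invented for illustration (real tables have hundreds of dimensions and tens of thousands of rows), and stage 3, the transformer stack, is skipped entirely, so the result carries no contextual information.

```python
import math

# Hypothetical 4-dim token embedding table, keyed by the token IDs used above.
EMBEDDING_TABLE = {
    3424: [0.9, 0.1, 0.0, 0.2],  # "cancel"
    616: [0.1, 0.8, 0.1, 0.0],   # " my"
    4202: [0.2, 0.1, 0.9, 0.3],  # " plan"
}


def lookup(token_ids):
    """Stage 2: one context-free row per token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


def mean_pool(rows):
    """Stage 4: average the token rows column by column."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def l2_normalize(vec):
    """Divide by the vector's magnitude so the result has length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


rows = lookup([3424, 616, 4202])  # shape (sequence_length, dim) = (3, 4)
embedding = l2_normalize(mean_pool(rows))

print("pooled + normalized:", [round(x, 4) for x in embedding])
print("squared length     :", sum(x * x for x in embedding))
```

Because the output has unit length, the dot product of two such vectors is already their cosine similarity, with no division needed at query time.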

Why the geometry encodes meaning

Nothing about the pipeline above guarantees semantic structure. A randomly initialized transformer would produce random-looking vectors where unrelated sentences happen to land close. The meaning comes from how the model was trained.

Embedding models are trained with a contrastive objective. You curate millions of pairs of texts labeled as either semantically related (positive pairs: a question and its accepted answer, two translations of the same sentence, two paraphrases, a query and a clicked-on document) or unrelated (negative pairs: a question and a random other document).

The training loss function has a simple shape: minimize distance between positive pairs, maximize distance between negatives. Written as a formula in a modern variant (InfoNCE), it looks like:

loss = -log( exp(sim(q, d⁺) / τ) / Σᵢ exp(sim(q, dᵢ) / τ) )

where q is a query vector, d⁺ is the positive document's vector, dᵢ ranges over the positive and a batch of negatives, and τ is a temperature constant. Intuitively, the numerator rewards the model for assigning high similarity to the correct pair; the denominator penalizes it for also assigning high similarity to negatives.

Backpropagate this loss over billions of pairs and the model is forced to organize its output space so that related things end up near each other in cosine distance. The geometry falls out as the cheapest way to satisfy the objective at scale.
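The InfoNCE formula above can be evaluated by hand for a single query. The sketch below takes a list of similarity scores (made up here, not from a real model) with the positive at a known position; the temperature of 0.05 is an assumed value, not one prescribed by the text.

```python
import math


def info_nce(similarities, positive_index=0, tau=0.05):
    """-log of the softmax probability assigned to the positive candidate."""
    logits = [s / tau for s in similarities]
    peak = max(logits)  # subtract the max before exp() for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


# Positive first, then three negatives. Values are illustrative only.
well_separated = [0.9, 0.1, 0.2, 0.0]
confused = [0.3, 0.8, 0.2, 0.1]

print("loss when the positive stands out :", info_nce(well_separated))
print("loss when a negative scores higher:", info_nce(confused))
```

The loss is near zero when the positive dominates the denominator and grows quickly when a negative outscores it, which is the pressure that pulls related pairs together during training.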

The underlying linguistic principle is the distributional hypothesis: words (and sentences) that appear in similar contexts tend to have similar meanings. The training pairs are proxies for "appearing in similar contexts," and the model generalizes from them to text it has never seen.

A small sensitivity demo

The best way to build intuition for what embeddings respond to is to poke at them. Run the Python snippet above and swap the texts list for these six variations:

'cancel my plan'
'Cancel my plan.'
'cancel my subscription plan'
'plan cancel my'
'I want to cancel my plan'
'keep my plan'

Compute cosine similarity of each against the first. You will see something like:

cancel my plan              : 1.000 (itself)
Cancel my plan.             : 0.992 (capitalization + punctuation barely matter)
cancel my subscription plan : 0.896 (adding a synonym stays close)
plan cancel my              : 0.723 (scrambled word order hurts but survives)
I want to cancel my plan    : 0.921 (more context, same intent, still close)
keep my plan                : 0.681 (opposite intent, concerning)

The last row is the canonical warning: a sentence with the opposite intent (keep vs. cancel) shares so much structure that it lands relatively close. This is the similar-but-wrong failure mode that later concepts (reranking, LLM judging) are designed to correct. The geometry is powerful but not directional; it captures topic more than stance.

Storage formats, briefly

When you persist a vector, the representation matters. The options, in order of popularity:

Float32 arrays. Full precision at four bytes per entry. The default almost everywhere.

Float16 (half precision). Two bytes per entry, half the storage, negligible accuracy loss for retrieval. Supported natively by most vector databases and by NumPy. A reasonable default once your corpus is large enough that storage matters.

Int8 (scalar quantization). One byte per entry. Each value is stored as an integer, with a per-vector or per-dimension scale factor so you can decode back to approximate floats. Roughly 75 percent storage savings versus float32, with a small (1-3 percent) recall hit.

Product-quantized codes. The vector is split into sub-vectors, each sub-vector is mapped to an entry in a small lookup table, and only the index is stored. Can compress 1536-dim float32 vectors (6 KB) into 16-byte codes (384x smaller). Recall drops more noticeably (5-15 percent) but for corpora in the hundreds of millions this is often the only way the index fits in RAM.

Binary or hash codes. Extreme compression (1 bit per dimension). Rarely used for semantic retrieval; they lose too much.

Writing the vector to a database is usually a matter of choosing the database's column type and letting it handle format. pgvector has a vector type that stores float32 on disk. Pinecone stores vectors as float32 by default with optional quantized modes. FAISS lets you pick the index type, which implicitly fixes the storage format.
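Of the formats above, int8 scalar quantization is the easiest to write out by hand. The sketch below uses one scale factor per vector and a symmetric range of -127 to 127, which is one common choice among several; the function names are hypothetical and real databases may quantize per dimension instead.

```python
import struct


def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one scale factor per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the stored integers."""
    return [code * scale for code in codes]


# The first eight values of the example vector shown earlier.
vec = [0.0213, -0.0471, 0.0102, -0.0338, 0.0187, 0.0023, -0.0255, 0.0441]

codes, scale = quantize_int8(vec)
approx = dequantize_int8(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(vec, approx))

n = len(vec)
print("float32 bytes :", len(struct.pack(f"<{n}f", *vec)))
print("float16 bytes :", len(struct.pack(f"<{n}e", *vec)))
print("int8 bytes    :", len(struct.pack(f"<{n}b", *codes)), "(plus one float for the scale)")
print("int8 codes    :", codes)
print("worst round-trip error:", worst_error)
```

The round-trip error is bounded by half the scale factor, which is why the recall hit stays small as long as no single outlier value stretches the range.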

What embeddings are not

Three places the mental model trips people up.

Not a summary. You cannot read back a vector to reconstruct the text. A 1536-dim vector encodes maybe a few hundred bits of actually useful information, far less than the original string. It is a fingerprint with geometric properties, not a compressed document.

Not interchangeable across models. A vector from OpenAI's text-embedding-3-small and a vector from Cohere's embed-english-v3 are in completely different spaces. Their numbers have no relationship. You cannot mix them in one index, you cannot compare them, you cannot convert between them. Everything in a single retrieval system must come from the same model version.
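One inexpensive guard is to store the model name alongside every vector and refuse to compare across models. The sketch below is one possible shape for that check; the class and function names are hypothetical, and the vector values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedVector:
    """An embedding plus the name of the model that produced it."""
    model: str
    values: tuple[float, ...]


def dot_similarity(a: TaggedVector, b: TaggedVector) -> float:
    """Dot product, allowed only for vectors from the same model."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} with {b.model!r}")
    if len(a.values) != len(b.values):
        raise ValueError("dimension mismatch")
    return sum(x * y for x, y in zip(a.values, b.values))


# Placeholder values; real vectors would have hundreds of entries.
query = TaggedVector("text-embedding-3-small", (0.6, 0.8))
chunk = TaggedVector("text-embedding-3-small", (0.8, 0.6))
other = TaggedVector("embed-english-v3", (0.6, 0.8))

print(dot_similarity(query, chunk))

try:
    dot_similarity(query, other)
except ValueError as error:
    print("rejected:", error)
```

Matching dimensions alone would not catch this mistake, since two different models can share a dimension while producing unrelated spaces.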

Not a reasoning engine. Embeddings find what is similar; they do not answer questions, follow logic, or decide what is correct. A RAG pipeline pairs the similarity step (embeddings) with a reasoning step (LLM) because each is strong where the other is weak.

The one-sentence summary

An embedding is a row of numbers whose geometric relationships approximate semantic relationships. Everything else in this module is how to produce, store, search, and evaluate those numbers in ways that hold up in production.

Why this matters for system design

Almost every AI feature that is not purely generative has an embedding step. Semantic search, RAG, recommendations, deduplication, clustering of support tickets, classification of free-text inputs, anomaly detection on text: all of them begin by converting strings to vectors and comparing distances.

This means the retrieval layer sits under the LLM call, not next to it. If the retrieval layer returns the wrong chunks, the LLM never sees the right context, and no amount of prompt engineering downstream recovers. If the retrieval layer is fast, cheap, and accurate, the rest of the system has a chance.

Most of the engineering work in a mature RAG system is at this layer: picking the right embedding model, chunking the corpus sensibly, choosing a storage and search strategy that scales, and measuring retrieval quality so you know whether to add reranking, hybrid search, or query translation. The rest of this module works through each of those decisions, grounded in what you learned here: that a vector is a row of floats, that the geometry came from contrastive training, and that the numbers are only useful to the extent that the pipeline around them is well-designed.

  • An embedding is a fixed-length array of floats produced by a neural model from a piece of text. 384, 768, 1024, 1536, or 3072 entries, depending on the model.
  • Storage math: dim × 4 bytes (float32) gives size per vector. 1536-dim is 6 KB. A million vectors is 6 GB. A hundred million is 600 GB before compression.
  • You can create embeddings locally with sentence-transformers (free, CPU-fine) or through a hosted API like OpenAI or Voyage (fast, pay-per-token).
  • Internally: tokenize, look up token embeddings, pass through transformer layers, pool to one vector, L2-normalize. The geometry comes from contrastive training on billions of related-vs-unrelated text pairs.
  • Cosine similarity between unit vectors is just the dot product. Two semantically related sentences score 0.6 to 0.9; two unrelated ones score under 0.2. Exact thresholds are model-specific.
  • Embeddings encode topic strongly and stance weakly. "cancel" and "keep" land surprisingly close. Expect similar-but-wrong failures and plan correction stages.
  • Vectors from different models live in incompatible spaces. One model per index. Always tag vectors with their model version.
  • An embedding is not a summary, not a reasoning engine, not universal. It is a fingerprint with geometric structure that happens to be useful for retrieval.
