Choosing an Embedding Model: Dimension, Quality, Domain, Cost
The embedding model you pick is the map your retrieval system lives on. This lesson walks through the real tradeoffs: dimension count, benchmark scores, domain fit, multilingual coverage, cost, and the migration problem nobody plans for.
Why this choice is load-bearing
Every vector in your index was produced by some specific embedding model. Change the model and you do not just improve quality a bit; you move to a different space where none of your old vectors are valid. The geometry is different and the distances are different, so a query embedded in the new space cannot be compared against documents embedded in the old space.
This makes the embedding model a semi-permanent choice. You can swap your LLM without re-generating anything; you can change your vector database with an export and an import; you cannot swap your embedding model without re-embedding your entire corpus. For a small corpus that is an hour of work. For a corpus with hundreds of millions of chunks that is a coordinated migration.
The practical consequence: pick once, but pick carefully. The rest of this concept walks through the axes along which models actually differ, because the marketing material will not help you here.
What the dimension count actually buys you
Modern embedding models come in a few standard sizes: 384, 768, 1024, 1536, 3072 dimensions. Larger models produce larger vectors. The natural intuition is that more dimensions equals higher quality. That is roughly true but with three caveats that change your choice.
Diminishing returns. Going from 384 to 768 dimensions on a good model might improve retrieval quality by 5 to 8 percent. Going from 1536 to 3072 might add 1 to 2 percent. The quality gains flatten quickly, while the storage and query costs keep growing linearly.
Dimension count is coupled to model family, not independent of it. A 768-dimensional model from one vendor may outperform a 1536-dimensional model from another on your data, because the training corpus and objective matter more than the output size. Do not compare models on dimension count alone; compare them on the benchmark that matches your use case.
Storage cost at scale is real. At 1536 float32 dimensions, each vector is 6 KB. One million chunks is 6 GB; one hundred million is 600 GB. Halving the dimension count halves the storage bill. For a RAG system where the bulk of your cost is vector store storage, this math matters. For a small-corpus feature where you have ten thousand chunks, it does not.
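The storage arithmetic above can be checked with a few lines of Python. This is a minimal sketch that counts only the raw float32 payload; the function name is illustrative, and real vector stores add index structures, metadata, and replicas on top of this figure.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload only; ignores index overhead, metadata, and replicas."""
    return num_vectors * dims * bytes_per_value


# Compare the common dimension sizes for a corpus of 100 million chunks.
for dims in (384, 768, 1536, 3072):
    gb = index_size_bytes(100_000_000, dims) / 1e9  # decimal gigabytes
    print(f"{dims:>5} dims: {gb:,.0f} GB for 100M float32 vectors")
```

In decimal units the 1536-dimension row comes out at about 614 GB, which the text rounds to 600 GB. Each halving of the dimension count halves the figure.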
A useful shape: start with the mid-range option from a reputable provider (typically 768 or 1024), measure quality on your data, and only reach for the larger model if the measurements say you need it.
MTEB and how to read benchmarks
The Massive Text Embedding Benchmark (MTEB) is the standard leaderboard for embedding models. It aggregates scores across dozens of tasks, retrieval being one of them. Any serious model provider publishes MTEB numbers.
Reading MTEB requires care. The aggregate score covers many task types (classification, clustering, reranking, retrieval, semantic similarity) that weight dimensions of quality you may not care about. A model at rank 3 overall might be at rank 1 for retrieval, which is usually what you actually need. Always look at the retrieval-specific score, not the overall number.
More importantly, the main MTEB leaderboard is built largely on English general-purpose text. If your domain is medical, legal, code, multilingual, or conversational, MTEB ranks do not predict your results well. A model that dominates MTEB on Wikipedia-style text can be mediocre on clinical notes, because it was never trained on that kind of writing.
The honest test is your own data. Build a small labeled set, two hundred query-document pairs where you know the correct answer, and measure each candidate model on that set. Use recall@k (how often the correct document is in the top k) as your primary metric. This takes an afternoon and is more trustworthy than any leaderboard.
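The evaluation loop described above is small enough to sketch. The version below assumes each labeled query has exactly one correct document ID, and that each candidate model is wrapped in a hypothetical `search(query, k)` function that embeds the query and returns ranked document IDs from an index built with that same model. `toy_search` is a canned stand-in so the example runs without any model.

```python
from typing import Callable, Dict, List


def recall_at_k(
    labeled_pairs: Dict[str, str],
    search: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose known-correct doc ID appears in the top k results."""
    if not labeled_pairs:
        raise ValueError("labeled_pairs is empty")
    hits = 0
    for query, correct_id in labeled_pairs.items():
        if correct_id in search(query, k)[:k]:
            hits += 1
    return hits / len(labeled_pairs)


# Canned results standing in for "embed the query, search the index".
def toy_search(query: str, k: int) -> List[str]:
    canned = {
        "reset password": ["doc-7", "doc-2", "doc-9"],
        "refund policy": ["doc-4", "doc-1", "doc-3"],
    }
    return canned.get(query, [])[:k]


labeled = {"reset password": "doc-2", "refund policy": "doc-3"}
for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(labeled, toy_search, k):.2f}")
```

To compare candidates, run the same labeled set through one `search` wrapper per model and compare the numbers at the k your application actually uses.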
Domain fit beats benchmark rank
A general-purpose embedding model is trained on web text, books, forums, roughly the kind of writing you find on the public internet. For most general use cases, this is fine. For specialized domains it is not.
Medical text is the canonical example. The phrase "MI" in a clinical note means myocardial infarction; in general text it is the state abbreviation for Michigan, a unit of distance, or nothing. A general-purpose model sees the abbreviation and has no strong signal for what it means. A medical-domain model trained on PubMed abstracts and clinical notes has a coherent geometry in that neighborhood. The difference in retrieval quality on medical data can be large.
Code is another example. A general model handles natural language well but lumps together functions that have different behavior if their comments or variable names look similar. A code-specific embedding model (e.g., from a provider that fine-tunes on GitHub) knows that two functions with identical bodies but different comments should be close, while two with similar comments but different bodies should not.
Legal, financial, scientific, internal-jargon-heavy enterprise corpora are all similar stories. If you are in a specialized domain, the question to ask is whether a domain-specific embedding model exists and whether the benefit outweighs the downside of fewer provider choices and less frequent updates. Sometimes a general model plus good chunking beats a mediocre domain model; sometimes the domain model wins outright. You have to measure.
Multilingual: one model or many
If your content spans languages, you have two shapes to choose between.
One multilingual model. A single model is trained on many languages, so a query in French and a document in Spanish can still be compared in the same space. The quality per language is usually lower than a dedicated monolingual model would achieve, but the operational simplicity is a big win: one index, one pipeline, one set of thresholds.
One model per language. Better per-language quality, but you now need to route queries to the right model, you cannot compare across languages easily, and your index is partitioned by language. For a user who searches in English and expects French documents to show up in the results, this is a poor shape.
For most production systems serving a multilingual user base, a single high-quality multilingual model is the right default. The exceptions are domains where quality per language is make-or-break (e.g., legal translation services) and the operational complexity of per-language pipelines is worth it. For internal tools, productivity tools, customer support, and retrieval on mixed-language corpora, the multilingual model wins on simplicity and the quality gap is usually acceptable.
Cost: the dimension people forget
Embedding costs break into three parts, and people often remember only the first one.
Per-call API pricing. Providers charge per token embedded. For a corpus of a hundred million chunks averaging 300 tokens each, one pass through a mid-priced embedding model is in the low thousands of dollars. This is often a one-time cost, or a periodic one as you re-index. Cheap models can be ten times cheaper than premium ones; if your quality measurements show the cheap model is good enough, the savings on initial indexing alone justify measuring.
Storage cost in the vector store. Already discussed above: larger dimensions cost more to store. For a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud, etc.), this shows up as a per-vector or per-GB monthly bill that dominates the budget at scale.
Query-time latency cost. To a first approximation, the cost of a distance computation grows linearly with dimension: a 3072-dimensional cosine comparison does four times the arithmetic of a 768-dimensional one. For a retrieval system serving thousands of queries per second, this is real latency budget and real CPU cost. Doubling the dimension count can double your retrieval infrastructure cost.
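Two of these cost parts reduce to simple arithmetic, sketched below. The per-million-token prices are hypothetical placeholders, not quotes from any provider, and the linear scaling in `relative_scan_cost` is the first-order assumption from the text rather than a measurement of any particular index.

```python
def embedding_pass_cost_usd(
    num_chunks: int, avg_tokens_per_chunk: int, usd_per_million_tokens: float
) -> float:
    """API cost of embedding the whole corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


def relative_scan_cost(dims: int, baseline_dims: int) -> float:
    """Relative cost of one distance computation, assuming linear scaling in dimension."""
    return dims / baseline_dims


# Hypothetical prices for a cheap and a premium model (assumptions, not real quotes).
for label, price in (("cheap", 0.01), ("premium", 0.10)):
    cost = embedding_pass_cost_usd(100_000_000, 300, price)
    print(f"{label}: ${cost:,.0f} per full pass over 100M chunks")

print(f"3072 vs 768 dims: {relative_scan_cost(3072, 768):.0f}x per comparison")
```

Plugging in your provider's actual price and your own chunk statistics turns this into a quick budget check before committing to a re-index.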
Treat the choice of model as a cost decision as much as a quality one. A common pattern: use a smaller, cheaper model for initial retrieval of a wider set of candidates, then use a more expensive reranker on the shortlist. The reranking concept in Module 7 covers this in depth. For now, the point is that model choice and system architecture are intertwined; a cheaper retrieval layer may actually produce better end-to-end quality if it lets you spend the budget on reranking.
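The retrieve-then-rerank pattern can be sketched in a few lines. Both scorers below are toy stand-ins chosen so the example runs on its own: `shared_words` plays the role of the cheap embedding similarity and `jaccard` the role of the expensive reranker. In a real system these would be a vector search and a reranking model; all names here are illustrative.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    shortlist_size: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Score everything with the cheap scorer, then rerank only the shortlist."""
    shortlist = sorted(
        documents, key=lambda d: cheap_score(query, d), reverse=True
    )[:shortlist_size]
    reranked = sorted(
        shortlist, key=lambda d: expensive_score(query, d), reverse=True
    )
    return reranked[:final_k]


def shared_words(query: str, doc: str) -> float:
    """Toy cheap scorer: number of words the query and document share."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy expensive scorer: word-set overlap divided by word-set union."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "holiday schedule for the support team",
    "password policy for contractors",
    "reset your password from the login page",
]
top = retrieve_then_rerank(
    "how to reset password", docs, shared_words, jaccard, shortlist_size=2, final_k=1
)
print(top)
```

The expensive scorer only ever sees `shortlist_size` documents, which is what makes it affordable to spend more per comparison there.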
Measure twice
Model choice is semi-permanent because re-embedding an entire corpus is expensive and disruptive. Pick with the three-year horizon in mind, not next week's demo.
Hosted vs. self-hosted
You can get embeddings from a hosted API (OpenAI, Voyage, Cohere, Google) or run an open-weights model yourself (sentence-transformers family, BGE, E5, Nomic). Both paths work. The decision comes down to three factors.
Data sensitivity. Hosted APIs receive a copy of every text you embed. For a medical, legal, or regulated-enterprise use case, this is often a non-starter, and self-hosting is the only option. For general content, the provider's data policies may be fine; check what they offer for enterprise tiers (zero retention, regional processing).
Cost curve. Hosted APIs are cheap to start and expensive at scale. Self-hosted is expensive to start (GPU or optimized CPU server, plus the engineering to keep it running) and cheap at scale. The crossover point depends on your volume; a rough rule is that embedding tens of millions of tokens per day pushes you toward self-hosted, while under a million per day hosted is almost always cheaper when you account for engineering time.
Quality and convenience. The top hosted models are usually slightly ahead of the best open-weights models in aggregate quality, though the gap has narrowed sharply. Hosted APIs handle batching, scaling, updates, and retries for you. Self-hosted means you run a service, and that service must be as reliable as the rest of your LLM infrastructure. For many teams, the operational overhead of self-hosting is the hidden cost that tips the decision back toward hosted.
Most teams start hosted, pay attention to the bill, and move to self-hosted only when a specific threshold (cost, data, or latency) forces the switch. That is the right order.
The migration problem every team underestimates
A year into running your retrieval system, you will want to change the embedding model. Maybe a better model came out. Maybe your domain shifted. Maybe the provider deprecated the version you are on. This is not a question of if but when.
Migration has three pieces that need to be planned together.
Re-embed the corpus with the new model. For a large corpus this is a batch job that takes hours or days and costs whatever your provider charges per token. Plan it carefully; a failed batch in the middle is painful to restart if your tracking is not ready.
Switch the query path to the new model. Queries must be embedded through whichever model produced the index you are searching. During the migration, you will have both an old index and a new index, and you need a feature flag or traffic split that routes each query through the matching model. Running one query through both models (shadow mode) lets you measure the quality difference on live traffic before cutting over.
Delete the old index. Cheap to forget, expensive to keep. The old vectors are paying storage rent for nothing once the new index is live; have a date on the calendar to delete them.
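One way to keep the query path honest during the crossover is to bundle each index with the model that produced it, so a query can never be embedded by one model and searched against the other's vectors. The sketch below assumes a percentage-based traffic split keyed on a user ID; `IndexRoute`, `pick_route`, and `shadow_compare` are hypothetical names, and the lambdas stand in for real embedding and search calls.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class IndexRoute:
    """An index paired with the only model allowed to embed queries for it."""
    model_version: str
    embed: Callable[[str], List[float]]
    search: Callable[[List[float], int], List[str]]


def pick_route(
    user_id: str, routes: Dict[str, IndexRoute], new_traffic_percent: int
) -> IndexRoute:
    """Stable split: the same user always lands in the same bucket (0-99)."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return routes["new"] if bucket < new_traffic_percent else routes["old"]


def answer(query: str, route: IndexRoute, k: int = 5) -> List[str]:
    """Embed and search through the same route, never across routes."""
    return route.search(route.embed(query), k)


def shadow_compare(query: str, routes: Dict[str, IndexRoute], k: int = 5) -> Dict[str, List[str]]:
    """Shadow mode: run one query through both paths and return both result lists."""
    return {name: answer(query, route, k) for name, route in routes.items()}


# Stand-in routes; real ones would call an embedding model and a vector store.
routes = {
    "old": IndexRoute("model-v1", lambda text: [0.0], lambda vec, k: ["old-doc"][:k]),
    "new": IndexRoute("model-v2", lambda text: [1.0], lambda vec, k: ["new-doc"][:k]),
}
route = pick_route("user-123", routes, new_traffic_percent=10)
print(route.model_version, answer("reset password", route))
print(shadow_compare("reset password", routes))
```

Raising `new_traffic_percent` from 0 to 100 is the cutover; once it sits at 100 and the shadow comparisons look good, the old route and its index can be deleted.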
The piece that trips up most teams is not the technical migration but the version tracking. If the record for each vector does not say which model version produced it, you cannot tell which vectors are old and which are new during the crossover period. Store the model version and the corpus version on every vector from day one. A later concept on re-embedding and versioning covers the full playbook; the takeaway here is that your schema should anticipate this the day you pick your first model, not the day you need to change it.
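A minimal sketch of what that schema might look like, assuming vectors are stored as records with metadata alongside them. The field names and version strings are illustrative, not a convention of any particular vector store.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VectorRecord:
    chunk_id: str
    vector: List[float]
    model_version: str   # which embedding model produced this vector
    corpus_version: str  # which snapshot of the source content it came from


def stale_records(
    records: List[VectorRecord], current_model_version: str
) -> List[VectorRecord]:
    """Records that still need re-embedding before the old index can be dropped."""
    return [r for r in records if r.model_version != current_model_version]


records = [
    VectorRecord("chunk-1", [0.1, 0.2], "model-v1", "2024-01"),
    VectorRecord("chunk-2", [0.3, 0.4], "model-v2", "2024-01"),
    VectorRecord("chunk-3", [0.5, 0.6], "model-v1", "2024-02"),
]
print([r.chunk_id for r in stale_records(records, "model-v2")])
```

With the version stored per record, a restarted batch job can pick up exactly where it stopped by asking for whatever is still stale.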
A practical decision checklist
When you are making the choice, walk through these questions in order.
What is my dominant content type: general text, code, specialized domain? If specialized, does a domain-specific model exist? If yes, shortlist it even if its MTEB rank is lower than a general model.
Does my system serve multiple languages? If yes, prefer a multilingual model unless per-language quality is a differentiated business requirement.
Can I send this content to a third-party API? If no, self-hosted is the only option, and the choice narrows to what runs on your infrastructure.
What is my corpus size today and in a year? Small corpus, cheap models are fine. Large corpus, model choice dominates storage cost and re-embedding cost.
Have I measured the top 3 candidates on 200+ of my own query-document pairs? If no, do this before committing.
Have I picked a dimension size that matches my latency and storage budget? The biggest model is not always the right one even if it is slightly more accurate.
Is my schema tracking model version per vector? If no, fix this before the first vector lands in the index.
Working through that list is not glamorous. It is the difference between a retrieval system you are happy with in year two and one you are planning a painful migration out of.
- The embedding model is the map your retrieval system lives on. Changing it means re-embedding the entire corpus.
- More dimensions give diminishing returns on quality and linear growth in storage and query cost. Mid-range options are usually the right default.
- MTEB is a useful leaderboard but does not predict your specific domain. Measure on your own data with your own queries.
- Domain fit (medical, legal, code, multilingual) often beats benchmark rank. Specialized models exist for a reason.
- Three cost axes: per-token API pricing, vector storage, query-time compute. All three scale with volume; storage and query compute also scale with dimension.
- Hosted APIs are the default starting point. Move to self-hosted when data sensitivity, scale, or latency force it.
- Plan the migration path from day one: tag every vector with its model version, reserve capacity for dual-index crossover, schedule old-index deletion.