For embeddings, the key metrics are embedding dimension size (higher = more expressive) and max input tokens per text chunk. Cohere offers a free trial tier for its embed models. NVIDIA NIM provides free embedding endpoints. Cloudflare Workers AI includes embedding models in its free tier. These are not chat models — they convert text to vectors for semantic search and RAG pipelines.
What to Look for in a Embedding Model
Embedding models convert text into fixed-size vectors. Unlike chat models, they don't generate text — they produce numerical representations for comparison and retrieval:
- Embedding dimension — The vector size. Common dimensions: 384 (lightweight), 768 (balanced), 1024–4096 (high precision). Higher dimensions capture more nuance but require more storage and slower similarity search. For most RAG applications, 768–1024 dimensions work well.
- Max input tokens — The longest text chunk the model can embed in one call. 512 tokens handles paragraphs; 8192 tokens handles entire documents. For long documents, you can either chunk them or use a model with a high token limit.
- Normalization — Most embedding models L2-normalize their outputs, making cosine similarity equivalent to dot product. This matters for vector database setup — check if you need to normalize post-hoc.
- Multilingual support — Some embedding models only work well for English. If you need multilingual semantic search, look for models with explicit multilingual training (e.g., Cohere embed-multilingual, multilingual-e5).
- Task-specific embeddings — Some models are optimized for specific tasks: classification, clustering, retrieval, or semantic textual similarity (STS). General-purpose embedders work for most use cases, but check the model card if you have a specialized need.
How to Choose a Free Embedding Model
Embedding model selection depends on your retrieval workflow:
- Building a RAG system? → Prioritize retrieval quality. Cohere Embed (via free trial) and NVIDIA NV-Embed are strong choices. Use 768+ dimensions for good recall.
- Semantic search over documents? → Max input tokens matters — you want to embed large chunks without splitting mid-paragraph. NVIDIA NIM embedding models typically support 512–2048 token inputs.
- Multilingual search? → Cohere embed-multilingual-v3 covers 100+ languages. Check if your languages are supported before committing.
- Lightweight / edge deployment? → Cloudflare Workers AI offers embedding models (bge-base-en-v1.5, 768 dims) with 10K free requests/day. Good for smaller projects.
- Classification or clustering? → General-purpose embedders work well. You can always fine-tune on your labeled data later.
Note: embedding APIs have different authentication and SDK setups than chat APIs. NVIDIA NIM uses the same API key for both. Cloudflare Workers AI uses a different endpoint pattern. Always check the model detail page for exact base URL and code examples.
Top Picks for Embedding
4B parameter embedding model, strong retrieval performance. 40 RPM, no daily cap.
Cohere: Embed v3 (free trial) CohereIndustry-leading embedding quality. Free trial tier with rate limits. Multilingual support.
BGE Base EN v1.5 Cloudflare Workers AI768 dimensions, solid retrieval. 10K free requests/day on Cloudflare.
NVIDIA: Llama Nemotron Embed VL 1B V2 (free) OpenRouterVision-language embedding model. Embed both text and images in the same vector space.
All Free Embedding Models
| Provider | Model | Context | Max Output | Modality | Rate Limit | Released | |
|---|---|---|---|---|---|---|---|
| Cohere | Embed 4 | 131K | 131K | 2,000 inputs/min | — | Details | |
| Cohere | Rerank 3.5 | 131K | 131K | 10 RPM | — | Details | |
| OpenRouter | NVIDIA: Llama Nemotron Embed VL 1B V2 (free) | 131K | 8K | See provider page | Feb 25, 2026 | Details |