What Is Vector Search? A Plain-Language Guide
What vector search is, how it works, and why it powers smarter AI chatbots. Includes comparisons, real examples, and a guide to choosing the right setup.
If you've built a chatbot that answers questions from your own content, you've already depended on vector search — even if you didn't call it that. It's the layer that decides which chunk of your knowledge base is relevant to a question before an LLM writes the answer. Get it right and your bot sounds like an expert. Get it wrong and it confidently hallucinates things your docs never said.
This guide explains what vector search is, how it differs from keyword search, why it's the backbone of modern retrieval-augmented generation (RAG), and what you actually need to know to choose and configure it well.
What is vector search, exactly?
Vector search is a method of finding information by meaning rather than by matching words.
Here's the core idea. Any piece of text — a sentence, a paragraph, a product description — can be converted into a list of numbers called an embedding (or vector). These numbers aren't arbitrary; they encode the semantic content of the text. Two sentences that mean the same thing, even if they use completely different words, end up with vectors that are close together in a high-dimensional space.
When a user asks a question, their question is also converted into a vector using the same embedding model. The search system then finds the stored text chunks whose vectors are closest to the question vector. Close in this mathematical sense means similar in meaning.
That's it. Vector search is proximity search on meaning-space representations of text.
The practical payoff: a user can ask "how do I cancel my plan?" and retrieve a document titled "Subscription termination process" — even though those phrases share zero words. This is what separates a genuinely useful support bot from one that constantly says "I'm sorry, I couldn't find that in my knowledge base."
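To make the idea concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-written stand-ins for real embeddings (a real model would produce them, with hundreds or thousands of dimensions), and the document titles other than the one above are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings". A real model would produce these.
documents = {
    "Subscription termination process": [0.9, 0.1, 0.0],
    "Updating your billing address": [0.2, 0.9, 0.1],
    "API rate limits": [0.0, 0.1, 0.95],
}

# Pretend this is the embedding of "how do I cancel my plan?"
query_vector = [0.85, 0.2, 0.05]

def nearest(query, docs):
    """Return the title whose vector is most similar to the query vector."""
    return max(docs, key=lambda title: cosine_similarity(query, docs[title]))

best_match = nearest(query_vector, documents)
```

With these made-up numbers, `best_match` is the termination document even though the query and the title share no words. The match comes entirely from the vectors pointing in a similar direction.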
Why this matters for AI chatbots
Before RAG existed as a pattern, chatbots either hallucinated facts or required engineers to hardcode every possible question and answer. Neither scales. Vector search changed the equation: you index your actual content, and every time someone asks a question the system retrieves the most relevant passages before generating a response. The LLM answers from retrieved evidence instead of guessing.
The accuracy of your chatbot is largely determined not by which LLM you pick, but by how well your retrieval layer surfaces the right content. That retrieval layer is vector search.
How it differs from keyword (lexical) search
Traditional keyword search — think Ctrl+F, or classic SQL LIKE queries, or early Elasticsearch — finds documents that contain the same words you typed. It's fast, deterministic, and works fine when you know exactly what words to look for.
The problem is that language doesn't work that way. Users ask questions in their own words. Your documentation uses the official terminology. The gap between the two is where keyword search fails.
| | Keyword search | Vector search |
|---|---|---|
| Matches on | Exact terms | Meaning / semantic similarity |
| Handles synonyms | Only with synonyms lists | Natively |
| Handles typos | Poorly (needs fuzzy config) | Often fine (embedded similarly) |
| Speed | Very fast | Fast (with ANN index) |
| Cold-start accuracy | High | Depends on embedding model quality |
| Infrastructure | Simple (any DB) | Needs embedding model + vector store |
| Best for | Exact product codes, names, IDs | Natural-language questions, Q&A |
Neither is universally better. A common production pattern is hybrid search: run both in parallel, then merge the two ranked lists with a rank-based formula such as reciprocal rank fusion. You get keyword precision and semantic recall.
Where keyword search still wins
There are categories of query where keyword search is more reliable: exact product SKUs, invoice numbers, proper names, or any verbatim-match case. An e-commerce search for "SKU-9923-B" should go to keyword search. A support question like "why did my payment fail?" should go to vector search. Many production systems route queries based on query characteristics, then merge results.
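One way to picture query routing is a small heuristic that sends identifier-like queries to keyword search and everything else to vector search. The sketch below is an assumption for illustration only: the regular expression and the `route` function are hypothetical, and a real system would tune the pattern to its own identifier formats.

```python
import re

# Hypothetical heuristic: a letter prefix followed by a hyphen and digits
# (e.g. SKU-9923-B), or a run of five or more digits (e.g. an invoice number).
IDENTIFIER = re.compile(r"\b(?:[A-Za-z]+-\d[\w-]*|\d{5,})\b")

def route(query):
    """Return which search backend a query should go to first."""
    return "keyword" if IDENTIFIER.search(query) else "vector"

sku_route = route("SKU-9923-B")
question_route = route("why did my payment fail?")
```

A router like this is deliberately crude. Systems that run both searches and merge the results need it less, because an exact identifier match will surface from the keyword side either way.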
The anatomy of a vector search system
Understanding vector search is easier when you break the system into its four components.
1. The embedding model
This converts text into vectors. The most important rule: every chunk in your database and every incoming query must use the same model. Mixing models silently breaks retrieval because the mathematical spaces are incompatible.
Embedding models are available as hosted APIs or as open-source models you run on your own infrastructure. Hosted APIs are easiest to start with; self-hosted models make sense when you need data privacy, cost control at high volume, or fine-tuning on domain-specific language.
The model determines your vector dimension — typically 768 to 3072 numbers per chunk. Higher dimensions can capture more nuance but require more storage and slightly more compute at query time.
2. The vector store (database)
Stored vectors need an index that enables fast similarity lookups. Options include:
- pgvector — a PostgreSQL extension; your existing Postgres database becomes a vector store. Great starting point for most teams.
- Pinecone — fully managed, no infra to run, expensive at scale
- Qdrant / Weaviate / Milvus — self-hosted or managed, built specifically for vectors
- Chroma — lightweight, open source, popular in prototypes
- Redis / Elasticsearch / OpenSearch — vector support added to existing stores
For most early-stage RAG products, pgvector on your existing database is fine. You already know Postgres, migrations work normally, and you don't need another service. When your collection grows to tens of millions of vectors and latency becomes a concern, evaluate dedicated stores.
3. The similarity metric
When comparing vectors, you need a distance (or similarity) function. The three common choices:
- Cosine similarity — measures the angle between vectors. The standard for text embeddings; unaffected by vector magnitude.
- Dot product — faster to compute, equivalent to cosine when vectors are unit-normalized (most embedding APIs return normalized vectors).
- Euclidean distance (L2) — measures absolute distance. Less common for text; use when your embedding model's documentation recommends it.
Pick whatever your embedding model's documentation recommends. For nearly all text embedding models, cosine similarity is the right default.
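The relationship between the three metrics is easy to check by hand. The following sketch uses plain Python and two-dimensional example vectors chosen for easy arithmetic; the function names are illustrative, not taken from any particular library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

same_direction = cosine_similarity(a, b)  # magnitude is ignored
raw_dot = dot(a, b)                       # magnitude is not ignored
raw_l2 = euclidean_distance(a, b)         # nor here

# After unit-normalizing, the dot product equals the cosine similarity.
unit_a, unit_b = normalize(a), normalize(b)
normalized_dot = dot(unit_a, unit_b)
```

Cosine treats `a` and `b` as identical because they point the same way, while the raw dot product and Euclidean distance both react to the difference in length. Once the vectors are normalized, the dot product and cosine agree, which is why the cheaper dot product is a safe substitute in that case.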
4. The ANN index
Brute-force search compares the query against every stored vector, so its cost grows linearly with the size of the collection. Across millions of vectors that is usually too slow to fit a sub-100ms query budget. Approximate Nearest Neighbor (ANN) indexing trades a small amount of recall for orders-of-magnitude faster lookups.
The most common algorithm is HNSW (Hierarchical Navigable Small World). pgvector supports it. Pinecone, Qdrant, and most dedicated stores default to it. The trade-offs you tune:
- `ef_construction` (as pgvector names it): higher means better index quality, slower build
- `m`: number of bidirectional links per node; higher means better recall, more memory
- `ef_search`: higher means more accurate query-time results, slower queries
Reasonable defaults often land at roughly 95% recall, though the exact figure depends on your data and parameters. For a support bot serving thousands of requests per day, that's more than enough.
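ANN recall is measured against exact search, so it helps to have a brute-force baseline to compare with. Below is a minimal sketch of that baseline and of the recall calculation. The vectors, chunk ids, and the `approx` list (standing in for whatever an ANN index returned) are all made up for illustration.

```python
import heapq

def exact_top_k(query, vectors, k):
    """Brute-force search: score every stored vector and keep the k best.
    Cost grows linearly with the number of stored vectors."""
    def score(item):
        _, vec = item
        return sum(q * v for q, v in zip(query, vec))  # dot product
    best = heapq.nlargest(k, vectors.items(), key=score)
    return [chunk_id for chunk_id, _ in best]

def ann_recall(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [0.6, 0.8],
}
query = [1.0, 0.0]

truth = exact_top_k(query, vectors, k=2)
approx = ["a", "d"]  # hypothetical output of an ANN index for the same query
recall = ann_recall(approx, truth)
```

Here the hypothetical index found one of the two true neighbors, giving a recall of 0.5. Averaging this over a sample of real queries shows whether raising `ef_search` or `m` is worth the extra latency or memory.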
What does vector search do inside a RAG chatbot?
Retrieval-augmented generation (RAG) is the architecture that makes an AI chatbot accurate on your specific content. Vector search is the retrieval layer.
The sequence on every user message:
1. Embed the query. The question goes through the embedding model, producing a vector.
2. Similarity search. The vector store returns the top-k chunks most similar to the query vector.
3. Assemble context. Retrieved chunks (usually 3–8) are assembled into a prompt.
4. Generate. An LLM receives the context plus the original question and writes a grounded answer.
5. Cite. Good systems include source references so users can verify.
Step 2 is the vector search itself. Hallucinations in a RAG chatbot often trace back to a retrieval failure: the right chunk wasn't returned, so the LLM filled the gap with a plausible-sounding answer that wasn't in your docs.
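The retrieval and prompt-assembly part of that sequence can be sketched in a few lines. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model (it captures word overlap only, not meaning), the chunk texts and file names are invented, and the final LLM call is left out.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector.
    It only captures word overlap, not meaning; swap in a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k):
    """Embed the query, then return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c["text"])), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble retrieved chunks, tagged with their sources, into a prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    {"source": "billing.md", "text": "You can cancel your plan from the billing page"},
    {"source": "api.md", "text": "API keys are created in the developer console"},
    {"source": "sso.md", "text": "Single sign-on is available on enterprise tiers"},
]

question = "how do I cancel my plan"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
# Generation and citation would pass `prompt` to an LLM client of your choice.
```

In production the chunk vectors would be computed once at ingestion and stored, not re-embedded on every query as this sketch does for brevity.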
The retrieval-quality chain
Think of it as a chain: better chunking → better embeddings → better retrieval → better answers. Each link matters, but retrieval runs on every single user query. Teams that measure recall regularly catch problems that teams running blind only discover when users complain.
Start free at aleeup.com — Alee handles the embedding pipeline, vector indexing, and retrieval tuning automatically, so you can train a chatbot on your content without building any of this infrastructure yourself.
Chunking strategy: the decision that matters most
Before you can store or search vectors, you need to split your source documents into chunks. This is where most teams make their biggest mistake.
Chunk too large: the retrieved passage contains the answer and a lot of noise. The LLM has to reason over irrelevant content; answers get diluted.
Chunk too small: the answer spans multiple chunks, and your retrieval only surfaces one of them. The LLM writes an incomplete answer.
Common chunking strategies:
- Fixed-size with overlap — split every N tokens, overlap by M tokens between adjacent chunks. Simple, predictable, decent results. Common starting point: 400–600 tokens, 50–100 overlap.
- Sentence-window chunking — embed individual sentences, but at retrieval time return the surrounding window of sentences. Better semantic boundaries.
- Recursive character splitting — split at paragraph breaks, then sentence breaks, then character count. Respects natural document structure.
- Semantic chunking — compute embeddings for sentences, then split wherever consecutive sentence embeddings diverge sharply. More computationally expensive but highest quality.
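As an illustration of the first strategy, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a rough stand-in for model tokens; a real pipeline would count tokens with the tokenizer that matches its embedding model. The function name and the tiny sizes are chosen for readability.

```python
def chunk_fixed(tokens, size, overlap):
    """Split a token list into windows of `size` tokens, each sharing
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace splitting is a rough stand-in for a real tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, the ten words become three chunks, and the last word of each chunk is repeated as the first word of the next. The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.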
Matching chunk strategy to document type
The best chunking strategy depends on your source content:
- Dense technical documentation (API references, spec sheets) — smaller chunks, around 300–400 tokens, so answers don't get buried inside long explanations.
- FAQ pages and help articles — paragraph-level chunking works well; each paragraph usually covers a single topic.
- Long-form guides or tutorials — sentence-window chunking keeps semantic context tight while embedding at a granular level.
- PDFs with tables or structured data — extract tables separately as structured text; cells embedded inside a paragraph produce poor vectors.
Chunking decisions made early have an outsized effect on answer quality for the lifetime of the product. See the tutorials section for worked examples.
How to evaluate vector search quality
You can't tune what you can't measure. Before going to production, build a small evaluation set:
- Write 20–50 representative questions your users will ask.
- Manually identify the chunk(s) containing the correct answer for each question.
- Run retrieval and check whether the correct chunk appears in the top-k results.
- Calculate recall@k and mean reciprocal rank.
A recall@5 above 0.85 is a reasonable bar before connecting an LLM. If you're below that, the problem is usually chunking (too large, bad splits), embedding model mismatch, or missing content.
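Both metrics are simple enough to compute without a framework. The sketch below assumes the per-question definition used in the steps above (a question counts as a hit if any correct chunk appears in the top k); the question ids, chunk ids, and rankings are invented sample data.

```python
def recall_at_k(results, relevant, k):
    """Share of questions with at least one correct chunk in the top k."""
    hits = sum(
        1 for qid, ranked in results.items()
        if set(ranked[:k]) & relevant[qid]
    )
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first correct chunk (0 when none is found)."""
    total = 0.0
    for qid, ranked in results.items():
        for position, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[qid]:
                total += 1.0 / position
                break
    return total / len(results)

# Hand-labelled ground truth: which chunks answer each question.
relevant = {"q1": {"c3"}, "q2": {"c7"}, "q3": {"c1", "c2"}}

# What retrieval actually returned, best match first.
results = {
    "q1": ["c3", "c9", "c4"],  # correct chunk at rank 1
    "q2": ["c5", "c7", "c8"],  # correct chunk at rank 2
    "q3": ["c6", "c4", "c9"],  # miss
}

recall_3 = recall_at_k(results, relevant, k=3)
mrr = mean_reciprocal_rank(results, relevant)
```

Recall@k tells you whether the right chunk made it into the context at all, while MRR rewards putting it near the top. Tracking both makes it easier to tell a missing-content problem from a ranking problem.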
Building a retrieval regression suite
Keep that evaluation set around and run it after every significant change — new content source, updated embedding model, revised chunking. Retrieval regressions are easy to introduce and hard to notice without structured measurement. A simple spreadsheet tracking recall@5 per release takes an hour to set up.
If you're building on Alee, the resources section has templates for building evaluation sets without writing measurement infrastructure from scratch.
Common mistakes teams make with vector search
Using the wrong embedding model for their language
Most teams building for English users are fine with general-purpose embedding models. Teams building for Hindi, Tamil, Arabic, or other languages need to check multilingual model benchmarks explicitly. A model that performs well on English may produce poor embeddings for other scripts, quietly degrading retrieval without obvious errors.
Re-embedding on every deploy
If your embedding model hasn't changed, you don't need to re-embed existing content on every deploy. Store your vectors in a persistent database, track which chunks have been embedded, and only embed new or changed content. Re-embedding at scale is expensive and slow.
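One common way to track which chunks need embedding is to store a content hash next to each vector and compare it on the next ingestion run. The sketch below assumes chunks have stable ids; the ids, texts, and function names are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_embed(current_chunks, stored_hashes):
    """Return ids of chunks that are new or whose text changed.

    current_chunks: {chunk_id: text} from the latest ingestion run
    stored_hashes:  {chunk_id: hash} recorded when each chunk was last embedded
    """
    return [
        chunk_id for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]

stored = {
    "intro": content_hash("Welcome to the docs."),
    "pricing": content_hash("Plans start at $10."),
}
current = {
    "intro": "Welcome to the docs.",     # unchanged
    "pricing": "Plans start at $12.",    # edited
    "faq": "How do refunds work?",       # new
}

pending = chunks_to_embed(current, stored)
```

Only the edited and new chunks end up in `pending`. If you do switch embedding models, the hashes no longer help: every stored vector has to be regenerated regardless of whether the text changed.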
Ignoring metadata filtering
Vector search returns the most semantically similar chunks across your entire knowledge base. If you have multiple chatbots or multiple document sets in one store, you need metadata filters to scope retrieval to the right namespace. This is basic but easy to skip in early prototypes and painful to retrofit later.
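In code, scoping amounts to filtering on metadata before (or while) ranking by similarity. The following in-memory sketch shows the idea; the `namespace` field, row layout, and bot names are assumptions, and a real vector store would apply the filter inside the query itself.

```python
def search(query, rows, namespace, k):
    """Rank only the rows whose metadata matches `namespace`."""
    candidates = [r for r in rows if r["namespace"] == namespace]
    candidates.sort(
        key=lambda r: sum(q * v for q, v in zip(query, r["vector"])),
        reverse=True,
    )
    return [r["id"] for r in candidates[:k]]

rows = [
    {"id": "sales-1", "namespace": "sales-bot", "vector": [0.9, 0.1]},
    {"id": "help-1", "namespace": "support-bot", "vector": [0.8, 0.2]},
    {"id": "help-2", "namespace": "support-bot", "vector": [0.1, 0.9]},
]

hits = search([1.0, 0.0], rows, namespace="support-bot", k=2)
```

Note that `sales-1` is the closest vector overall but never appears in `hits`, because it belongs to a different bot. Without the filter, the support bot would have answered from sales content.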
Setting k too high or too low
Retrieving k=1 is almost never right — the closest vector might be slightly off. Retrieving k=20 floods the LLM with context and risks burying the relevant chunk. For most RAG systems, k=4 to k=8 is a practical starting point. Tune based on your evaluation set.
Not caching frequent queries
Popular questions asked repeatedly don't need to go through the full retrieval-and-generation pipeline every time. Cache the final generated answer (keyed on a normalized version of the question) and serve it instantly. This cuts latency and LLM costs significantly on high-traffic bots. See how caching works in Alee for a production example.
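A minimal sketch of such a cache is shown below. The normalization rules (lowercase, strip punctuation, collapse whitespace) are one reasonable assumption among many, and `pipeline` is a hypothetical stand-in for the full retrieval-and-generation call.

```python
import re

def normalize_question(question):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(cleaned.split())

class AnswerCache:
    def __init__(self):
        self._store = {}
        self.misses = 0  # how many times the full pipeline actually ran

    def get_or_compute(self, question, compute):
        key = normalize_question(question)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(question)
        return self._store[key]

def pipeline(question):
    """Hypothetical stand-in for retrieval plus LLM generation."""
    return f"answer to: {question}"

cache = AnswerCache()
first = cache.get_or_compute("How do I cancel my plan?", pipeline)
second = cache.get_or_compute("  how do i cancel my plan ", pipeline)
```

Both phrasings map to the same key, so the pipeline runs once. This sketch has no expiry; a production cache would also need to be invalidated when the underlying content is re-indexed, or it will keep serving answers from stale documents.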
Neglecting content freshness
Vector search retrieves from what's indexed. If your documentation changes and you don't re-embed updated pages, your bot answers from stale content. Build an ingestion refresh schedule — weekly for stable docs, daily for fast-moving products. See how Alee's automatic re-crawl compares to manual pipelines on the compare page.
Hybrid search: when to combine keyword and vector
Pure vector search occasionally misses things it should catch easily — especially exact phrases, product codes, model numbers, or proper nouns that exist verbatim in your documents but whose embeddings diverge because the surrounding context differs.
Hybrid search solves this by running BM25 (keyword) and vector search in parallel, then merging ranked results. The merge step is usually reciprocal rank fusion (RRF):
```
score = Σ 1 / (k + rank_i)
```
where the sum runs over the result lists, rank_i is the document's 1-based position in list i, and k is a smoothing constant (60 is common) unrelated to the top-k retrieval count. RRF is parameter-light and works well without tuning.
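The formula translates almost directly into code. This sketch assumes 1-based ranks and that a document missing from a list simply contributes nothing for that list; the document ids are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_sku", "doc_a", "doc_b"]  # e.g. from BM25
vector_hits = ["doc_a", "doc_c", "doc_sku"]   # e.g. from vector search

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

`doc_a` comes out on top because it ranks well in both lists, followed by `doc_sku`, which is first in one list but third in the other. Because only ranks are used, there is no need to rescale BM25 scores and cosine similarities onto a common range.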
Most production RAG systems worth their salt use hybrid search. pgvector paired with Postgres full-text search is a clean way to implement it without adding a second data store. Dedicated vector databases like Weaviate and Qdrant have hybrid search built in.
When pure vector search is enough
If your knowledge base is primarily natural-language content — support articles, FAQs, long-form guides — pure vector search performs well and is simpler to maintain. Add keyword search when users search for exact identifiers (order IDs, error codes, product names) and your answer quality drops. Monitor failed-retrieval logs; they'll tell you when hybrid becomes necessary.
Choosing a vector store: a practical comparison
| Store | Best fit | Managed option | Postgres-compatible |
|---|---|---|---|
| pgvector | Teams already on Postgres, <5M vectors | Via Supabase / Neon | Yes |
| Pinecone | Fast setup, no infra management | Yes (only) | No |
| Qdrant | Self-hosted, high-volume, open source | Yes (cloud) | No |
| Weaviate | Multi-modal, built-in hybrid search | Yes (cloud) | No |
| Chroma | Local dev, prototyping | No | No |
| Redis VSS | Low-latency, already using Redis | Via Redis Cloud | No |
Alee uses pgvector under the hood, which means your knowledge base lives in a battle-tested relational database with proper backups, migrations, and the ability to query both structured metadata and vectors in a single query. You can read more about the full architecture on the features page.
What vector search can't do
- It doesn't improve content that isn't there. If your knowledge base has gaps, vector search retrieves the least-wrong chunk. The fix is more content, not better retrieval.
- It's approximate, not exact. ANN indexes can miss the true nearest neighbor, especially when query-time settings such as `ef_search` are tuned for speed. Brute-force exact search works for small collections (under 100k vectors) but doesn't scale.
- It can't handle structured queries. "Show me all customers who paid more than $500 in the last 30 days" is a SQL question, not a vector search question.
- Embeddings go stale. If you switch embedding models, all stored vectors become invalid. Plan model upgrades carefully.
- It won't compensate for a poorly designed prompt. Even perfect retrieval fails if the prompt doesn't instruct the LLM how to use the retrieved context. Retrieval and generation need to be co-designed.
How Alee uses vector search for your chatbot
When you add a source — a website URL, a sitemap, a PDF, a YouTube transcript — here's what happens:
- Content is extracted and cleaned.
- Split into chunks using a strategy tuned for the document type.
- Each chunk is embedded using a high-quality embedding model.
- Vectors are stored in a pgvector index with source metadata.
- On every chat message, the question is embedded, k-nearest chunks are retrieved, and an LLM answers with source citations.
- Repeat questions are served from cache instantly.
You don't configure any of this. Paste your URL, click train, and your bot is ready in minutes. Pricing starts at free — see the plans or explore what's possible before you commit.
Key takeaways
- Vector search is a method of finding information by semantic similarity using mathematical representations (vectors) of text, rather than keyword matching.
- Embeddings convert text into vectors; similar meanings produce vectors that are mathematically close.
- Vector search is the retrieval layer inside every RAG chatbot — it determines which chunks of your content an LLM sees before answering.
- Chunking strategy is the most impactful variable in retrieval quality; test fixed-size, sentence-window, and semantic chunking approaches.
- For most teams, pgvector on Postgres is the right starting point; specialized vector databases make sense at high scale.
- Hybrid search (vector + keyword) often outperforms either alone in production, especially when queries include exact identifiers.
- Evaluate retrieval quality explicitly before connecting an LLM — recall@k on a manual question set catches problems early.
- Vector search can't fix missing content, structured queries, or stale embeddings from a switched model.
Ready to see vector search working in a real chatbot on your content? [Start free — no credit card, trained in minutes.](/signup)
---
Frequently asked questions
What is vector search vs semantic search?
The terms are often used interchangeably. Strictly speaking, "semantic search" describes the goal (finding results by meaning), while "vector search" describes the implementation mechanism (similarity search over dense vector embeddings). In practice, when someone says "semantic search" in a technical context, they almost always mean a system built on vector search.
Do I need a specialized vector database, or can I use Postgres?
For most products up to a few million vectors, pgvector on Postgres is entirely sufficient. It handles concurrent reads, integrates naturally with your existing schema and migrations, and doesn't add operational complexity. Dedicated vector databases offer advantages at very high scale, multi-modal use cases, or when you need built-in hybrid search without custom plumbing.
How does chunking affect vector search accuracy?
Significantly. The embedding model encodes the meaning of each chunk as a single vector, so if a chunk is too long, its vector becomes a blur of several topics and retrieval precision drops. If it's too short, key context is split across chunks and individual chunks may not have enough signal. Start with 400–600 token chunks with a 50–100 token overlap, then measure recall on real questions and adjust.
Can vector search handle languages other than English?
Yes, but you need to choose an embedding model trained on your target languages. General-purpose models trained mainly on English produce poor embeddings for other scripts. Several embedding models and open-source variants explicitly support Hindi, Arabic, Spanish, Japanese, and many other languages. If you're building for a multilingual audience, test models on your actual language mix before committing to one.
What's the difference between vector search and full-text search?
Full-text search (BM25, tsvector in Postgres, Elasticsearch) matches documents by the words they contain, ranked by term frequency and inverse document frequency. It's exact and fast. Vector search matches by meaning, handling synonyms, paraphrases, and related concepts that share no words. The best production RAG systems run both and merge results — this hybrid approach consistently outperforms either method alone on natural-language Q&A tasks.