Every RAG tutorial uses a different vector database. ChromaDB here, Pinecone there, Weaviate somewhere else. How do you actually choose?
As data engineers, we evaluate databases all the time. Vector databases are no different. Let’s apply the same criteria we use for any data store.
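Before comparing products, it helps to remember what every one of these systems does at its core: nearest-neighbor search over embeddings. A brute-force sketch (cosine similarity, illustrative names — real databases replace the linear scan with an ANN index) fits in a few lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Exact k-NN: score every vector, keep the k best. O(n) per query --
    exactly the cost that ANN indexes (HNSW, IVF) are built to avoid."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

corpus = {"doc1": [0.1, 0.2], "doc2": [0.9, 0.1], "doc3": [0.2, 0.1]}
print(top_k([0.1, 0.2], corpus, k=2))
```

Every API in this post is some variation on `top_k`; the differences are in how the index is built, stored, and scaled.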
The Contenders
We’ll compare four options that cover the spectrum:
- ChromaDB: Open source, embedded, great for development
- Pinecone: Managed service, production-ready, pay-per-use
- Weaviate: Open source, self-hosted or cloud, feature-rich
- pgvector: PostgreSQL extension, familiar, integrates with existing infrastructure
Evaluation Criteria
The same questions we ask about any database:
- Deployment model: Embedded, self-hosted, or managed?
- Scaling characteristics: How does it handle growth?
- Query performance: Latency at various scales?
- Operational complexity: What does it take to run in production?
- Cost: Total cost of ownership?
- Ecosystem integration: How well does it fit your stack?
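If you want to make the comparison explicit, the criteria above translate directly into a weighted scoring matrix. The weights and scores below are placeholders, not a recommendation — plug in your own:

```python
# Hypothetical weights over the six criteria; adjust to your priorities.
weights = {
    "deployment": 0.15,
    "scaling": 0.20,
    "performance": 0.20,
    "ops_complexity": 0.20,
    "cost": 0.15,
    "integration": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-5) into one weighted total."""
    return sum(weights[c] * s for c, s in scores.items())

# Example: illustrative scores for one candidate.
pgvector = {"deployment": 5, "scaling": 3, "performance": 3,
            "ops_complexity": 5, "cost": 5, "integration": 5}
print(round(weighted_score(pgvector), 2))  # -> 4.2
```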
ChromaDB
ChromaDB is the SQLite of vector databases. Embedded, zero-config, perfect for getting started.
```python
import chromadb

# Persistent client writes to local disk; no server process required.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("documents")

# Embeddings truncated for brevity.
collection.add(
    ids=["doc1", "doc2"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["First document", "Second document"],
)

results = collection.query(
    query_embeddings=[[0.1, 0.2, ...]],
    n_results=5,
)
```
Strengths:
- Zero configuration
- Runs in-process
- Great for prototyping and testing
- Open source
Limitations:
- Single-node only
- No built-in replication
- Performance degrades past ~1M vectors
Best for: Development, testing, small-scale production (<1M vectors).
Pinecone
Pinecone is fully managed. You don’t run servers; you call an API.
```python
import pinecone

# Legacy pinecone-client (v2) API shown here; newer SDK versions
# use `from pinecone import Pinecone` instead of `pinecone.init`.
pinecone.init(api_key="your-key", environment="us-west1-gcp")
index = pinecone.Index("documents")

# Embeddings truncated for brevity.
index.upsert(
    vectors=[
        {"id": "doc1", "values": [0.1, 0.2, ...], "metadata": {"source": "web"}},
        {"id": "doc2", "values": [0.3, 0.4, ...], "metadata": {"source": "pdf"}},
    ]
)

results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True,
)
```
Strengths:
- Zero operational overhead
- Scales automatically
- Low latency at any scale
- Metadata filtering
Limitations:
- Vendor lock-in
- Cost can grow quickly at scale
- Data leaves your infrastructure
Best for: Teams without infrastructure expertise, fast time-to-market.
Weaviate
Weaviate is a full-featured vector database you can self-host or use as a managed service.
```python
import weaviate

# v3 Python client API shown here; the v4 client uses
# weaviate.connect_to_local() and a collections-based interface.
client = weaviate.Client("http://localhost:8080")

# "vectorizer": "none" means we supply our own embeddings.
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]},
    ],
})

client.data_object.create(
    {"content": "First document", "source": "web"},
    "Document",
    vector=[0.1, 0.2, ...],
)

results = (
    client.query.get("Document", ["content", "source"])
    .with_near_vector({"vector": [0.1, 0.2, ...]})
    .with_limit(5)
    .do()
)
```
Strengths:
- Rich query language (GraphQL)
- Built-in vectorizers (optional)
- Horizontal scaling
- Hybrid search (vector + keyword)
Limitations:
- More complex to operate
- Steeper learning curve
- Resource-intensive
Best for: Teams with infrastructure expertise needing advanced features.
pgvector
pgvector adds vector search to PostgreSQL. If you already run Postgres, this might be all you need.
```sql
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)  -- dimension must match your embedding model
);

INSERT INTO documents (content, embedding)
VALUES ('First document', '[0.1, 0.2, ...]');

-- <-> is L2 (Euclidean) distance; <=> is cosine distance
SELECT content, embedding <-> '[0.1, 0.2, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;
```
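One operational note: without an index, that query is a sequential scan over every row. pgvector supports approximate indexes; the syntax below assumes a pgvector version with HNSW support (added in 0.5.0) and uses the L2 operator class to match `<->`:

```sql
-- Approximate nearest-neighbor index (HNSW); vector_l2_ops matches <->
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);

-- Older alternative: IVFFlat (requires choosing a list count up front)
-- CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```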
Strengths:
- Uses existing PostgreSQL infrastructure
- ACID transactions
- Familiar SQL interface
- No new operational burden
Limitations:
- Performance ceiling lower than purpose-built solutions
- Limited to PostgreSQL scaling patterns
- Fewer vector-specific optimizations
Best for: Teams already running PostgreSQL, simpler use cases.
Benchmark Results
We ran a simple benchmark: insert 100K vectors, then query with varying batch sizes.
| Database | Insert Time | Query Latency (p50) | Query Latency (p99) |
|---|---|---|---|
| ChromaDB | 45s | 12ms | 35ms |
| Pinecone | 120s | 8ms | 15ms |
| Weaviate | 60s | 10ms | 25ms |
| pgvector | 90s | 18ms | 50ms |
These numbers are directional. Your results will vary based on hardware, vector dimensions, and query patterns.
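For reference, the harness was nothing fancy — roughly the shape below, shown here against an in-memory stand-in store (the real runs swapped in each database's client; all names are illustrative):

```python
import random
import statistics
import time

class BruteForceStore:
    """In-memory stand-in for a vector DB client, used to shape the harness."""
    def __init__(self):
        self.vectors: dict[str, list[float]] = {}

    def upsert(self, ids, embeddings):
        self.vectors.update(zip(ids, embeddings))

    def query(self, embedding, k=5):
        def dist(v):  # squared L2 distance is enough for ranking
            return sum((a - b) ** 2 for a, b in zip(embedding, v))
        return sorted(self.vectors, key=lambda i: dist(self.vectors[i]))[:k]

def benchmark(store, n_vectors=1000, dim=32, n_queries=50):
    """Time bulk insert, then collect per-query latencies."""
    rng = random.Random(42)
    ids = [f"doc{i}" for i in range(n_vectors)]
    embs = [[rng.random() for _ in range(dim)] for _ in range(n_vectors)]

    t0 = time.perf_counter()
    store.upsert(ids, embs)
    insert_s = time.perf_counter() - t0

    latencies = []
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        t0 = time.perf_counter()
        store.query(q, k=5)
        latencies.append(time.perf_counter() - t0)

    return insert_s, statistics.median(latencies), max(latencies)

insert_s, p50, worst = benchmark(BruteForceStore())
print(f"insert={insert_s:.3f}s p50={p50 * 1000:.2f}ms worst={worst * 1000:.2f}ms")
```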
Decision Framework
Choose ChromaDB if:
- You’re prototyping or building a demo
- Your dataset is under 1M vectors
- You want minimal setup
Choose Pinecone if:
- You need production reliability without ops investment
- Time-to-market is critical
- You’re okay with managed service costs
Choose Weaviate if:
- You need advanced features (hybrid search, GraphQL)
- You have infrastructure expertise
- You want self-hosted with cloud option
Choose pgvector if:
- You already run PostgreSQL
- Your queries combine vector search with relational data
- Simplicity trumps optimization
The Real Answer
Start with ChromaDB for development. It’s free, fast to set up, and good enough to validate your approach.
When you’re ready for production, your choice depends on your team:
- No dedicated infrastructure team? Pinecone.
- Strong infrastructure team? Weaviate or pgvector.
- Already invested in PostgreSQL? pgvector.
Don’t overthink it. You can migrate later. The vector database is rarely the bottleneck—your chunking strategy and embedding model matter more.
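One way to keep that migration cheap is to hide the database behind a thin interface from day one. A sketch using a Python `Protocol` (names are illustrative; each backend gets its own small adapter):

```python
from typing import Protocol

class VectorStore(Protocol):
    """The narrow surface a RAG pipeline actually needs from a vector DB."""
    def upsert(self, ids: list[str], embeddings: list[list[float]],
               documents: list[str]) -> None: ...
    def search(self, embedding: list[float], k: int) -> list[str]: ...

class InMemoryStore:
    """Trivial backend satisfying VectorStore; a ChromaDB or Pinecone
    adapter would wrap its client behind the same two methods."""
    def __init__(self):
        self.data: dict[str, tuple[list[float], str]] = {}

    def upsert(self, ids, embeddings, documents):
        self.data.update(zip(ids, zip(embeddings, documents)))

    def search(self, embedding, k):
        def dist(vec):  # squared L2 distance, fine for ranking
            return sum((a - b) ** 2 for a, b in zip(embedding, vec))
        ranked = sorted(self.data, key=lambda i: dist(self.data[i][0]))
        return [self.data[i][1] for i in ranked[:k]]

store: VectorStore = InMemoryStore()
store.upsert(["d1", "d2"], [[0.0, 0.0], [1.0, 1.0]], ["hello", "world"])
print(store.search([0.1, 0.1], k=1))  # prints ['hello']
```

Swapping backends then means writing one adapter, not rewriting the pipeline.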
What’s Next
We’ve built a pipeline, versioned our prompts, and picked a database. But how do we know if it’s actually working? In the final post, we’ll add observability to trace every step of our LLM application.
This is Part 4 of the “Data Engineering Meets AI” series. Read Part 3: Airflow RAG Pipeline