Every RAG tutorial uses a different vector database. ChromaDB here, Pinecone there, Weaviate somewhere else. How do you actually choose?
As data engineers, we evaluate databases all the time. Vector databases are no different. Let’s apply the same criteria we use for any data store.
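Before comparing products, it helps to remember what every one of these systems does at its core: nearest-neighbor search over embeddings. A brute-force sketch (cosine similarity, illustrative names — real databases replace the linear scan with an ANN index) fits in a few lines:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Exact k-NN: score every vector, keep the k best. O(n) per query --
    exactly the cost that ANN indexes (HNSW, IVF) are built to avoid."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

corpus = {"doc1": [0.1, 0.2], "doc2": [0.9, 0.1], "doc3": [0.2, 0.1]}
print(top_k([0.1, 0.2], corpus, k=2))
```

Every API in this post is some variation on `top_k`; the differences are in how the index is built, stored, and scaled.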
The Contenders
We’ll compare four options that cover the spectrum:
- ChromaDB: Open source, embedded, great for development
- Pinecone: Managed service, production-ready, pay-per-use
- Weaviate: Open source, self-hosted or cloud, feature-rich
- pgvector: PostgreSQL extension, familiar, integrates with existing infrastructure
Evaluation Criteria
The same questions we ask about any database:
- Deployment model: Embedded, self-hosted, or managed?
- Scaling characteristics: How does it handle growth?
- Query performance: Latency at various scales?
- Operational complexity: What does it take to run in production?
- Cost: Total cost of ownership?
- Ecosystem integration: How well does it fit your stack?
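If you want to make the comparison explicit, the criteria above translate directly into a weighted scoring matrix. The weights and scores below are placeholders, not a recommendation — plug in your own:

```python
# Hypothetical weights over the six criteria; adjust to your priorities.
weights = {
    "deployment": 0.15,
    "scaling": 0.20,
    "performance": 0.20,
    "ops_complexity": 0.20,
    "cost": 0.15,
    "integration": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-5) into one weighted total."""
    return sum(weights[c] * s for c, s in scores.items())

# Example: illustrative scores for one candidate.
pgvector = {"deployment": 5, "scaling": 3, "performance": 3,
            "ops_complexity": 5, "cost": 5, "integration": 5}
print(round(weighted_score(pgvector), 2))  # -> 4.2
```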
ChromaDB
ChromaDB is the SQLite of vector databases. Embedded, zero-config, perfect for getting started.
```python
import chromadb

# Persistent client writes to local disk; no server process required.
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("documents")

# Embeddings truncated for brevity.
collection.add(
    ids=["doc1", "doc2"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["First document", "Second document"],
)

results = collection.query(
    query_embeddings=[[0.1, 0.2, ...]],
    n_results=5,
)
```
Strengths:
- Zero configuration
- Runs in-process
- Great for prototyping and testing
- Open source
Limitations:
- Single-node only
- No built-in replication
- Performance degrades past ~1M vectors
Best for: Development, testing, small-scale production (<1M vectors).
Pinecone
Pinecone is fully managed. You don’t run servers; you call an API.
```python
import pinecone

# Legacy pinecone-client (v2) API shown here; newer SDK versions
# use `from pinecone import Pinecone` instead of `pinecone.init`.
pinecone.init(api_key="your-key", environment="us-west1-gcp")
index = pinecone.Index("documents")

# Embeddings truncated for brevity.
index.upsert(
    vectors=[
        {"id": "doc1", "values": [0.1, 0.2, ...], "metadata": {"source": "web"}},
        {"id": "doc2", "values": [0.3, 0.4, ...], "metadata": {"source": "pdf"}},
    ]
)

results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True,
)
```
Strengths:
- Zero operational overhead
- Scales automatically
- Low latency at any scale
- Metadata filtering
Limitations:
- Vendor lock-in
- Cost can grow quickly at scale
- Data leaves your infrastructure
Best for: Teams without infrastructure expertise, fast time-to-market.
Weaviate
Weaviate is a full-featured vector database you can self-host or use as a managed service.
```python
import weaviate

# v3 Python client API shown here; the v4 client uses
# weaviate.connect_to_local() and a collections-based interface.
client = weaviate.Client("http://localhost:8080")

# "vectorizer": "none" means we supply our own embeddings.
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]},
    ],
})

client.data_object.create(
    {"content": "First document", "source": "web"},
    "Document",
    vector=[0.1, 0.2, ...],
)

results = (
    client.query.get("Document", ["content", "source"])
    .with_near_vector({"vector": [0.1, 0.2, ...]})
    .with_limit(5)
    .do()
)
```
Strengths:
- Rich query language (GraphQL)
- Built-in vectorizers (optional)
- Horizontal scaling
- Hybrid search (vector + keyword)
Limitations:
- More complex to operate
- Steeper learning curve
- Resource-intensive
Best for: Teams with infrastructure expertise needing advanced features.
pgvector
pgvector adds vector search to PostgreSQL. If you already run Postgres, this might be all you need.
```sql
CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)  -- dimension must match your embedding model
);

INSERT INTO documents (content, embedding)
VALUES ('First document', '[0.1, 0.2, ...]');

-- <-> is L2 (Euclidean) distance; <=> is cosine distance
SELECT content, embedding <-> '[0.1, 0.2, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;
```
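One operational note: without an index, that query is a sequential scan over every row. pgvector supports approximate indexes; the syntax below assumes a pgvector version with HNSW support (added in 0.5.0) and uses the L2 operator class to match `<->`:

```sql
-- Approximate nearest-neighbor index (HNSW); vector_l2_ops matches <->
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);

-- Older alternative: IVFFlat (requires choosing a list count up front)
-- CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```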
Strengths:
- Uses existing PostgreSQL infrastructure
- ACID transactions
- Familiar SQL interface
- No new operational burden
Limitations:
- Performance ceiling lower than purpose-built solutions
- Limited to PostgreSQL scaling patterns
- Fewer vector-specific optimizations
Best for: Teams already running PostgreSQL, simpler use cases.
Benchmark Results
We ran a simple benchmark: insert 100K vectors, then query with varying batch sizes.
| Database | Insert Time | Query Latency (p50) | Query Latency (p99) |
|---|---|---|---|
| ChromaDB | 45s | 12ms | 35ms |
| Pinecone | 120s | 8ms | 15ms |
| Weaviate | 60s | 10ms | 25ms |
| pgvector | 90s | 18ms | 50ms |
These numbers are directional. Your results will vary based on hardware, vector dimensions, and query patterns.
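For reference, the harness was nothing fancy — roughly the shape below, shown here against an in-memory stand-in store (the real runs swapped in each database's client; all names are illustrative):

```python
import random
import statistics
import time

class BruteForceStore:
    """In-memory stand-in for a vector DB client, used to shape the harness."""
    def __init__(self):
        self.vectors: dict[str, list[float]] = {}

    def upsert(self, ids, embeddings):
        self.vectors.update(zip(ids, embeddings))

    def query(self, embedding, k=5):
        def dist(v):  # squared L2 distance is enough for ranking
            return sum((a - b) ** 2 for a, b in zip(embedding, v))
        return sorted(self.vectors, key=lambda i: dist(self.vectors[i]))[:k]

def benchmark(store, n_vectors=1000, dim=32, n_queries=50):
    """Time bulk insert, then collect per-query latencies."""
    rng = random.Random(42)
    ids = [f"doc{i}" for i in range(n_vectors)]
    embs = [[rng.random() for _ in range(dim)] for _ in range(n_vectors)]

    t0 = time.perf_counter()
    store.upsert(ids, embs)
    insert_s = time.perf_counter() - t0

    latencies = []
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        t0 = time.perf_counter()
        store.query(q, k=5)
        latencies.append(time.perf_counter() - t0)

    return insert_s, statistics.median(latencies), max(latencies)

insert_s, p50, worst = benchmark(BruteForceStore())
print(f"insert={insert_s:.3f}s p50={p50 * 1000:.2f}ms worst={worst * 1000:.2f}ms")
```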
Decision Framework
Choose ChromaDB if:
- You’re prototyping or building a demo
- Your dataset is under 1M vectors
- You want minimal setup
Choose Pinecone if:
- You need production reliability without ops investment
- Time-to-market is critical
- You’re okay with managed service costs
Choose Weaviate if:
- You need advanced features (hybrid search, GraphQL)
- You have infrastructure expertise
- You want self-hosted with cloud option
Choose pgvector if:
- You already run PostgreSQL
- Your queries combine vector search with relational data
- Simplicity trumps optimization
The Real Answer
Start with ChromaDB for development. It’s free, fast to set up, and good enough to validate your approach.
When you’re ready for production, your choice depends on your team:
- No dedicated infrastructure team? Pinecone.
- Strong infrastructure team? Weaviate or pgvector.
- Already invested in PostgreSQL? pgvector.
Don’t overthink it. You can migrate later. The vector database is rarely the bottleneck—your chunking strategy and embedding model matter more.
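One way to keep that migration cheap is to hide the database behind a thin interface from day one. A sketch using a Python `Protocol` (names are illustrative; each backend gets its own small adapter):

```python
from typing import Protocol

class VectorStore(Protocol):
    """The narrow surface a RAG pipeline actually needs from a vector DB."""
    def upsert(self, ids: list[str], embeddings: list[list[float]],
               documents: list[str]) -> None: ...
    def search(self, embedding: list[float], k: int) -> list[str]: ...

class InMemoryStore:
    """Trivial backend satisfying VectorStore; a ChromaDB or Pinecone
    adapter would wrap its client behind the same two methods."""
    def __init__(self):
        self.data: dict[str, tuple[list[float], str]] = {}

    def upsert(self, ids, embeddings, documents):
        self.data.update(zip(ids, zip(embeddings, documents)))

    def search(self, embedding, k):
        def dist(vec):  # squared L2 distance, fine for ranking
            return sum((a - b) ** 2 for a, b in zip(embedding, vec))
        ranked = sorted(self.data, key=lambda i: dist(self.data[i][0]))
        return [self.data[i][1] for i in ranked[:k]]

store: VectorStore = InMemoryStore()
store.upsert(["d1", "d2"], [[0.0, 0.0], [1.0, 1.0]], ["hello", "world"])
print(store.search([0.1, 0.1], k=1))  # prints ['hello']
```

Swapping backends then means writing one adapter, not rewriting the pipeline.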
What’s Next
We’ve built a pipeline, versioned our prompts, and picked a database. But how do we know if it’s actually working? In the final post, we’ll add observability to trace every step of our LLM application.
This is Part 4 of the “Data Engineering Meets AI” series. Read Part 3: Airflow RAG Pipeline