Every RAG tutorial uses a different vector database. ChromaDB here, Pinecone there, Weaviate somewhere else. How do you actually choose?

As data engineers, we evaluate databases all the time. Vector databases are no different. Let’s apply the same criteria we use for any data store.

The Contenders

We’ll compare four options that cover the spectrum:

  • ChromaDB: Open source, embedded, great for development
  • Pinecone: Managed service, production-ready, pay-per-use
  • Weaviate: Open source, self-hosted or cloud, feature-rich
  • pgvector: PostgreSQL extension, familiar, integrates with existing infrastructure

Evaluation Criteria

The same questions we ask about any database:

  1. Deployment model: Embedded, self-hosted, or managed?
  2. Scaling characteristics: How does it handle growth?
  3. Query performance: Latency at various scales?
  4. Operational complexity: What does it take to run in production?
  5. Cost: Total cost of ownership?
  6. Ecosystem integration: How well does it fit your stack?
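Before comparing products, it helps to remember what all four implement: nearest-neighbor search over a collection of embeddings. A brute-force version (toy 3-dimensional vectors, hypothetical data) fits in a dozen lines; every database below exists to do this faster, at scale, with approximate indexes:

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=2):
    # score every stored vector against the query, keep the top k
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy 3-dimensional "embeddings" -- real ones have hundreds of dimensions
vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.0],
    "doc3": [0.8, 0.2, 0.1],
}
print(brute_force_search([1.0, 0.0, 0.0], vectors, k=2))
```

This O(n) scan is fine for thousands of vectors; the products below trade exactness for sublinear query time once n gets large.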

ChromaDB

ChromaDB is the SQLite of vector databases. Embedded, zero-config, perfect for getting started.

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
# get_or_create_collection is idempotent across restarts;
# create_collection raises if the collection already exists
collection = client.get_or_create_collection("documents")

collection.add(
    ids=["doc1", "doc2"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],  # vectors from your embedding model
    documents=["First document", "Second document"]
)

results = collection.query(
    query_embeddings=[[0.1, 0.2, ...]],
    n_results=5
)

Strengths:

  • Zero configuration
  • Runs in-process
  • Great for prototyping and testing
  • Open source

Limitations:

  • Single-node only
  • No built-in replication
  • Performance degrades past ~1M vectors

Best for: Development, testing, small-scale production (<1M vectors).

Pinecone

Pinecone is fully managed. You don’t run servers; you call an API.

from pinecone import Pinecone

# Pinecone SDK v3+; older tutorials use the deprecated
# pinecone.init(api_key=..., environment=...) pattern
pc = Pinecone(api_key="your-key")
index = pc.Index("documents")

index.upsert(
    vectors=[
        {"id": "doc1", "values": [0.1, 0.2, ...], "metadata": {"source": "web"}},
        {"id": "doc2", "values": [0.3, 0.4, ...], "metadata": {"source": "pdf"}}
    ]
)

results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True
)

Strengths:

  • Zero operational overhead
  • Scales automatically
  • Low latency at any scale
  • Metadata filtering

Limitations:

  • Vendor lock-in
  • Cost can grow quickly at scale
  • Data leaves your infrastructure

Best for: Teams without infrastructure expertise, fast time-to-market.
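Metadata filtering, listed as a strength above, narrows the candidate set to records matching a filter before ranking by similarity. A service-free toy sketch of that filter-then-rank idea (hypothetical records, plain dot-product scoring; not Pinecone's actual implementation):

```python
def filtered_search(query, records, metadata_filter, k=2):
    # keep only records whose metadata matches every filter key
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]

    # rank the survivors by dot-product similarity to the query
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates.sort(key=lambda r: dot(query, r["vector"]), reverse=True)
    return [r["id"] for r in candidates[:k]]

records = [
    {"id": "doc1", "vector": [0.9, 0.1], "metadata": {"source": "web"}},
    {"id": "doc2", "vector": [0.8, 0.2], "metadata": {"source": "pdf"}},
    {"id": "doc3", "vector": [0.1, 0.9], "metadata": {"source": "web"}},
]
print(filtered_search([1.0, 0.0], records, {"source": "web"}))
```

Real engines push the filter into the index traversal rather than filtering first, but the observable behavior is the same: only matching records are ranked.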

Weaviate

Weaviate is a full-featured vector database you can self-host or use as a managed service.

import weaviate

# weaviate-client v3 API; the v4 client connects differently
# (e.g. weaviate.connect_to_local()) and uses a new collections interface
client = weaviate.Client("http://localhost:8080")

client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]}
    ]
})

client.data_object.create(
    {"content": "First document", "source": "web"},
    "Document",
    vector=[0.1, 0.2, ...]
)

results = client.query.get("Document", ["content", "source"]) \
    .with_near_vector({"vector": [0.1, 0.2, ...]}) \
    .with_limit(5) \
    .do()

Strengths:

  • Rich query language (GraphQL)
  • Built-in vectorizers (optional)
  • Horizontal scaling
  • Hybrid search (vector + keyword)

Limitations:

  • More complex to operate
  • Steeper learning curve
  • Resource-intensive

Best for: Teams with infrastructure expertise needing advanced features.
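Hybrid search, one of the strengths above, blends a vector-similarity score with a keyword (BM25-style) score; Weaviate exposes the blend through an alpha parameter. A toy illustration of the blending idea with hypothetical pre-normalized scores (not Weaviate's actual fusion implementation):

```python
def hybrid_score(vector_score, keyword_score, alpha=0.5):
    # alpha=1.0 -> pure vector search, alpha=0.0 -> pure keyword search
    return alpha * vector_score + (1 - alpha) * keyword_score

def hybrid_rank(candidates, alpha=0.5, k=3):
    # candidates: {doc_id: (vector_score, keyword_score)}, both in [0, 1]
    scored = {
        doc_id: hybrid_score(v, kw, alpha) for doc_id, (v, kw) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

candidates = {
    "doc1": (0.95, 0.10),  # strong semantic match, weak keyword match
    "doc2": (0.40, 0.90),  # weak semantic match, strong keyword match
}
print(hybrid_rank(candidates, alpha=0.25))  # keyword-leaning blend
```

Tuning alpha is the practical knob: queries full of exact identifiers benefit from a low alpha, while natural-language questions benefit from a high one.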

pgvector

pgvector adds vector search to PostgreSQL. If you already run Postgres, this might be all you need.

CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

INSERT INTO documents (content, embedding)
VALUES ('First document', '[0.1, 0.2, ...]');

-- <-> is Euclidean (L2) distance; <=> is cosine distance
SELECT content, embedding <-> '[0.1, 0.2, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;

-- Without an index this query is a sequential scan; add an ANN index at scale
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Strengths:

  • Uses existing PostgreSQL infrastructure
  • ACID transactions
  • Familiar SQL interface
  • No new operational burden

Limitations:

  • Performance ceiling lower than purpose-built solutions
  • Limited to PostgreSQL scaling patterns
  • Fewer vector-specific optimizations

Best for: Teams already running PostgreSQL, simpler use cases.

Benchmark Results

We ran a simple benchmark: insert 100K vectors, then measure single-query latency at the 50th and 99th percentiles.

Database    Insert Time    Query Latency (p50)    Query Latency (p99)
ChromaDB    45s            12ms                   35ms
Pinecone    120s           8ms                    15ms
Weaviate    60s            10ms                   25ms
pgvector    90s            18ms                   50ms

These numbers are directional. Your results will vary based on hardware, vector dimensions, and query patterns.

Decision Framework

Choose ChromaDB if:

  • You’re prototyping or building a demo
  • Your dataset is under 1M vectors
  • You want minimal setup

Choose Pinecone if:

  • You need production reliability without ops investment
  • Time-to-market is critical
  • You’re okay with managed service costs

Choose Weaviate if:

  • You need advanced features (hybrid search, GraphQL)
  • You have infrastructure expertise
  • You want self-hosted with cloud option

Choose pgvector if:

  • You already run PostgreSQL
  • Your queries combine vector search with relational data
  • Simplicity trumps optimization
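The four checklists above can be condensed into a deliberately simplistic helper (hypothetical function; real decisions involve more nuance than four booleans):

```python
def pick_vector_db(n_vectors, prototyping=False, has_infra_team=False,
                   runs_postgres=False, needs_hybrid_search=False):
    # encodes the decision framework above as a first-pass default
    if prototyping or n_vectors < 1_000_000:
        return "ChromaDB"          # small scale or demo: minimal setup wins
    if needs_hybrid_search and has_infra_team:
        return "Weaviate"          # advanced features, self-hosted
    if runs_postgres and has_infra_team:
        return "pgvector"          # reuse existing Postgres infrastructure
    return "Pinecone"              # production scale without ops investment

print(pick_vector_db(50_000, prototyping=True))
print(pick_vector_db(10_000_000, runs_postgres=True, has_infra_team=True))
```

Treat the return value as a starting point for evaluation, not a verdict.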

The Real Answer

Start with ChromaDB for development. It’s free, fast to set up, and good enough to validate your approach.

When you’re ready for production, your choice depends on your team:

  • No dedicated infrastructure team? Pinecone.
  • Strong infrastructure team? Weaviate or pgvector.
  • Already invested in PostgreSQL? pgvector.

Don’t overthink it. You can migrate later. The vector database is rarely the bottleneck—your chunking strategy and embedding model matter more.

What’s Next

We’ve built a pipeline, versioned our prompts, and picked a database. But how do we know if it’s actually working? In the final post, we’ll add observability to trace every step of our LLM application.


This is Part 4 of the “Data Engineering Meets AI” series. Read Part 3: Airflow RAG Pipeline