The Question Everyone Asks

“Which model should I use?”

It’s the most common question after “How do I install this?” The answer is: it depends on what you’re doing.

For Ingestion: Speed and Context

Ingestion is the most token-intensive operation. Two things matter:

Recommended:

DeepSeek V4-Flash — Lowest cost at $0.14/M tokens. Ideal for batch ingestion.
Gemini-3.5-Flash — 4× faster output than GPT-5.5.

Query operations are less token-intensive. Answer quality matters more than speed.

Recommended:

Start with DeepSeek for ingestion, switch to Claude for query. Best of both worlds.

Use Ollama for query, not ingestion. Local models have smaller context windows (8K–128K) — fine for querying, not for processing large sources.

Watch for rate limits. HTTP 429 errors → lower concurrency to 1–2, increase batch delay to 500–800ms.