Ollama¶
The Ollama provider points locus at a local model runtime. Ollama runs open-weight models on your laptop or a shared GPU box; locus calls it over HTTP exactly the way it would call OpenAI or Anthropic. No API key, no network egress, no per-token billing.
This is the right pick for offline development, reproducible tests, and iterating on agent design before you spend a dollar on hosted inference.
When to pick Ollama¶
| You want… | Recommendation |
|---|---|
| To develop offline — laptop, plane, isolated network | ✓ |
| Reproducible tests — same prompt + seed → same output | ✓ |
| Cost-free agent iteration before swapping to a paid API | ✓ |
| Privacy-sensitive prototyping where data can't leave the machine | ✓ |
| A frontier model (GPT-5, Claude Opus 4) | use OpenAI or Anthropic |
| Production-scale concurrency | use OCI, OpenAI, Anthropic |
Getting started¶
1. Install Ollama and pull a model¶
Ollama itself isn't a Python package — it's a small binary that runs a local HTTP server.
# macOS (Homebrew) — or download from ollama.com
brew install ollama
# Start the server in the background:
ollama serve &
# Pull a model with native tool-calling support:
ollama pull llama3.3
ollama list will show what you've pulled. Anything in that list is
addressable from locus immediately.
2. Wire locus¶
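Point an agent at the model you just pulled. A minimal sketch, assuming a top-level Agent import (adjust the path to your locus installation):

```python
# The import path is an assumption; adjust to match your locus installation.
from locus import Agent

# "ollama:<model>" addresses any model that appears in `ollama list`.
agent = Agent(model="ollama:llama3.3")
```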
That's it. No env vars, no auth — Ollama is local-first by default.
3. Run it¶
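A quick smoke test, assuming the agent built in step 2. agent.run yields events asynchronously, so it needs an event loop:

```python
import asyncio

async def main():
    # Iterate the event stream from the local model; chunk-level
    # handling is shown in the streaming example further down this page.
    async for event in agent.run("Name three uses of a local LLM."):
        print(event)

asyncio.run(main())
```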
Done. Streaming and tool calling work the same as for any other provider — provided the model you pulled supports them.
What you get out of the box¶
Any pulled local model — no locus change needed¶
The model_id after ollama: is whatever appears in ollama list.
locus doesn't maintain an allow-list; if Ollama can run it, locus
can address it.
agent_a = Agent(model="ollama:llama3.3")
agent_b = Agent(model="ollama:qwen2.5-coder:32b")
agent_c = Agent(model="ollama:deepseek-r1:14b")
Real local streaming¶
Ollama emits SSE-shaped chunks; locus reads them as ModelChunkEvents
just like any other provider. Token-level streaming over localhost
is fast — typically <5 ms per chunk.
async for event in agent.run("Write a haiku about caching."):
if isinstance(event, ModelChunkEvent) and event.content:
print(event.content, end="", flush=True)
Tool calling — model-dependent¶
Ollama supports tool calling for models that emit it natively. As of writing:
| Model family | Tool calling |
|---|---|
| llama3.1 / llama3.2 / llama3.3 | ✓ |
| llama4 | ✓ |
| qwen2.5 / qwen2.5-coder / qwen3 | ✓ |
| mistral / mixtral | ✓ |
| deepseek-r1 | ✓ (with reasoning) |
| phi3 | ✗ — no native tool calling |
If a model doesn't support tool calling, the agent will still run —
it just won't be able to invoke any @tool you defined. The loop
then terminates after the first turn (no tools called, no follow-up
needed).
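As a sketch of the happy path, here is a tool-capable model paired with a plain function tool. The from locus import tool path and the bare-decorator usage are assumptions; check the tools documentation for the real signature:

```python
# Import path and decorator usage are assumptions; see the tools docs.
from locus import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# llama3.3 emits native tool calls (see the table above); phi3 would not.
agent = Agent(
    model="ollama:llama3.3",
    tools=[word_count],
    system_prompt="Use word_count when the user asks about length.",
)
```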
No auth — by design¶
Ollama listens on localhost:11434 with no authentication. That's intentional for the local-first use case. To run against a shared remote Ollama, point the model at that host instead of the implicit localhost default.
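A minimal sketch; the host and model_id keyword names, and passing a model instance straight to Agent, are all assumptions, so check the OllamaModel signature in src/locus/models/native/ollama.py before copying:

```python
# Module path mirrors the Source section below; parameter names are assumptions.
from locus.models.native.ollama import OllamaModel
from locus import Agent  # import path is an assumption

model = OllamaModel(
    model_id="llama3.3",
    host="http://gpu-box.internal:11434",  # assumed keyword; illustrative hostname
)
agent = Agent(model=model)
```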
The same OllamaModel class talks to any HTTP-reachable Ollama
endpoint. (If you're exposing a remote Ollama, put it behind a VPN
or auth proxy yourself — Ollama doesn't ship one.)
Practical workflow — develop local, ship hosted¶
A common pattern: prototype an agent against Ollama for free, then swap one line to point at OCI / OpenAI / Anthropic for production.
# Development:
agent = Agent(model="ollama:llama3.3", tools=[...], system_prompt="...")
# Production — same agent, swap the model id:
agent = Agent(model="oci:openai.gpt-5.5", tools=[...], system_prompt="...")
Everything else — tools, hooks, checkpointers, termination, RAG — stays identical. You're not coupled to the local runtime; Ollama is just a model address.
Common gotchas¶
| Symptom | Likely cause |
|---|---|
| Connection refused on localhost:11434 | Ollama server isn't running. ollama serve & in another terminal. |
| model 'X' not found | Haven't pulled it yet. ollama pull X. |
| Slow first response after hours of idle | Ollama unloads models from VRAM after inactivity. The first call after a long pause re-loads (a few seconds). |
| Tool calls never fire | The model you pulled doesn't support tools (e.g. phi3). Switch to llama3.3 or qwen2.5. |
| tool_calls parsed as text instead of structured | Some Ollama versions emit XML-style `<tool_call>{...}</tool_call>` blocks. Update Ollama (brew upgrade ollama) or use a model with stable structured tool-call output. |
| Different output every run despite the same prompt | Set temperature=0 and pin seed in model_config (see the sketch below this table). |
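For the reproducibility row, a sketch of pinning sampling parameters. The dict shape of model_config and the exact option names are assumptions borrowed from Ollama's own options, so verify them against the locus model docs:

```python
# model_config shape and option names are assumptions; verify before relying on them.
agent = Agent(
    model="ollama:llama3.3",
    model_config={"temperature": 0, "seed": 42},
)
```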
Source¶
OllamaModel in src/locus/models/native/ollama.py
See also¶
- Models overview — the full provider tree.
- OpenAI — GPT family direct.
- OCI Generative AI — production-scale OCI inference.