
Ollama

The Ollama provider is locus pointed at a local model runtime. Ollama runs open-weight models on your laptop or a shared GPU box; locus calls it over HTTP exactly the way it would call OpenAI or Anthropic. No API key, no network egress, no per-token billing.

This is the right pick for offline development, reproducible tests, and iterating on agent design before you spend a dollar on hosted inference.

When to pick Ollama

| You want… | This is the right provider |
| --- | --- |
| To develop offline — laptop, plane, isolated network | Ollama |
| Reproducible tests — same prompt + seed → same output | Ollama |
| Cost-free agent iteration before swapping to a paid API | Ollama |
| Privacy-sensitive prototyping where data can't leave the machine | Ollama |
| A frontier model (GPT-5, Claude Opus 4) | OpenAI or Anthropic |
| Production-scale concurrency | OCI, OpenAI, or Anthropic |

Getting started

1. Install Ollama and pull a model

Ollama itself isn't a Python package — it's a small binary that runs a local HTTP server.

# macOS (Homebrew) — or download from ollama.com
brew install ollama

# Start the server in the background:
ollama serve &

# Pull a model with native tool-calling support:
ollama pull llama3.3

ollama list will show what you've pulled. Anything in that list is addressable from locus immediately.
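
If you want to confirm the server is actually up before wiring locus, Ollama answers plain HTTP on port 11434 and its /api/tags endpoint returns the same list as ollama list. A quick stdlib-only check, assuming the default localhost address:

import json
import urllib.request

# Ask the local Ollama server which models have been pulled.
with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    tags = json.load(resp)

for model in tags.get("models", []):
    print(model["name"])  # e.g. llama3.3:latest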

2. Wire locus

from locus import Agent

agent = Agent(model="ollama:llama3.3", system_prompt="You are helpful.")

That's it. No env vars, no auth — Ollama is local-first by default.

3. Run it

result = agent.run_sync("Sum 7 plus 35 in one word.")
print(result.message)
# → '42.'

Done. Streaming and tool calling work the same as for any other provider — provided the model you pulled supports them.

What you get out of the box

Any pulled local model — no locus change needed

The model_id after ollama: is whatever appears in ollama list. locus doesn't maintain an allow-list; if Ollama can run it, locus can address it.

ollama list
# llama3.3:latest
# qwen2.5-coder:32b
# deepseek-r1:14b

agent_a = Agent(model="ollama:llama3.3")
agent_b = Agent(model="ollama:qwen2.5-coder:32b")
agent_c = Agent(model="ollama:deepseek-r1:14b")

Real local streaming

Ollama emits SSE-shaped chunks; locus reads them as ModelChunkEvents just like any other provider. Token-level streaming over localhost is fast — typically <5 ms per chunk.

async for event in agent.run("Write a haiku about caching."):
    if isinstance(event, ModelChunkEvent) and event.content:
        print(event.content, end="", flush=True)
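
To turn that fragment into something you can execute directly, wrap it in an async entry point. A sketch, assuming ModelChunkEvent is importable from the top-level locus package alongside Agent (adjust the import to match your locus version):

import asyncio

from locus import Agent, ModelChunkEvent  # ModelChunkEvent import path is an assumption

agent = Agent(model="ollama:llama3.3", system_prompt="You are helpful.")

async def main() -> None:
    # Print token chunks as they stream back from the local Ollama server.
    async for event in agent.run("Write a haiku about caching."):
        if isinstance(event, ModelChunkEvent) and event.content:
            print(event.content, end="", flush=True)
    print()

asyncio.run(main())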

Tool calling — model-dependent

Ollama supports tool calling for models that emit it natively. As of writing:

| Model family | Tool calling |
| --- | --- |
| llama3.1 / llama3.2 / llama3.3 | ✓ |
| llama4 | ✓ |
| qwen2.5 / qwen2.5-coder / qwen3 | ✓ |
| mistral / mixtral | ✓ |
| deepseek-r1 | ✓ (with reasoning) |
| phi3 | ✗ — no native tool calling |

If a model doesn't support tool calling, the agent will still run — it just won't be able to invoke any @tool you defined. The loop then terminates after the first turn (no tools called, no follow-up needed).
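
For the happy path with a tool-capable model, here is a sketch of what a tool-using agent looks like; the tool decorator import is an assumption based on the @tool mention above, so adjust it to wherever your locus version exposes it:

from locus import Agent, tool  # `tool` import location is an assumption

@tool
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

# llama3.3 has native tool calling, so the agent can actually invoke `add`.
agent = Agent(
    model="ollama:llama3.3",
    tools=[add],
    system_prompt="Use the add tool for any arithmetic.",
)

result = agent.run_sync("What is 19 + 23?")
print(result.message)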

No auth — by design

Ollama listens on localhost:11434 with no authentication. That's intentional for the local-first use case. To run against a shared remote Ollama:

export OLLAMA_HOST=http://gpu-box.internal:11434

The same OllamaModel class talks to any HTTP-reachable Ollama endpoint. (If you're exposing a remote Ollama, put it behind a VPN or auth proxy yourself — Ollama doesn't ship one.)
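
If you'd rather keep that setting in code than in your shell profile, export the variable before the agent is built. A sketch, assuming locus reads OLLAMA_HOST when it constructs the underlying OllamaModel client:

import os

from locus import Agent

# Point locus at a shared Ollama box instead of localhost.
# Set this before the Agent is constructed so the OllamaModel picks it up.
os.environ["OLLAMA_HOST"] = "http://gpu-box.internal:11434"

agent = Agent(model="ollama:llama3.3", system_prompt="You are helpful.")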

Practical workflow — develop local, ship hosted

A common pattern: prototype an agent against Ollama for free, then swap one line to point at OCI / OpenAI / Anthropic for production.

# Development:
agent = Agent(model="ollama:llama3.3", tools=[...], system_prompt="...")

# Production — same agent, swap the model id:
agent = Agent(model="oci:openai.gpt-5.5", tools=[...], system_prompt="...")

Everything else — tools, hooks, checkpointers, termination, RAG — stays identical. You're not coupled to the local runtime; Ollama is just a model address.
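
One way to make that swap a deployment setting rather than a code edit is to read the model id from the environment. LOCUS_MODEL below is a made-up variable name for illustration, not something locus itself recognizes:

import os

from locus import Agent

# Free local model by default; override in production with e.g.
#   export LOCUS_MODEL="oci:openai.gpt-5.5"
model_id = os.environ.get("LOCUS_MODEL", "ollama:llama3.3")

agent = Agent(model=model_id, system_prompt="You are helpful.")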

Common gotchas

| Symptom | Likely cause |
| --- | --- |
| Connection refused on localhost:11434 | Ollama server isn't running. Run ollama serve & in another terminal. |
| model 'X' not found | You haven't pulled it yet: ollama pull X. |
| Slow first response after hours of idle | Ollama unloads models from VRAM after inactivity; the first call after a long pause reloads the model (a few seconds). |
| Tool calls never fire | The model you pulled doesn't support tools (e.g. phi3). Switch to llama3.3 or qwen2.5. |
| tool_calls parsed as text instead of structured | Some Ollama versions emit XML-style <tool_call>{...}</tool_call> blocks. Update Ollama (brew upgrade ollama) or use a model with stable structured tool-call output. |
| Different output every run despite the same prompt | Set temperature=0 and pin the seed in model_config (see the sketch below the table). |
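
For that last row, a sketch of what pinning determinism can look like; the exact model_config keys are an assumption (check your locus version's Ollama options), but temperature 0 plus a fixed seed is the combination Ollama honors:

from locus import Agent

# Greedy decoding plus a pinned seed; repeated runs against the same model
# weights and Ollama version should produce the same output.
agent = Agent(
    model="ollama:llama3.3",
    model_config={"temperature": 0, "seed": 42},  # key names are an assumption
    system_prompt="You are helpful.",
)

print(agent.run_sync("Sum 7 plus 35 in one word.").message)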

Source

OllamaModel in src/locus/models/native/ollama.py

See also