
Multi-modal providers

The model is only one provider an agent depends on. Production agents pull from more: a web index, a page fetcher, an image renderer, a speech synthesiser. locus exposes these as a small set of Protocol types under locus.providers, plus an opt-in auto-registration step that turns each one into a model-callable tool.

from locus.agent import Agent
from locus.providers.web_fetch import HTTPXWebFetcher
from locus.providers.web_search import OpenAISearchPreviewProvider
from locus.providers.image import OpenAIImageProvider
from locus.providers.speech import OpenAISpeechProvider
from locus.models.native.openai import OpenAIModel

agent = Agent(
    model="openai:gpt-4o-mini",
    web_search=OpenAISearchPreviewProvider(OpenAIModel("gpt-4o-search-preview")),
    web_fetch=HTTPXWebFetcher(),
    image_generator=OpenAIImageProvider(model="dall-e-3"),
    speech_provider=OpenAISpeechProvider(),
)

Setting any of those four kwargs on Agent (or AgentConfig) registers a matching @tool:

Provider kwarg      Auto-registered tool(s)     Signature
web_search=         web_search                  query: str, max_results: int = 5
web_fetch=          web_fetch                   url: str, max_chars: int = 50000
image_generator=    generate_image              prompt: str, size: str = "1024x1024", n: int = 1
speech_provider=    speak and/or transcribe     depends on provider.capabilities
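
Under the hood, each row is just a tool factory wrapping the provider. As a rough sketch of the idea — the factory name, the @tool import path, and the SearchResult field names (title, url) are all assumptions, not locus's actual code:

from locus.tools import tool  # assumed import path for locus's @tool decorator
from locus.providers.types import SearchResult

def make_web_search_tool(provider):
    # Hypothetical factory: wraps the provider's search() in a model-callable
    # tool whose signature matches the table above.
    @tool
    async def web_search(query: str, max_results: int = 5) -> str:
        results: list[SearchResult] = await provider.search(query, max_results=max_results)
        # Flatten results into a string the model can read; the field names
        # on SearchResult are assumptions.
        return "\n".join(f"{r.title}: {r.url}" for r in results)
    return web_search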

The model can call these alongside hand-written @tool functions — they share the same registry, the same idempotency machinery, the same hooks.
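
For instance, a hand-written tool and a provider-backed one can sit on the same agent. A minimal sketch, assuming @tool lives at locus.tools and Agent accepts a tools= kwarg (both are assumptions about locus's surface):

from locus.agent import Agent
from locus.tools import tool  # assumed import path
from locus.providers.web_fetch import HTTPXWebFetcher

@tool
async def lookup_order(order_id: str) -> str:
    ...  # your internal API call

agent = Agent(
    model="openai:gpt-4o-mini",
    tools=[lookup_order],         # hand-written tool; the tools= kwarg is an assumption
    web_fetch=HTTPXWebFetcher(),  # auto-registers the web_fetch tool
)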

The protocols

Each provider is a one- or two-method typing.Protocol decorated with @runtime_checkable, so any duck-typed object that implements the methods is accepted. You don't need to subclass.

  • BaseWebSearchProvider: async search(query, max_results) → list[SearchResult].
  • BaseWebFetchProvider: async fetch(url, max_chars, keep_html) → WebPage.
  • BaseImageGenerationProvider: async generate(prompt, size, n) → list[ImageResult].
  • BaseSpeechProvider: capabilities: frozenset[str] plus async speak(text, voice) and/or async transcribe(audio_bytes, content_type).

The shared Pydantic types live in locus.providers.types (SearchResult, WebPage) and beside each protocol (ImageResult, SynthesizedAudio, SpeechTranscript).
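
Concretely, the search protocol is roughly this shape (condensed; the keyword-only max_results and its default are inferred from the table and the BingSearch example below):

from typing import Protocol, runtime_checkable
from locus.providers.types import SearchResult

@runtime_checkable
class BaseWebSearchProvider(Protocol):
    async def search(self, query: str, *, max_results: int = 5) -> list[SearchResult]: ...

An isinstance(obj, BaseWebSearchProvider) check then passes for any object with a search method. Note that runtime_checkable only verifies the method exists, not its signature, so the types above are documentation as much as enforcement.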

Built-in implementations

  • HTTPXWebFetcher — uses the httpx dep that's already in core, plus a stdlib HTMLParser shim that strips <script> / <style> and collapses whitespace. No beautifulsoup dep.
  • OpenAISearchPreviewProvider — wraps OpenAI's gpt-4o-search-preview chat-completions model. The model performs the retrieval itself and returns annotated results; the provider pins them through a strict JSON schema and returns a list of SearchResult.
  • OpenAIImageProvider — images.generate (dall-e-3 / gpt-image-1). Surfaces hosted URLs when the API returns them and base64 PNG bytes otherwise.
  • OpenAISpeechProvider — audio.speech.create (TTS, default tts-1) plus audio.transcriptions.create (Whisper, default whisper-1). Round-trips text → audio → text.

All four lazy-import openai / httpx so locus core stays free of optional dependencies until you actually wire one of these in.
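
You can also call a provider directly, outside any agent, which is handy for smoke tests. A minimal sketch — WebPage's exact fields aren't listed here, so we just print the model:

import asyncio
from locus.providers.web_fetch import HTTPXWebFetcher

async def main() -> None:
    # fetch() strips <script>/<style> and truncates to max_chars
    page = await HTTPXWebFetcher().fetch("https://example.com", max_chars=2000)
    print(page)  # a WebPage Pydantic model

asyncio.run(main())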

Bring your own

The protocols are the contract — implement them and you're in. A production user might wrap Bing for search, trafilatura for fetch, OCI Vision for image generation, or OCI Speech for STT/TTS. The agent glue stays identical: set the kwarg on AgentConfig, locus registers the tool.

from locus.providers.types import SearchResult

class BingSearch:
    async def search(self, query: str, *, max_results: int = 5) -> list[SearchResult]:
        ...  # call Bing, return list[SearchResult]

agent = Agent(
    model=...,
    web_search=BingSearch(),  # picked up via runtime_checkable Protocol
)
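
The speech protocol is the only two-capability one: capabilities controls whether speak, transcribe, or both get registered. A sketch of a transcribe-only provider, assuming the capability keys mirror the tool names:

from locus.providers.speech import SpeechTranscript  # assumed to live beside the protocol

class MySTTProvider:
    # Only "transcribe" is advertised, so no speak tool gets registered;
    # the capability key mirroring the tool name is an assumption.
    capabilities = frozenset({"transcribe"})

    async def transcribe(self, audio_bytes: bytes, content_type: str) -> SpeechTranscript:
        ...  # call your STT service and wrap the result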

What this is not

  • Not a replacement for @tool. Hand-written tools still call your internal APIs and DBs. The provider registry is for the small set of modalities almost every agent needs.
  • Not multi-modal model wiring. This is capability wiring — the model itself is still text-in / text-out. If you want a vision model reading screenshots, configure that on the model side.
  • Not a multi-modal output channel. speak returns a tool-string summary so the model isn't fed raw audio bytes; the actual audio lives on the provider and your application code retrieves it from there when it's time to emit on a voice channel.
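
When your application, rather than the model, decides what to say, it can bypass the tool entirely and call speak on the provider instance it already holds. A sketch — the voice name and the downstream function are illustrative, not part of locus:

import asyncio
from locus.providers.speech import OpenAISpeechProvider

speech_provider = OpenAISpeechProvider()

def play_on_voice_channel(audio) -> None:
    ...  # hypothetical: emit the synthesized bytes on your voice channel

async def announce(text: str) -> None:
    # speak() returns a SynthesizedAudio; "alloy" is an illustrative voice name
    audio = await speech_provider.speak(text, voice="alloy")
    play_on_voice_channel(audio)

asyncio.run(announce("Your order has shipped."))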

Source and tests

  • src/locus/providers/ — the four protocols, four implementations, and the auto_register() glue.
  • tests/unit/test_providers.py — runtime-checkable protocols, tool factories, AgentConfig wiring.
  • tests/integration/test_providers_live.py — live httpx fetch, live OpenAI search / image / speech (gated behind env vars).