Model Seats: One Model, One Job

Esoteric v0.2 uses ten specialist model seats. No single model does everything. Each seat has a specific purpose, latency target, and activation condition.

This is the opposite of "one LLM to rule them all." The renderer speaks. The code specialist reasons about code. The vision specialist sees. The verifier judges. Code orchestrates all of them.

Model seats: 10
Parameter range: 2b-14b
BG1 specialists: 4
Realtime seats: 6

The Specialist Seats

Renderer
Model: gemma4:e2b
Fallback: gemma4:e4b
Latency: <50ms
Purpose: Fast realtime responses, simple routing, event watching
Activation: Majority of turns

Vision (Lite)
Model: gemma4:e2b
Latency: <2s
Purpose: Screen capture, OCR, visual understanding for quick tasks
Activation: On-demand visual queries

Vision (BG1)
Model: qwen3-vl:8b
Latency: Background
Purpose: Heavy visual analysis, detailed image understanding
Activation: BG1 worker lane

Code Specialist (BG1)
Model: deepcoder:14b
Latency: Background
Purpose: Code generation, debugging, analysis, refactoring (Pass 1)
Activation: BG1 worker lane

Code Reviewer (BG1)
Model: rnj-1:8b
Latency: Background
Purpose: Reviews and profiles deepcoder output (Pass 2 of Sequential Relay)
Activation: BG1 worker lane

Logic Specialist (BG1)
Model: deepseek-r1:8b
Latency: Background
Purpose: Deep reasoning for math, programming logic, and research (300s timeout)
Activation: BG1 worker lane

Embedding
Model: nomic-embed-text-v2-moe
Latency: <200ms
Purpose: Semantic search, RAG retrieval, intent miss analysis
Activation: Memory enrichment

Verifier / Judge
Model: gemma4:e4b
Latency: <100ms
Purpose: Evidence checking, claim tagging via deterministic code (judge.py)
Activation: Pre-output verification

Speech-to-Text
Model: small.en
Latency: <200ms
Purpose: Local whisper inference for realtime voice ingress
Activation: Voice input lane

Text-to-Speech
Model: Kokoro-82M
Latency: <500ms
Purpose: High-quality neural TTS via ONNX runtime
Activation: Voice output lane
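
The seats listed above can be captured as a small registry that code can query. This is a minimal sketch, not the project's actual data model: the `Seat` class, the `lane` field, and `seats_in_lane` are hypothetical names, and the realtime/BG1 split is inferred from the seat descriptions and the 6/4 counts at the top of the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seat:
    """One model seat. Hypothetical structure, for illustration only."""
    name: str
    model: str
    lane: str                    # "realtime" or "bg1" (inferred grouping)
    latency_ms: Optional[int]    # None stands for "Background" (no realtime target)
    fallback: Optional[str] = None


# Values copied from the seat descriptions; 2s is written as 2000 ms.
SEATS = [
    Seat("Renderer", "gemma4:e2b", "realtime", 50, fallback="gemma4:e4b"),
    Seat("Vision (Lite)", "gemma4:e2b", "realtime", 2000),
    Seat("Vision (BG1)", "qwen3-vl:8b", "bg1", None),
    Seat("Code Specialist", "deepcoder:14b", "bg1", None),
    Seat("Code Reviewer", "rnj-1:8b", "bg1", None),
    Seat("Logic Specialist", "deepseek-r1:8b", "bg1", None),
    Seat("Embedding", "nomic-embed-text-v2-moe", "realtime", 200),
    Seat("Verifier / Judge", "gemma4:e4b", "realtime", 100),
    Seat("Speech-to-Text", "small.en", "realtime", 200),
    Seat("Text-to-Speech", "Kokoro-82M", "realtime", 500),
]


def seats_in_lane(lane: str) -> list[Seat]:
    """Return every seat assigned to the given lane."""
    return [seat for seat in SEATS if seat.lane == lane]
```

Under these assumptions, `seats_in_lane("bg1")` returns the four background specialists and `seats_in_lane("realtime")` returns the remaining six seats.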

Seat Allocation Logic

Realtime Lane Priority: Renderer handles 70-85% of turns without calling other seats.
BG1 Sequential Relay: Code generation (14b) followed by automated review (8b) for maximum reliability.
VRAM Keepalive Strategy: 2h keepalive for renderer/embedding; 0s for BG1 specialists to free VRAM immediately.
Fallback Chain: Renderer preferred (e2b) → fallback (e4b) → deterministic text if all models unavailable.
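
The fallback chain and keepalive values above might look roughly like this in code. It is a sketch under stated assumptions: `generate` is a caller-supplied function standing in for whatever runtime serves the models, and `render_with_fallback`, `DETERMINISTIC_TEXT`, and `KEEPALIVE_SECONDS` are hypothetical names rather than part of the project.

```python
from typing import Callable, Sequence, Tuple

# Preferred renderer first, then its fallback, as described above.
RENDERER_CHAIN: Tuple[str, ...] = ("gemma4:e2b", "gemma4:e4b")

# Hypothetical canned reply used when no model can answer.
DETERMINISTIC_TEXT = "Models are unavailable right now. Please try again shortly."

# Keepalive per seat group, in seconds: 2h for renderer/embedding, 0s for BG1.
KEEPALIVE_SECONDS = {"renderer": 2 * 60 * 60, "embedding": 2 * 60 * 60, "bg1": 0}


def render_with_fallback(
    prompt: str,
    generate: Callable[[str, str], str],
    chain: Sequence[str] = RENDERER_CHAIN,
) -> Tuple[str, str]:
    """Try each model in order; return (source, text).

    `generate(model, prompt)` is assumed to raise an exception when the
    model is unavailable. If every model fails, fall back to fixed text.
    """
    for model in chain:
        try:
            return model, generate(model, prompt)
        except Exception:
            continue  # this model failed, try the next one in the chain
    return "deterministic", DETERMINISTIC_TEXT
```

Returning the source alongside the text lets the caller log which link in the chain actually answered.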

Latency Targets by Seat

Renderer: <50ms
Vision (Lite): <2s
Vision (BG1): Background
Code Specialist: Background
Code Reviewer: Background
Logic Specialist: Background
Embedding: <200ms
Verifier / Judge: <100ms
Speech-to-Text: <200ms
Text-to-Speech: <500ms

The Renderer targets sub-50ms responses so replies feel instant. BG1 specialists run asynchronously and report progress at 15%, 60%, and 95%.
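
One way to picture the asynchronous progress reporting is a background job split into stages, with a progress mark emitted after each one. The page does not say what each milestone corresponds to, so the three-stage split, `run_bg1_job`, and `on_progress` are assumptions made for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence

# Progress milestones quoted in the text.
PROGRESS_MARKS = (15, 60, 95)

Stage = Callable[[Any], Awaitable[Any]]


async def run_bg1_job(
    stages: Sequence[Stage],
    on_progress: Callable[[int], None],
) -> Any:
    """Run stages in order, feeding each result into the next stage.

    Emits one progress mark after each stage completes (hypothetical
    mapping of marks to stages).
    """
    if len(stages) != len(PROGRESS_MARKS):
        raise ValueError("expected one stage per progress mark")
    result: Any = None
    for mark, stage in zip(PROGRESS_MARKS, stages):
        result = await stage(result)
        on_progress(mark)
    return result
```

Because the job is a coroutine, a realtime loop can schedule it with `asyncio.create_task` and keep answering turns while the specialist works.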

Model Seat Orchestration

Realtime Lane: Renderer (e2b/e4b), Embedding, Verifier
BG1 Worker: Code (14b) → Review (8b), Vision (8b), Logic (8b)

The 5-stage dispatcher routes quick tasks (70-85% of turns) to the realtime lane. Heavy tasks enter the BG1 queue and activate specialist seats on demand, including the sequential code relay and the deep-reasoning Logic Specialist.
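
The lane decision and the two-pass code relay could be sketched as follows. The five dispatcher stages are not described on this page, so only the final routing choice is shown. The task labels in `BG1_KINDS`, the `pick_lane` and `sequential_relay` functions, and the review prompt wording are all hypothetical, and `generate` stands in for whatever runtime serves the models.

```python
from typing import Callable, Dict

Generate = Callable[[str, str], str]

# Hypothetical task labels that count as "heavy" and go to the BG1 queue.
BG1_KINDS = {"code", "heavy_vision", "logic"}


def pick_lane(task_kind: str) -> str:
    """Send heavy task kinds to BG1 and everything else to the realtime lane."""
    return "bg1" if task_kind in BG1_KINDS else "realtime"


def sequential_relay(task: str, generate: Generate) -> Dict[str, str]:
    """Pass 1: the code specialist drafts. Pass 2: the reviewer checks the draft."""
    draft = generate("deepcoder:14b", task)
    review = generate("rnj-1:8b", "Review and profile this code:\n" + draft)
    return {"draft": draft, "review": review}
```

Keeping the relay as plain function calls reflects the page's point that code, not a model, decides which seat runs next.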

The question isn't whether this works.

Herald is the pattern. Skeptic is the operating system. 10 model seats. Zero cloud dependency.