Herald / Skeptic -- From Constrained LLM Renderer to Local Assistant Operating System

Multi-Model Architecture

Model Seats: One Model, One Job

Esoteric v0.2 uses ten specialist model seats. No single model does everything. Each seat has a specific purpose, latency target, and activation condition.

This is the opposite of "one LLM to rule them all." The renderer speaks. The code specialist reasons about code. The vision specialist sees. The verifier judges. Code orchestrates all of them.

Model seats: 10

Parameter range: 2b-14b

BG1 specialists: 4

Realtime seats: 6

The Specialist Seats

Renderer
Model: gemma4:e2b
Fallback: gemma4:e4b
Latency: <50ms
Purpose: Fast realtime responses, simple routing, event watching
Activation: Majority of turns

Vision (Lite)
Model: gemma4:e2b
Latency: <2s
Purpose: Screen capture, OCR, visual understanding for quick tasks
Activation: On-demand visual queries

Vision (BG1)
Model: qwen3-vl:8b
Latency: Background
Purpose: Heavy visual analysis, detailed image understanding
Activation: BG1 worker lane

Code Specialist (BG1)
Model: deepcoder:14b
Latency: Background
Purpose: Code generation, debugging, analysis, refactoring (Pass 1)
Activation: BG1 worker lane

Code Reviewer (BG1)
Model: rnj-1:8b
Latency: Background
Purpose: Reviews and profiles deepcoder output (Pass 2 of Sequential Relay)
Activation: BG1 worker lane

Logic Specialist (BG1)
Model: deepseek-r1:8b
Latency: Background
Purpose: Deep reasoning for math, programming logic, and research (300s timeout)
Activation: BG1 worker lane

Embedding
Model: nomic-embed-text-v2-moe
Latency: <200ms
Purpose: Semantic search, RAG retrieval, intent miss analysis
Activation: Memory enrichment

Verifier / Judge
Model: gemma4:e4b
Latency: <100ms
Purpose: Evidence checking, claim tagging via deterministic code (judge.py)
Activation: Pre-output verification

Speech-to-Text
Model: small.en
Latency: <200ms
Purpose: Local Whisper inference for realtime voice ingress
Activation: Voice input lane

Text-to-Speech
Model: Kokoro-82M
Latency: <500ms
Purpose: High-quality neural TTS via ONNX runtime
Activation: Voice output lane
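
The seat list above maps naturally onto a small registry that code can query. The following is a minimal sketch, not the project's actual data model: the `Seat` class, the `SEATS` list, the snake_case seat names, and the `seats_in_lane` helper are all hypothetical. Treating every seat without a BG1 tag as "realtime" is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Seat:
    """One model seat: one model, one job (hypothetical structure)."""
    name: str
    model: str
    lane: str                      # "realtime" or "bg1" (lane for untagged seats is assumed)
    latency_ms: Optional[int]      # None stands for "Background" (no realtime target)
    fallback: Optional[str] = None


# Values copied from the seat list; names are illustrative.
SEATS: List[Seat] = [
    Seat("renderer", "gemma4:e2b", "realtime", 50, fallback="gemma4:e4b"),
    Seat("vision_lite", "gemma4:e2b", "realtime", 2000),
    Seat("vision_bg1", "qwen3-vl:8b", "bg1", None),
    Seat("code_specialist", "deepcoder:14b", "bg1", None),
    Seat("code_reviewer", "rnj-1:8b", "bg1", None),
    Seat("logic_specialist", "deepseek-r1:8b", "bg1", None),
    Seat("embedding", "nomic-embed-text-v2-moe", "realtime", 200),
    Seat("verifier", "gemma4:e4b", "realtime", 100),
    Seat("speech_to_text", "small.en", "realtime", 200),
    Seat("text_to_speech", "Kokoro-82M", "realtime", 500),
]


def seats_in_lane(lane: str) -> List[str]:
    """Return the names of all seats assigned to the given lane."""
    return [seat.name for seat in SEATS if seat.lane == lane]
```

Calling `seats_in_lane("bg1")` returns the four background specialists, which is the set the BG1 worker lane would activate on demand.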

Seat Allocation Logic

Realtime Lane Priority

Renderer handles 70-85% of turns without calling other seats

BG1 Sequential Relay

Code generation (14b) followed by automated review (8b) for maximum reliability
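
The two-pass relay can be sketched as a function that calls the code model first and hands its output to the reviewer. This is an illustration under assumptions: `sequential_relay`, the injected `generate(model, prompt)` callable, the prompt wording, and `fake_generate` are hypothetical, and the real relay may pass more context between passes.

```python
from typing import Callable, Dict


def sequential_relay(
    task: str,
    generate: Callable[[str, str], str],
    code_model: str = "deepcoder:14b",
    review_model: str = "rnj-1:8b",
) -> Dict[str, str]:
    """Run Pass 1 (generation) and Pass 2 (review) strictly in order."""
    # Pass 1: the code specialist produces a draft.
    draft = generate(code_model, f"Write code for this task:\n{task}")
    # Pass 2: the reviewer only ever sees the finished draft, never the raw task alone.
    review = generate(review_model, f"Review and profile this code:\n{draft}")
    return {"draft": draft, "review": review}


def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for a real model call; echoes the model and the first prompt line."""
    return f"[{model}] {prompt.splitlines()[0]}"
```

Injecting `generate` keeps the relay logic testable without a model runtime; a real deployment would pass a function that calls the local inference server.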

VRAM Keepalive Strategy

2h keepalive for renderer/embedding; 0s for BG1 specialists to free VRAM immediately
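
One way to express this policy is as a per-seat lookup applied when building each request. The sketch below assumes the models are served by a runtime that accepts an Ollama-style `keep_alive` field (a duration string such as "2h", or 0 to unload right after the response). The source does not name the runtime, and `keep_alive_for`, `build_request`, and the seat names are hypothetical.

```python
from typing import Any, Dict, Optional, Union

# Seats that stay resident in VRAM between turns (from the strategy above).
RESIDENT_KEEP_ALIVE = {"renderer": "2h", "embedding": "2h"}

# Seats that should release VRAM as soon as a job finishes.
BG1_SEATS = {"vision_bg1", "code_specialist", "code_reviewer", "logic_specialist"}


def keep_alive_for(seat: str) -> Optional[Union[str, int]]:
    """Return the keepalive for a seat, or None to use the runtime default."""
    if seat in BG1_SEATS:
        return 0  # unload immediately after the response
    return RESIDENT_KEEP_ALIVE.get(seat)


def build_request(seat: str, model: str, prompt: str) -> Dict[str, Any]:
    """Build a request payload (assumed Ollama-style shape) with the seat's keepalive."""
    payload: Dict[str, Any] = {"model": model, "prompt": prompt}
    keep_alive = keep_alive_for(seat)
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return payload
```

Seats not covered by the stated strategy, such as the verifier, are left on the runtime default here rather than guessed.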

Fallback Chain

Renderer preferred (e2b) → fallback (e4b) → deterministic text if all models unavailable
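
The chain can be sketched as a loop over candidate models that ends in a fixed string. This is a minimal sketch under assumptions: the `ModelUnavailable` exception, the `generate` callable, and the wording of the deterministic text are hypothetical and stand in for whatever the real runtime raises and returns.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by the (hypothetical) generate callable when a model cannot serve."""


def render_with_fallback(
    prompt: str,
    generate: Callable[[str, str], str],
    models: Sequence[str] = ("gemma4:e2b", "gemma4:e4b"),
    deterministic_text: str = "The assistant is temporarily unavailable.",
) -> str:
    """Try each renderer model in order; fall back to fixed text if all fail."""
    for model in models:
        try:
            return generate(model, prompt)
        except ModelUnavailable:
            continue  # move on to the next model in the chain
    return deterministic_text


def only_e4b(model: str, prompt: str) -> str:
    """Test double: the preferred model is down, the fallback answers."""
    if model != "gemma4:e4b":
        raise ModelUnavailable(model)
    return f"{model}: {prompt}"


def nothing_up(model: str, prompt: str) -> str:
    """Test double: every model is down."""
    raise ModelUnavailable(model)
```

With `only_e4b`, the call resolves on the second model; with `nothing_up`, the caller still gets a usable string, so the turn never fails outright.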

Latency Targets by Seat

Renderer: <50ms
Vision (Lite): <2s
Vision (BG1): Background
Code Specialist: Background
Code Reviewer: Background
Logic Specialist: Background
Embedding: <200ms
Verifier / Judge: <100ms
Speech-to-Text: <200ms
Text-to-Speech: <500ms

The renderer targets sub-50ms responses so replies feel instant. BG1 specialists run asynchronously and report progress at 15%, 60%, and 95%.
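
The milestone reporting can be sketched as a check for which thresholds a job has just crossed. This is an assumption about the mechanism, since the source only lists the percentages: `PROGRESS_MILESTONES` and `milestones_crossed` are hypothetical names.

```python
from typing import List

# Progress points at which a BG1 job reports back (percentages from the text).
PROGRESS_MILESTONES = (15, 60, 95)


def milestones_crossed(previous: float, current: float) -> List[int]:
    """Return milestones passed when progress moves from previous to current."""
    return [m for m in PROGRESS_MILESTONES if previous < m <= current]
```

A worker would call this after each progress change and emit one update per returned milestone, so a job that jumps from 20% to 96% still reports both 60% and 95%.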

Model Seat Orchestration

Realtime Lane: Renderer (e2b/e4b), Embedding, Verifier

BG1 Worker: Code (14b) → Review (8b), Vision (8b), Logic (8b)

The 5-stage dispatcher routes quick tasks (70-85% of turns) to the realtime lane. Heavy tasks enter the BG1 queue and activate specialist seats on demand, including the sequential code relay and the deep-reasoning logic specialist.
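
The final lane decision can be sketched as a lookup: task kinds with a BG1 specialist are queued, and everything else stays with the renderer. The source does not describe the dispatcher's five stages, so this covers only the last routing step. The task kinds, the `BG1_SEAT_FOR_KIND` table, and `dispatch` are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple

# Hypothetical mapping from a classified task kind to its BG1 specialist seat.
BG1_SEAT_FOR_KIND = {
    "code": "code_specialist",      # starts the sequential code relay
    "vision_heavy": "vision_bg1",
    "logic": "logic_specialist",
}


def dispatch(kind: str, payload: str, bg1_queue: Deque[Tuple[str, str]]) -> Tuple[str, str]:
    """Route one task: returns (lane, seat) and enqueues heavy work for BG1."""
    seat = BG1_SEAT_FOR_KIND.get(kind)
    if seat is None:
        # Quick task: handled directly in the realtime lane by the renderer.
        return ("realtime", "renderer")
    # Heavy task: queued so the specialist seat loads only when needed.
    bg1_queue.append((seat, payload))
    return ("bg1", seat)
```

Passing the queue in explicitly keeps the function free of global state; a caller would create one `deque()` per BG1 worker and drain it in the background.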

The question isn't whether this works.

Herald is the pattern. Skeptic is the operating system. 10 model seats. Zero cloud dependency.
