The numbers speak for themselves.

Herald doesn't optimize LLM calls. It eliminates them. 46k+ lines of deterministic code across 171 modules, ten specialist model seats, and a 5-stage cascade ensure that common queries never touch a model. The speedup isn't incremental—it's orders of magnitude.
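To make the cascade idea concrete, here is a compressed, hypothetical sketch of how a deterministic resolution cascade can answer common queries without ever touching a model. The stage names, handlers, and canned responses are all invented for illustration and are not Herald's actual code or API (the real system has five stages; this sketch shows three):

```python
# Hypothetical cascade sketch: each stage is plain deterministic code.
# The model-inference stage is only reached if every earlier stage declines.
from datetime import datetime

def stage_exact_match(query):
    # Stage 1: dictionary lookup for known phrasings (illustrative entries).
    table = {"status": "All systems nominal.", "hello": "Hello!"}
    return table.get(query.strip().lower())

def stage_tool_call(query):
    # Stage 2: deterministic tool dispatch, no model involved.
    if "time" in query.lower():
        return datetime.now().strftime("It is %H:%M.")
    return None

def stage_llm_fallback(query):
    # Final stage: heavy tasks that genuinely need inference (stubbed here).
    return f"[model inference for: {query}]"

CASCADE = [stage_exact_match, stage_tool_call, stage_llm_fallback]

def resolve(query):
    for stage in CASCADE:
        answer = stage(query)
        if answer is not None:
            return answer
```

In a layout like this, "What time is it" is answered by stage 2 in microseconds, and only open-ended requests fall through to the inference stub at the end.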

70-85% — Zero-inference turns (target)
Design target for interactions handled entirely by deterministic code. The LLM is never called on these paths.

Target response latency (ms)
Design target for exact-match paths. Dictionary lookups and tool calls, not model inference.

10 — Specialist model seats
Eight distinct models across ten seats. Sequential relay manages VRAM so all seats share a single consumer GPU.

100% — Decision auditability
Every routing decision has a traceable code path across 171 modules and 46k+ lines of deterministic code.

0 — Hallucinated tool calls
The LLM never selects tools. Code selects tools. Wrong tool calls are structurally impossible.
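The "code selects tools" guarantee can be sketched as a closed registry with lookup-based dispatch: if a tool name is not a key in the registry, it simply cannot run. The registry contents and function names below are hypothetical, not Herald's actual tool set:

```python
# Illustrative sketch: tools live in a closed registry, and dispatch is a
# deterministic key lookup. A name outside the registry is rejected before
# anything executes, so a hallucinated tool call is structurally impossible.
TOOLS = {
    "get_time": lambda: "12:00",       # stand-in tool implementations
    "get_user_name": lambda: "Ada",
}

def dispatch(tool_name):
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name!r}")
    return TOOLS[tool_name]()
```

Because the model's output is never used as the dispatch key, there is no path by which a made-up tool name reaches execution.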

Response Latency: Herald vs Traditional

| Query | Traditional (LLM-as-Brain) | Herald (LLM-as-Renderer) | Speedup |
| --- | --- | --- | --- |
| "What time is it" | 1-3s | <30ms | ~100x |
| "What's my name" | 1-3s | <15ms | ~100x |
| "Status" | 1-2s | <5ms | ~200x |
| "Hello" | 0.5-1s | <5ms | ~100x |
| "Review this code" | 5-30s | 5-30s (BG1) | 1x |
| "Research quantum computing" | 5-30s | 5-30s (BG1) | 1x |
Heavy tasks (research, code review, vision) route to BG1 specialist seats and take comparable time in both architectures because they require actual model inference. Herald's advantage is on the 70-85% of interactions that the 5-stage cascade resolves without calling a model at all.
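The sequential-relay idea behind the specialist seats can be sketched as well: at most one seat's model is resident in VRAM at a time, and each request loads its seat, runs, and frees the memory before a different seat loads. Class and seat names here are invented placeholders, assuming a simple load/unload protocol rather than Herald's real implementation:

```python
# Hedged sketch of a sequential relay for model seats sharing one GPU.
# Only one seat is "resident" at a time; switching seats unloads the
# previous one first. Loading/unloading are stand-ins for moving model
# weights into and out of VRAM.
class SequentialRelay:
    def __init__(self):
        self.loaded = None  # at most one resident seat

    def run(self, seat, prompt):
        if self.loaded != seat:
            self._unload()
            self._load(seat)
        return f"[{seat} answers: {prompt}]"

    def _load(self, seat):
        self.loaded = seat   # stand-in for allocating VRAM for this seat

    def _unload(self):
        self.loaded = None   # stand-in for freeing VRAM
```

The trade-off is seat-switch latency in exchange for fitting many specialist models on a single consumer GPU, which is acceptable when heavy tasks already take seconds of inference time.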

The question isn't whether this works.

See how Herald stacks up against Claude CLI and ChatGPT.