The numbers speak for themselves.

Herald doesn't optimize LLM calls. It eliminates them. 46k+ lines of deterministic code across 171 modules, ten specialist model seats, and a 5-stage cascade ensure that common queries never touch a model. The speedup isn't incremental—it's orders of magnitude.
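To make the cascade idea concrete, here is a compressed, hypothetical sketch of how a deterministic resolution cascade can answer common queries without ever touching a model. The stage names, handlers, and canned responses are all invented for illustration and are not Herald's actual code or API (the real system has five stages; this sketch shows three):

```python
# Hypothetical cascade sketch: each stage is plain deterministic code.
# The model-inference stage is only reached if every earlier stage declines.
from datetime import datetime

def stage_exact_match(query):
    # Stage 1: dictionary lookup for known phrasings (illustrative entries).
    table = {"status": "All systems nominal.", "hello": "Hello!"}
    return table.get(query.strip().lower())

def stage_tool_call(query):
    # Stage 2: deterministic tool dispatch, no model involved.
    if "time" in query.lower():
        return datetime.now().strftime("It is %H:%M.")
    return None

def stage_llm_fallback(query):
    # Final stage: heavy tasks that genuinely need inference (stubbed here).
    return f"[model inference for: {query}]"

CASCADE = [stage_exact_match, stage_tool_call, stage_llm_fallback]

def resolve(query):
    for stage in CASCADE:
        answer = stage(query)
        if answer is not None:
            return answer
```

In a layout like this, "What time is it" is answered by stage 2 in microseconds, and only open-ended requests fall through to the inference stub at the end.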

70-85% — Zero-inference turns (target)
Design target for interactions handled entirely by deterministic code. The LLM is never called on these paths.

Target response latency (ms)
Design target for exact-match paths. Dictionary lookups and tool calls, not model inference.

10 — Specialist model seats
Eight distinct models across ten seats. Sequential relay manages VRAM so all seats share a single consumer GPU.

100% — Decision auditability
Every routing decision has a traceable code path across 171 modules and 46k+ lines of deterministic code.

0 — Hallucinated tool calls
The LLM never selects tools. Code selects tools. Wrong tool calls are structurally impossible.
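The "code selects tools" guarantee can be sketched as a closed registry with lookup-based dispatch: if a tool name is not a key in the registry, it simply cannot run. The registry contents and function names below are hypothetical, not Herald's actual tool set:

```python
# Illustrative sketch: tools live in a closed registry, and dispatch is a
# deterministic key lookup. A name outside the registry is rejected before
# anything executes, so a hallucinated tool call is structurally impossible.
TOOLS = {
    "get_time": lambda: "12:00",       # stand-in tool implementations
    "get_user_name": lambda: "Ada",
}

def dispatch(tool_name):
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name!r}")
    return TOOLS[tool_name]()
```

Because the model's output is never used as the dispatch key, there is no path by which a made-up tool name reaches execution.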

Response Latency: Herald vs Traditional

| Query | Traditional (LLM-as-Brain) | Herald (LLM-as-Renderer) | Speedup |
| --- | --- | --- | --- |
| "What time is it" | 1-3s | <30ms | ~100x |
| "What's my name" | 1-3s | <15ms | ~100x |
| "Status" | 1-2s | <5ms | ~200x |
| "Hello" | 0.5-1s | <5ms | ~100x |
| "Review this code" | 5-30s | 5-30s (BG1) | 1x |
| "Research quantum computing" | 5-30s | 5-30s (BG1) | 1x |
Heavy tasks (research, code review, vision) route to BG1 specialist seats and take comparable time in both architectures because they require actual model inference. Herald's advantage is on the 70-85% of interactions that the 5-stage cascade resolves without calling a model at all.
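The sequential-relay idea behind the specialist seats can be sketched as well: at most one seat's model is resident in VRAM at a time, and each request loads its seat, runs, and frees the memory before a different seat loads. Class and seat names here are invented placeholders, assuming a simple load/unload protocol rather than Herald's real implementation:

```python
# Hedged sketch of a sequential relay for model seats sharing one GPU.
# Only one seat is "resident" at a time; switching seats unloads the
# previous one first. Loading/unloading are stand-ins for moving model
# weights into and out of VRAM.
class SequentialRelay:
    def __init__(self):
        self.loaded = None  # at most one resident seat

    def run(self, seat, prompt):
        if self.loaded != seat:
            self._unload()
            self._load(seat)
        return f"[{seat} answers: {prompt}]"

    def _load(self, seat):
        self.loaded = seat   # stand-in for allocating VRAM for this seat

    def _unload(self):
        self.loaded = None   # stand-in for freeing VRAM
```

The trade-off is seat-switch latency in exchange for fitting many specialist models on a single consumer GPU, which is acceptable when heavy tasks already take seconds of inference time.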

The question isn't whether this works.

See how Herald stacks up against Claude CLI and ChatGPT.