What we measure, in the open.
Token optimization is only worth doing if it preserves quality. We publish the benchmarks, the methods, and the failure cases so you can check the claims yourself.
Benchmark · May 2026
Quality parity under aggressive context trimming
Across 1,200 coding tasks, trimmed prompts held a 99/100 parity score against full-context baselines.
Method · Apr 2026
When routing to a lighter model is safe
A classifier for the cases where a smaller model matches the frontier one, and where it must not.
Report · Mar 2026
Where the tokens actually go in a coding agent
A breakdown of a million-token bug hunt: re-reads, repeated history, and heavy-model overuse.
Methods and datasets are released alongside each post.