The things we build to answer our own questions: experiments, research, and tools, shared in the open. We start with model benchmarks, run on the same work we ship for clients. They currently cover 28 models across 87 tasks, with gpt-5.5-pro on top at a 94.1% composite score.
More experiments land here as we build them.