Banja Lab / Benchmarks / Test
The same task, run on 27 models.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest: claude-opus-4-8 at 0.0%.
Build a single self-contained HTML file (`index.html`) that renders a segmented pill navigation, with no build step and no network calls (inline all CSS, no external fonts or scripts). The grader measures the rendered geometry and computed style, so match the spec exactly.

Bar `<nav id="nav">`:
- `display: flex`, 520px wide, 44px tall, flex `gap` exactly 12px, padding 6px, background #f1f5f9, border-radius 22px.
- Four pill buttons with IDs `p1`..`p4`, each 96px wide and 32px tall, border-radius 16px, font-size 14px.
- The first pill (`p1`) is the active one: background #6d28d9, colour #ffffff.
- The other three are inactive: transparent background, colour #475569.
- Pill labels: Overview, Activity, Reports, Settings.

The 96px pill width, the 12px gap, and the #6d28d9 active-pill fill are the contract.
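One possible reading of the spec above, given as a minimal sketch and not as the reference solution. It makes several assumptions the task text does not state: `box-sizing: border-box` is applied so that 520×44 and 96×32 are outer (border-box) sizes, the browser's default button border, padding, and margin are reset so they cannot change the measured size, and the `active` class name is a hypothetical choice. How the grader actually measures geometry is not described in the source.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Segmented pill navigation</title>
<style>
  /* Assumption: border-box sizing, so declared widths/heights are outer sizes. */
  *, *::before, *::after { box-sizing: border-box; }
  body { margin: 0; font-family: system-ui, sans-serif; } /* local fonts only */

  #nav {
    display: flex;
    gap: 12px;              /* contract: exactly 12px between pills */
    width: 520px;
    height: 44px;           /* 6px padding + 32px pill + 6px padding */
    padding: 6px;
    background: #f1f5f9;
    border-radius: 22px;
  }

  #nav button {
    flex: none;             /* never grow or shrink away from 96px */
    width: 96px;            /* contract: 96px pill width */
    height: 32px;
    margin: 0;              /* reset user-agent button defaults */
    padding: 0;
    border: 0;
    border-radius: 16px;
    font: inherit;
    font-size: 14px;
    background: transparent; /* inactive pills */
    color: #475569;
  }

  /* "active" is a hypothetical class name; the spec only fixes the id p1. */
  #nav button.active {
    background: #6d28d9;    /* contract: active-pill fill */
    color: #ffffff;
  }
</style>
</head>
<body>
<nav id="nav">
  <button id="p1" class="active">Overview</button>
  <button id="p2">Activity</button>
  <button id="p3">Reports</button>
  <button id="p4">Settings</button>
</nav>
</body>
</html>
```

Under these assumptions the pills and gaps occupy 4 × 96 + 3 × 12 = 420px of the 508px content box (520px minus 2 × 6px padding), so 88px remains free to the right of `p4` with the default `justify-content`. The spec does not say how that slack should be distributed, so the sketch leaves it untouched.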