The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest: grok-4.3 at 0.0%.
Build a single self-contained HTML file (`index.html`) that renders a horizontal toolbar, with no build step and no network calls (inline all CSS, no external fonts or scripts). The grader measures the rendered flex gap and the button box, so the numbers must be exact, not close.

Bar:

- A `<nav id="bar">` that is `display: flex`, 600px wide, 56px tall, with a flex `gap` of exactly 16px, vertically centred items, background #f1f5f9, radius 10px.
- Three buttons with IDs `b1`, `b2`, `b3`, each exactly 120px wide and 36px tall, border-radius 8px, font-size 15px, background #6d28d9, colour #ffffff.
- Put a flexible spacer between the second and third button so the third button sits at the right end.

The 16px gap, the 120px button width, and the 15px font-size are the contract.
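
For orientation, here is a minimal sketch of one `index.html` that would plausibly satisfy the task as worded. It is not any model's actual output and not a reference solution from the benchmark. Several details are assumptions the task does not state: the `spacer` class name, the button labels, the zero padding on the bar, the `border-box` sizing and removed button borders (so the measured box is exactly 120px by 36px), and the generic `sans-serif` font. It also assumes the grader reads the computed `gap` of `#bar` rather than the visual distance between `b2` and `b3`, which is larger because the spacer sits between them.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Toolbar</title>
<style>
  /* The bar: a 600x56 flex container with a 16px gap and centred items. */
  #bar {
    box-sizing: border-box; /* assumption: keep the outer box at 600x56 */
    display: flex;
    align-items: center;    /* vertical centring of the items */
    gap: 16px;
    width: 600px;
    height: 56px;
    margin: 0;
    padding: 0;             /* assumption: the task specifies no padding */
    background: #f1f5f9;
    border-radius: 10px;
  }

  /* The buttons: fixed 120x36 boxes that neither grow nor shrink. */
  #bar button {
    box-sizing: border-box; /* assumption: width/height describe the full box */
    flex: 0 0 120px;
    width: 120px;
    height: 36px;
    margin: 0;
    padding: 0;
    border: 0;              /* assumption: no default border around the box */
    border-radius: 8px;
    font-family: sans-serif; /* assumption: any local font, nothing external */
    font-size: 15px;
    background: #6d28d9;
    color: #ffffff;
  }

  /* Hypothetical spacer: absorbs the free space and pushes b3 to the right. */
  #bar .spacer {
    flex: 1 1 auto;
  }
</style>
</head>
<body>
<nav id="bar">
  <button id="b1" type="button">One</button>
  <button id="b2" type="button">Two</button>
  <span class="spacer"></span>
  <button id="b3" type="button">Three</button>
</nav>
</body>
</html>
```

Under these assumptions the bar holds four flex items and therefore three 16px gaps, so the spacer resolves to 600 - 3 × 120 - 3 × 16 = 192px and the right edge of `b3` lands flush with the right edge of the bar. Using `margin-left: auto` on `b3` would produce a similar layout without an extra element, but the task asks for a flexible spacer, so the sketch uses one.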