The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest: deepseek-v4-flash, also at 100.0%.
Build a single self-contained HTML file (`index.html`) that renders one stat card with no build step and no network calls (inline all CSS, no external fonts or scripts). Match this frozen spec EXACTLY. The grader measures the rendered getBoundingClientRect and getComputedStyle, so eyeballing it is not enough.

Layout (IDs are required so the card and its parts can be measured):
- A `<section id="card">` exactly 320px wide and 180px tall.
- Inside the card: a value element `id="value"` and a label element `id="label"`.

Token sheet (apply precisely):
- card background colour: #6d28d9
- card border-radius: 16px
- card padding: 24px
- value font-size: 40px, colour #ffffff
- label font-size: 14px, colour #c4b5fd
- value text: 12,480
- label text: Active users this week

Use plain, readable markup. The numbers are a contract, not a suggestion.
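For orientation, here is a minimal sketch of one `index.html` that would plausibly satisfy the spec above. It is not any model's graded output, and several choices are assumptions rather than part of the spec: `box-sizing: border-box` (so the 24px padding stays inside the 320px by 180px box that `getBoundingClientRect` reports), the `<div>` elements used for the value and label, the zeroed body margin, and the generic `sans-serif` font stack.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Stat card</title>
<style>
  /* Assumption: the spec does not mention page margins; zeroing them is optional. */
  body { margin: 0; }

  #card {
    /* Assumption: border-box keeps padding inside the fixed 320x180 box.
       With the default content-box, the measured size would be 368x228. */
    box-sizing: border-box;
    width: 320px;
    height: 180px;
    padding: 24px;
    border-radius: 16px;
    background-color: #6d28d9;
    /* Assumption: any locally available font works; no external fonts are loaded. */
    font-family: sans-serif;
  }

  #value { font-size: 40px; color: #ffffff; }
  #label { font-size: 14px; color: #c4b5fd; }
</style>
</head>
<body>
  <section id="card">
    <div id="value">12,480</div>
    <div id="label">Active users this week</div>
  </section>
</body>
</html>
```

Note that `getComputedStyle` reports colours in `rgb()` form, so `#6d28d9` would read back as `rgb(109, 40, 217)`, `#c4b5fd` as `rgb(196, 181, 253)`, and `#ffffff` as `rgb(255, 255, 255)`.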