The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest: deepseek-v4-flash at 0.0%.
Build a single self-contained HTML file (`index.html`) that renders three overlapping square layers in a fixed stacking order, with no build step and no network calls (inline all CSS, no external fonts or scripts). The grader measures the rendered geometry and computed z-index, so the stacking order and offsets must be exact.

Container:
- A `<section id="stage">` that is `position: relative`, 400px wide, 300px tall.

Three layers inside it, each `position: absolute`, 160px wide, 160px tall, border-radius 12px. IDs are required so each layer can be measured:
- `id="base"` at left 0px, top 0px, z-index 1, background #1e293b
- `id="mid"` at left 60px, top 60px, z-index 5, background #6d28d9
- `id="front"` at left 120px, top 120px, z-index 9, background #f59e0b

The numeric z-index values and the px offsets are the contract. Do not eyeball them.
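For illustration, one file that appears to satisfy the contract above could look like the sketch below. It is not the grader's reference solution. The `<div>` elements, the `#stage > div` selector, the body margin reset, and the page title are assumptions, since the task fixes only the IDs, positioning, sizes, offsets, z-index values, and colors.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Stacked layers</title>
<style>
  /* Assumption: zero the body margin so the stage starts at the page origin.
     The task does not ask for this. */
  body { margin: 0; }

  /* Container: establishes the positioning context for the layers. */
  #stage {
    position: relative;
    width: 400px;
    height: 300px;
  }

  /* Geometry shared by all three layers (no border or padding,
     so the rendered box is exactly 160px by 160px). */
  #stage > div {
    position: absolute;
    width: 160px;
    height: 160px;
    border-radius: 12px;
  }

  /* Per-layer offsets, stacking order, and colors, copied from the task. */
  #base  { left: 0px;   top: 0px;   z-index: 1; background: #1e293b; }
  #mid   { left: 60px;  top: 60px;  z-index: 5; background: #6d28d9; }
  #front { left: 120px; top: 120px; z-index: 9; background: #f59e0b; }
</style>
</head>
<body>
  <section id="stage">
    <div id="base"></div>
    <div id="mid"></div>
    <div id="front"></div>
  </section>
</body>
</html>
```

Because each layer is absolutely positioned, its explicit `z-index` takes effect and should be reported as the computed value. The offsets are written out in px rather than derived from margins or transforms, so the measured geometry comes straight from the declared values.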