Banja Lab / Benchmarks / Test
The same task, run on 27 models.
Top result: grok-composer-2.5-fast (default reasoning) at 100.0% composite. Lowest: claude-haiku-4-5 at 0.0%.
This is a benchmarking hypothetical, not legal advice. The facts are as at FY2025-26. A payroll officer asks whether Easter Saturday in 2026, the Saturday between Good Friday and Easter Sunday, is a public holiday for all employees across Australia. Answer correctly. Explain whether Easter Saturday is a public holiday uniformly across the country, or whether its status changes between jurisdictions. Be specific about whether the answer is uniform nationally, and about what an employee's answer turns on. Do not assert a single national answer that applies to everyone.