Banja Lab / Benchmarks / Test
The same task, run on 27 models, with the outputs compared side by side.
Top result: grok-composer-2.5-fast (default reasoning) at a 100.0% composite score. Lowest: deepseek-v4-flash at 0.0%.
This is a benchmarking hypothetical, not legal advice. The facts are as at FY2025-26. Anzac Day in 2026 falls on a Saturday. A payroll officer asks whether there is an extra substitute public holiday on the following Monday for all employees across Australia. Answer correctly. Explain whether the additional Monday public holiday applies right across the country, or just in particular jurisdictions. Be specific about whether the additional Monday is a uniform entitlement everywhere, and about what an employee's answer turns on. Do not claim a single national substitute Monday that applies to everyone.
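The date premise in the task can be checked independently of any legal question. The snippet below is a minimal sketch using only Python's standard library: it confirms the weekday of 25 April 2026 and computes the date of the Monday that follows. The variable names are illustrative assumptions, and the code says nothing about which jurisdictions, if any, declare that Monday a public holiday; that depends on each state's or territory's own rules and is deliberately not encoded here.

```python
from datetime import date, timedelta

# Anzac Day is 25 April; the year 2026 comes from the task statement.
anzac_day = date(2026, 4, 25)

# date.weekday() counts Monday as 0, so Saturday is 5.
falls_on_saturday = anzac_day.weekday() == 5

# Step forward to the next Monday (always strictly after anzac_day).
days_to_monday = (7 - anzac_day.weekday()) % 7 or 7
following_monday = anzac_day + timedelta(days=days_to_monday)

print("Anzac Day 2026 falls on a Saturday:", falls_on_saturday)
print("Following Monday:", following_monday.isoformat())
```

This only establishes the calendar facts the question relies on. Whether the Monday is a paid public holiday for a given employee still has to be answered jurisdiction by jurisdiction.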