AUSFA-0012 · Australian accounting · hard

Anzac Day 2026 substitute holiday trap

The same task, run on 27 model configurations (18 distinct models, some at several reasoning levels or run more than once). Compare the outputs side by side.

Top result: grok-composer-2.5-fast (default reasoning) at 100.0% composite. Lowest: 0.0%, a score shared by 13 runs, with deepseek-v4-flash listed last.

How it ran

Each model was given the brief below in a fresh, isolated session with no access to our tools, and returned its answer from scratch.
The rendered output was scored 1 to 5 on brief fidelity, visual design, craft, and impact by a vision panel drawn from four model families: Anthropic (Claude Opus 4.8), OpenAI (GPT-5.5), Google (Gemini 3.1 Pro), and xAI (Grok 4.3). Every judge received the same prompt so that scores are comparable. The published judge score is leave-one-family-out: a model is never scored by a judge from its own family, which removes same-family self-preference.
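
The leave-one-family-out rule can be illustrated with a minimal sketch. The prefix-to-family mapping, the function names, and the use of a plain mean to combine the remaining judges are all assumptions made for illustration; the page does not state how the remaining scores are aggregated.

```python
from statistics import mean

# Hypothetical mapping from model-name prefix to judge family (an assumption,
# not taken from the benchmark's own code).
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "grok": "xai",
}


def family_of(model_name):
    """Return the judge family a model belongs to, or None if it has no judge."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model_name.startswith(prefix):
            return family
    return None


def leave_one_family_out(model_name, judge_scores):
    """Combine judge scores, skipping the judge from the model's own family.

    judge_scores maps a judge family to its 1-5 score. Averaging is an
    assumed aggregation; the source only says same-family judges are excluded.
    """
    own_family = family_of(model_name)
    kept = [score for family, score in judge_scores.items() if family != own_family]
    return mean(kept)


# Made-up scores, for illustration only.
example_scores = {"anthropic": 5, "openai": 4, "google": 3, "xai": 2}
own_family_excluded = leave_one_family_out("claude-opus-4-8", example_scores)
no_family_on_panel = leave_one_family_out("deepseek-v4-pro", example_scores)
```

In this sketch a model with no judge of its own family on the panel, such as a DeepSeek model, is simply scored by all four judges.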

The brief

This is a benchmarking hypothetical, not legal advice. The facts are as at FY2025-26. Anzac Day in 2026 falls on a Saturday. A payroll officer asks whether there is an extra substitute public holiday on the following Monday for all employees across Australia. Answer correctly. Explain whether the additional Monday public holiday applies right across the country, or just in particular jurisdictions. Be specific about whether the additional Monday is a uniform entitlement everywhere, and about what an employee's answer turns on. Do not claim a single national substitute Monday that applies to everyone.
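
The calendar premise of the brief can be checked with a short sketch. It only confirms the weekday arithmetic. It says nothing about whether any jurisdiction treats the following Monday as a public holiday, which is the question the brief actually tests. The variable names are illustrative.

```python
from datetime import date, timedelta

anzac_day = date(2026, 4, 25)

# date.weekday() numbers Monday as 0, so Saturday is 5.
is_saturday = anzac_day.weekday() == 5

# Days to the next Monday strictly after Anzac Day (7 if it is already a Monday).
days_ahead = (7 - anzac_day.weekday()) % 7 or 7
following_monday = anzac_day + timedelta(days=days_ahead)
```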

grok-composer-2.5-fast

default reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-5

High reasoning

Composite 66.7% · Objective 66.7%

claude-opus-4-8

Low reasoning

Composite 33.3% · Objective 33.3%

claude-opus-4-8

Medium reasoning

Composite 33.3% · Objective 33.3%

claude-opus-4-8

High reasoning

Composite 33.3% · Objective 33.3%

claude-sonnet-4-6

High reasoning

Composite 33.3% · Objective 33.3%

claude-fable-5

High reasoning

Composite 33.3% · Objective 33.3%

glm-5.2

default reasoning

Composite 33.3% · Objective 33.3%

kimi-k2.7-code

default reasoning

Composite 33.3% · Objective 33.3%

gemini-3.1-flash-lite

default reasoning

Composite 33.3% · Objective 33.3%

grok-4.3

default reasoning

Composite 33.3% · Objective 33.3%

claude-fable-5

High reasoning

Composite 33.3% · Objective 33.3%

claude-haiku-4-5

default reasoning

Composite 33.3% · Objective 33.3%

deepseek-v4-pro

default reasoning

Composite 33.3% · Objective 33.3%

claude-opus-4-8

Extra-high reasoning

Composite 0.0% · Objective 0.0%

claude-opus-4-8

Max reasoning

Composite 0.0% · Objective 0.0%

claude-sonnet-5

High reasoning

Composite 0.0% · Objective 0.0%

claude-haiku-4-5

High reasoning

Composite 0.0% · Objective 0.0%

gpt-5.5

High reasoning

Composite 0.0% · Objective 0.0%

gpt-5.4-mini

High reasoning

Composite 0.0% · Objective 0.0%

gemini-3.1-pro-preview

High reasoning

Composite 0.0% · Objective 0.0%

gemini-3.5-flash

default reasoning

Composite 0.0% · Objective 0.0%

grok-4.20-reasoning

default reasoning

Composite 0.0% · Objective 0.0%

grok-build-0.1

default reasoning

Composite 0.0% · Objective 0.0%

claude-opus-4-8

High reasoning

Composite 0.0% · Objective 0.0%

claude-sonnet-4-6

High reasoning

Composite 0.0% · Objective 0.0%

deepseek-v4-flash

default reasoning

Composite 0.0% · Objective 0.0%
