The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.
Top result: claude-opus-4-8 (low reasoning) at 100.0% composite. Lowest result: deepseek-v4-flash at 0.0% composite.
This is a benchmarking hypothetical, not tax advice. Figures and public-holiday dates are as at FY2025-26 (Australia).

You are acting as the bookkeeper for a consultancy registered in Western Australia (WA). This is a SCRIPTED MULTI-TURN run for one quarter. You will be asked a sequence of questions, one per turn. Answer each turn on its own, and keep your earlier answers consistent across the whole run.

Quarter figures (all sales and purchases are GST-taxable at the standard 10% rate, so each GST-inclusive amount contains one eleventh of GST):
- Total sales including GST: $220,000
- Total purchases including GST: $33,000
- Gross wages paid to employees: $80,000
- PAYG withholding rate applied to those wages: 24%

There are no PAYG instalments, fuel tax credits, or other amounts this quarter.

Reminder: GST on a GST-inclusive amount is that amount divided by 11, never 10% added on top.

Western Australia observes the King's Birthday on a different date from the eastern states - use the WA date.
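As an illustration of the one-eleventh rule and the withholding percentage stated in the task, the arithmetic can be sketched in Python. This is a minimal sketch under stated assumptions: the helper `gst_component` and all variable names are hypothetical, the per-turn questions are not shown on this page, and the sketch does not attempt any due-date or public-holiday logic.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def gst_component(gst_inclusive: Decimal) -> Decimal:
    """GST contained in a GST-inclusive amount at the 10% rate (one eleventh).

    Hypothetical helper for illustration; not part of the benchmark itself.
    """
    return (gst_inclusive / Decimal(11)).quantize(CENT)


# Quarter figures copied from the task text.
sales_inc_gst = Decimal("220000")
purchases_inc_gst = Decimal("33000")
gross_wages = Decimal("80000")
withholding_rate = Decimal("0.24")

gst_on_sales = gst_component(sales_inc_gst)          # 220,000 / 11
gst_on_purchases = gst_component(purchases_inc_gst)  # 33,000 / 11
net_gst = gst_on_sales - gst_on_purchases            # GST collected less GST paid
payg_withheld = (gross_wages * withholding_rate).quantize(CENT)

print(f"GST on sales:     ${gst_on_sales:,.2f}")
print(f"GST on purchases: ${gst_on_purchases:,.2f}")
print(f"Net GST:          ${net_gst:,.2f}")
print(f"PAYG withheld:    ${payg_withheld:,.2f}")
```

Dividing by 11 gives $20,000 of GST on sales and $3,000 on purchases. Adding 10% on top of the inclusive figure would instead give $22,000 and $3,300, which is the mistake the task's reminder warns against.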