claude-sonnet-4-6 (Anthropic) scored 83.9% composite across 87 tasks: code, UI, full websites, SVG, marketing pages, dashboards, animations, and Australian legal and accounting. Tasks were graded by execution, and the visual builds by a cross-family vision panel (leave-one-family-out). Run on 2026-06-24.
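The page does not spell out how the leave-one-family-out panel combines its judges. As a minimal sketch, assuming each judge returns a score from 0 to 100 and the panel takes a plain mean, the idea is that judges from the candidate model's own family are excluded before averaging. The function name, the family labels, and the scores below are all hypothetical.

```python
from statistics import mean


def panel_score(candidate_family: str, judge_scores: dict[str, float]) -> float:
    """Average the judges' scores, skipping any judge that shares the
    candidate's model family (leave-one-family-out).

    judge_scores maps a judge's family name to its score (0-100).
    Plain averaging is an assumption, not the benchmark's documented rule.
    """
    eligible = [
        score
        for family, score in judge_scores.items()
        if family != candidate_family
    ]
    if not eligible:
        raise ValueError("no judges left after excluding the candidate's family")
    return mean(eligible)


# Hypothetical scores: the judge from the candidate's own family is ignored.
scores = {"anthropic": 90.0, "family_b": 80.0, "family_c": 70.0}
print(panel_score("anthropic", scores))  # 75.0
```

Excluding the same-family judge is meant to avoid a model's relatives grading its output; whether the real panel averages, takes a median, or weights judges is not stated in the source.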
Composite score per domain, weakest first. "Judge" is the vision model’s assessment, shown for the visual domains.
Programming and Australian legal and accounting tasks, graded by execution. 29 of 34 scored a perfect 100.0%; the remaining five are listed below.
| Task | Domain | Difficulty | Objective | pass@1 | Output |
|---|---|---|---|---|---|
| AUSFA-0009 | Australian accounting | hard | 15.0% | 0.0% | |
| AUSFA-0011 | Australian accounting | hard | 66.7% | 0.0% | |
| AUSFA-0012 | Australian accounting | hard | 33.3% | 0.0% | |
| AUSFA-0013 | Australian accounting | hard | 66.7% | 0.0% | |
| AUSFA-0014 | Australian accounting | hard | 0.0% | 0.0% | |
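Several rows show a non-zero Objective score next to a pass@1 of 0.0%. The page does not define either column, but one plausible reading is that Objective is the share of individual checks passed, while pass@1 is all-or-nothing on a single attempt. The sketch below illustrates that reading only; the function names and the three-check example are hypothetical and not taken from the benchmark.

```python
def objective_score(checks: list[bool]) -> float:
    """Percentage of individual checks that passed (assumed meaning of Objective)."""
    return 100.0 * sum(checks) / len(checks)


def pass_at_1(checks: list[bool]) -> float:
    """100.0 only if every check passed on the single attempt, else 0.0
    (assumed meaning of pass@1 for one task)."""
    return 100.0 if all(checks) else 0.0


# Hypothetical task with three checks, two of which pass.
checks = [True, True, False]
print(round(objective_score(checks), 1))  # 66.7
print(pass_at_1(checks))                  # 0.0
```

Under this assumption, a task such as AUSFA-0011 (66.7% Objective, 0.0% pass@1) would be a partially correct answer that still fails the strict pass criterion.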