claude-opus-4-8 (Anthropic) scored 81.8% composite across 87 tasks spanning code, UI, full websites, SVG, marketing pages, dashboards, animations, and Australian legal and accounting work. Tasks are graded by execution, and the visual builds by a cross-family vision panel (leave-one-family-out). Run on 2026-06-24.
Composite score per domain, weakest first. The Judge score is the vision panel's assessment and is shown only for the visual domains.
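The page does not spell out how the leave-one-family-out panel combines its judges. One plausible reading is that any judge from the same model family as the candidate is excluded and the remaining scores are averaged. The sketch below illustrates that reading only; the judge names, family labels, scores, and the use of a plain mean are all assumptions, not the benchmark's actual method.

```python
from statistics import mean


def leave_one_family_out(scores, judge_family, candidate_family):
    """Average judge scores, skipping judges that share the candidate's family.

    Assumption: the panel aggregates with a simple mean. The source page
    does not say how the panel's scores are combined.
    """
    kept = [
        score
        for judge, score in scores.items()
        if judge_family[judge] != candidate_family
    ]
    if not kept:
        raise ValueError("no judges left after excluding the candidate's family")
    return mean(kept)


# Hypothetical judges and scores, for illustration only.
judge_family = {"judge-a": "Anthropic", "judge-b": "FamilyB", "judge-c": "FamilyC"}
scores = {"judge-a": 90.0, "judge-b": 80.0, "judge-c": 70.0}

# judge-a shares the candidate's family, so only judge-b and judge-c count.
panel_score = leave_one_family_out(scores, judge_family, "Anthropic")
```

Excluding the candidate's own family is meant to keep a model from being graded by a close relative, at the cost of a smaller panel for each candidate.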
Programming, Australian law, and Australian accounting tasks, graded by execution. 27 of 34 scored a perfect 100.0%; the remaining 7 are listed below.
| Task | Domain | Difficulty | Objective | pass@1 | Output |
|---|---|---|---|---|---|
| AUSFA-0011 | Australian accounting | hard | 66.7% | 0.0% | |
| AUSFA-0012 | Australian accounting | hard | 33.3% | 0.0% | |
| AUSFA-0013 | Australian accounting | hard | 66.7% | 0.0% | |
| AUSFA-0014 | Australian accounting | hard | 0.0% | 0.0% | |
| CODE-0004 | Programming | easy | 0.0% | 0.0% | |
| CODE-0006 | Programming | hard | 0.0% | 0.0% | |
| LAW-0003 | Australian law | hard | 0.0% | 0.0% | |
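The page does not define the Objective and pass@1 columns. Rows such as AUSFA-0011 (66.7% Objective, 0.0% pass@1) are consistent with Objective being partial credit over a task's checks and pass@1 being all-or-nothing on a single attempt. The sketch below shows that interpretation under those assumptions; the function names and the three-check example are hypothetical.

```python
def objective_score(checks):
    """Percentage of checks that passed (partial credit), to one decimal."""
    return round(100.0 * sum(checks) / len(checks), 1)


def pass_at_1(checks):
    """100.0 only if the single attempt passed every check, otherwise 0.0."""
    return 100.0 if all(checks) else 0.0


# Hypothetical task with three checks, two of which pass.
checks = [True, True, False]

partial = objective_score(checks)
strict = pass_at_1(checks)
```

Under this reading, a task can earn most of its Objective credit and still score zero on pass@1, which would explain why every row in the table above shows 0.0% pass@1.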