AYGAT-0006 · UI components · hard

Keyboard-operable ARIA radiogroup with roving tabindex

The same task, run on 27 models. Compare the outputs side by side, or open any one in a popup to inspect it.

Top result: claude-opus-4-8 (max reasoning) at 100.0% composite, a score four other entries also reach. Lowest: gemini-3.5-flash at 93.4%.

How it ran

Each model was given the brief below in a fresh, isolated session with no access to our tools, and returned a single self-contained index.html (inline CSS and JS, no external requests, no build step).
The rendered output was scored 1 to 5 on brief fidelity, visual design, craft, and impact by a four-family vision panel: Anthropic (Claude Opus 4.8), OpenAI (GPT-5.5), Google (Gemini 3.1 Pro), and xAI (Grok 4.3). All four judges used one identical prompt so that their scores are comparable. The published judge score is leave-one-family-out: a model is never scored by a judge of its own family, which removes same-family self-preference.
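As a rough illustration of the leave-one-family-out rule, the sketch below drops any judge that shares the scored model's family and averages the rest. The function name, the data shape, the single score per judge, and the plain mean are all assumptions made for illustration; the page does not say how the remaining judges' scores are combined.

```javascript
// Hypothetical sketch: leaveOneFamilyOutScore, the { family, score } shape,
// and the plain mean are assumptions, not the benchmark's actual pipeline.
function leaveOneFamilyOutScore(modelFamily, judgeScores) {
  // Keep only judges from a different family than the model being scored.
  const eligible = judgeScores.filter((judge) => judge.family !== modelFamily);
  if (eligible.length === 0) return null; // no eligible judge remains

  // Average the remaining scores (assumed aggregation).
  const total = eligible.reduce((sum, judge) => sum + judge.score, 0);
  return total / eligible.length;
}

// Illustrative numbers only, on the 1 to 5 scale described above.
const exampleScores = [
  { family: "anthropic", score: 5 },
  { family: "openai", score: 4 },
  { family: "google", score: 4 },
  { family: "xai", score: 4 },
];

// An Anthropic model is scored by the three other families only.
const anthropicModelScore = leaveOneFamilyOutScore("anthropic", exampleScores);
// A model from a family with no judge on the panel keeps all four scores.
const outsideModelScore = leaveOneFamilyOutScore("deepseek", exampleScores);
```

Under these assumptions, a model whose family has no judge on the panel is scored by all four judges, while a model from a panel family is scored by three.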

The brief

Build a single self-contained page as one HTML file (`index.html`) that renders with no build step and no network calls (inline all CSS and JS, no external fonts or scripts). Build a custom radio group following the WAI-ARIA radiogroup pattern (do NOT use native <input type="radio">).

Requirements:

- A role="radiogroup" containing exactly three role="radio" options with the ids radio-standard, radio-express, radio-overnight. Exactly one option is checked on load (aria-checked="true"); only the checked option has tabindex="0" and the others have tabindex="-1" (roving tabindex, so the whole group is a single Tab stop).
- When a radio has focus, ArrowDown or ArrowRight moves selection to the next option and ArrowUp or ArrowLeft to the previous (wrapping at the ends). Moving selection updates aria-checked (the newly selected option becomes "true", the previous "false"), moves tabindex, and moves DOM focus to the newly selected option.

The radio group must be fully operable with the keyboard alone. Use plain, accessible markup.
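The keyboard behaviour the brief asks for can be sketched in a few lines of JavaScript. This is a minimal illustration, not any benchmarked model's output: the names KEY_STEP, nextIndex, selectRadio, and onRadioKeydown are hypothetical, and the wiring at the end assumes the three ids from the brief already exist in the page as role="radio" elements.

```javascript
// Hypothetical sketch of roving tabindex for the brief's radiogroup.
// Arrow keys named in the brief: +1 moves to the next option, -1 to the previous.
const KEY_STEP = new Map([
  ["ArrowDown", 1],
  ["ArrowRight", 1],
  ["ArrowUp", -1],
  ["ArrowLeft", -1],
]);

// Index that selection should move to, wrapping at both ends.
// Any key the brief does not name leaves the index unchanged.
function nextIndex(current, key, count) {
  const step = KEY_STEP.get(key);
  if (step === undefined) return current;
  return (current + step + count) % count;
}

// Make radios[target] the only checked option and the only Tab stop,
// then move DOM focus to it.
function selectRadio(radios, target) {
  radios.forEach((radio, i) => {
    const selected = i === target;
    radio.setAttribute("aria-checked", selected ? "true" : "false");
    radio.setAttribute("tabindex", selected ? "0" : "-1");
  });
  radios[target].focus();
}

// Keydown handler: ignores keys other than the four arrows.
function onRadioKeydown(radios, event) {
  const current = radios.indexOf(event.target);
  if (current === -1 || !KEY_STEP.has(event.key)) return;
  event.preventDefault(); // keep arrow keys from scrolling the page
  selectRadio(radios, nextIndex(current, event.key, radios.length));
}

// Browser-only wiring; skipped when no DOM is present.
if (typeof document !== "undefined") {
  const radios = ["radio-standard", "radio-express", "radio-overnight"]
    .map((id) => document.getElementById(id));
  if (radios.every(Boolean)) {
    radios.forEach((radio) =>
      radio.addEventListener("keydown", (event) => onRadioKeydown(radios, event)));
  }
}
```

The sketch covers only what the brief states. A fuller implementation of the WAI-ARIA pattern would also handle Space and pointer clicks, which are left out here.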

claude-opus-4-8

Max reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-5

High reasoning

Composite 100.0% · Objective 100.0%

claude-fable-5

High reasoning

Composite 100.0% · Objective 100.0%

gpt-5.5

High reasoning

Composite 100.0% · Objective 100.0%

gpt-5.4-mini

High reasoning

Composite 100.0% · Objective 100.0%

deepseek-v4-pro

Default reasoning

Composite 99.1% · Objective 99.1%

claude-opus-4-8

Low reasoning

Composite 98.1% · Objective 98.1%

claude-opus-4-8

Medium reasoning

Composite 98.1% · Objective 98.1%

claude-haiku-4-5

High reasoning

Composite 98.1% · Objective 98.1%

grok-composer-2.5-fast

Default reasoning

Composite 98.1% · Objective 98.1%

claude-sonnet-5

High reasoning

Composite 98.1% · Objective 98.1%

claude-fable-5

High reasoning

Composite 98.1% · Objective 98.1%

claude-haiku-4-5

Default reasoning

Composite 98.1% · Objective 98.1%

claude-sonnet-4-6

High reasoning

Composite 97.2% · Objective 97.2%

kimi-k2.7-code

Default reasoning

Composite 97.2% · Objective 97.2%

gemini-3.1-pro-preview

High reasoning

Composite 97.2% · Objective 97.2%

gemini-3.1-flash-lite

Default reasoning

Composite 97.2% · Objective 97.2%

grok-4.3

Default reasoning

Composite 97.2% · Objective 97.2%

grok-build-0.1

Default reasoning

Composite 97.2% · Objective 97.2%

claude-opus-4-8

High reasoning

Composite 97.2% · Objective 97.2%

deepseek-v4-flash

Default reasoning

Composite 97.2% · Objective 97.2%

claude-opus-4-8

Extra-high reasoning

Composite 96.3% · Objective 96.3%

claude-opus-4-8

High reasoning

Composite 94.4% · Objective 94.4%

glm-5.2

Default reasoning

Composite 94.4% · Objective 94.4%

grok-4.20-reasoning

Default reasoning

Composite 94.4% · Objective 94.4%

claude-sonnet-4-6

High reasoning

Composite 94.4% · Objective 94.4%

gemini-3.5-flash

Default reasoning

Composite 93.4% · Objective 93.4%