SCREE-0003 · Websites · hard

Reproduce an editorial blog article page from its screenshot

The same task, run on 27 model configurations (some models appear more than once, at different reasoning levels or as repeat runs). Results are listed below from highest to lowest composite score.

Top result: grok-composer-2.5-fast (default reasoning) at 100.0% composite. Lowest: 0.0%, shared by three runs: claude-haiku-4-5 (High reasoning), gemini-3.1-flash-lite, and deepseek-v4-flash.

How it ran

Each model received the brief below in a fresh, isolated session with no access to our tools, and returned a single self-contained index.html (inline CSS and JS, no external requests, no build step).
The rendered output was scored 1 to 5 on brief fidelity, visual design, craft, and impact by a four-family vision panel - Anthropic (Claude Opus 4.8), OpenAI (GPT-5.5), Google (Gemini 3.1 Pro), and xAI (Grok 4.3) - using one identical prompt so that scores are comparable across judges. The published judge score is leave-one-family-out: a model is never scored by a judge from its own family, which removes same-family self-preference.
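
The leave-one-family-out rule can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the prefix-to-family mapping, the function names, the criterion keys, and above all the use of a plain mean to combine the remaining judges, since the page does not say how panel scores are aggregated or how the composite is derived from them.

```python
from statistics import mean

# Hypothetical mapping from model-name prefix to judge family (assumed, not from the page).
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "grok": "xai",
}

# The four criteria named in the methodology; the key spellings are our own.
CRITERIA = ("brief_fidelity", "visual_design", "craft", "impact")


def family_of(model_name):
    """Return the judge family a model belongs to, or None if it has no judge on the panel."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model_name.startswith(prefix):
            return family
    return None


def leave_one_family_out(model_name, panel_scores):
    """Average each criterion over every judge except the model's own family.

    panel_scores maps judge family -> {criterion: score from 1 to 5}.
    Averaging is an assumed aggregation; the real pipeline may differ.
    """
    own_family = family_of(model_name)
    kept = [scores for family, scores in panel_scores.items() if family != own_family]
    return {criterion: mean(s[criterion] for s in kept) for criterion in CRITERIA}
```

Under these assumptions, a Claude model is averaged over the OpenAI, Google, and xAI judges only, while a model with no judge on the panel (for example a DeepSeek model) is averaged over all four.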

The brief

You are given a reference screenshot of an editorial blog article page. Reproduce it as faithfully as you can as ONE self-contained HTML file (`index.html`) that renders with no build step and no network calls (inline all CSS, no external fonts, scripts, or images). Match what the screenshot shows:

- a warm off-white page with serif body text,
- a top masthead bar: the rust-red "The Ledger" word-mark on the left and a nav on the right with the links Latest, Engineering, Culture, About,
- a centred article column with a small rust-red "ENGINEERING" category label, then a large serif headline "How we cut build times in half without changing the stack",
- a byline row with a small round avatar and the text "By Dana Okoro - 14 June 2026 - 6 min read",
- a wide rounded hero banner (a warm orange gradient block) below the byline,
- article body paragraphs, a sub-heading "Start by measuring, not guessing", and a pull-quote (blockquote) styled with a rust-red left border.

Keep the warm editorial palette, the centred single-column article layout, the serif headline, and the hero banner close to the screenshot. The page must stay readable when narrowed.

grok-composer-2.5-fast

default reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-4-6

High reasoning

Composite 93.7% · Objective 93.7%

claude-fable-5

High reasoning

Composite 93.6% · Objective 93.6%

kimi-k2.7-code

default reasoning

Composite 93.4% · Objective 93.4%

claude-sonnet-4-6

High reasoning

Composite 92.4% · Objective 92.4%

claude-opus-4-8

High reasoning

Composite 92.2% · Objective 92.2%

deepseek-v4-pro

default reasoning

Composite 91.9% · Objective 91.9%

claude-opus-4-8

Low reasoning

Composite 91.7% · Objective 91.7%

gemini-3.1-pro-preview

High reasoning

Composite 91.5% · Objective 91.5%

grok-4.3

default reasoning

Composite 91.5% · Objective 91.5%

claude-sonnet-5

High reasoning

Composite 91.4% · Objective 91.4%

claude-sonnet-5

High reasoning

Composite 91.4% · Objective 91.4%

claude-opus-4-8

Extra-high reasoning

Composite 91.1% · Objective 91.1%

claude-opus-4-8

High reasoning

Composite 90.8% · Objective 90.8%

claude-fable-5

High reasoning

Composite 90.7% · Objective 90.7%

claude-opus-4-8

Medium reasoning

Composite 90.7% · Objective 90.7%

claude-haiku-4-5

default reasoning

Composite 90.3% · Objective 90.3%

grok-build-0.1

default reasoning

Composite 90.0% · Objective 90.0%

glm-5.2

default reasoning

Composite 89.9% · Objective 89.9%

gemini-3.5-flash

default reasoning

Composite 89.5% · Objective 89.5%

grok-4.20-reasoning

default reasoning

Composite 87.2% · Objective 87.2%

claude-opus-4-8

Max reasoning

Composite 87.1% · Objective 87.1%

gpt-5.5

High reasoning

Composite 85.0% · Objective 85.0%

gpt-5.4-mini

High reasoning

Composite 81.9% · Objective 81.9%

claude-haiku-4-5

High reasoning

Composite 0.0% · Objective 0.0%

gemini-3.1-flash-lite

default reasoning

Composite 0.0% · Objective 0.0%

deepseek-v4-flash

default reasoning

Composite 0.0% · Objective 0.0%