
CODE-0006 · Programming · hard

Arithmetic expression evaluator with precedence

The same task, run on 28 models.

Top result: claude-opus-4-8 (high reasoning) at 100.0% composite, a score shared by 24 of the 28 entries listed below. Lowest: claude-haiku-4-5 (high reasoning) at 0.0%, tied with claude-opus-4-8 at low, medium, and max reasoning.

How it ran

Each model was given the brief below in a fresh, isolated session with no access to our tools, and returned its answer from scratch.
The rendered output was scored 1 to 5 on brief fidelity, visual design, craft, and impact by a four-family vision panel (Anthropic's Claude Opus 4.8, OpenAI's GPT-5.5, Google's Gemini 3.1 Pro, and xAI's Grok 4.3), with every judge given the same prompt so that scores are comparable. The published judge score is leave-one-family-out: a model is never scored by a judge from its own family, which removes same-family self-preference.
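The leave-one-family-out rule can be illustrated with a short sketch. Everything here is an assumption made for illustration: the page does not say how models are mapped to families or how the remaining judge scores are combined, so the prefix table `FAMILY_PREFIXES`, the helper names, the use of a plain mean, and the scores themselves are all hypothetical.

```python
from statistics import mean

# Hypothetical mapping from model-name prefix to judge family. The page does
# not describe how families are assigned; this table is an assumption.
FAMILY_PREFIXES = {
    "claude": "Anthropic",
    "gpt": "OpenAI",
    "gemini": "Google",
    "grok": "xAI",
}


def family_of(model_name):
    """Return the judge family a model belongs to, or None if it has none."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model_name.startswith(prefix):
            return family
    return None


def leave_one_family_out(model_name, judge_scores):
    """Combine judge scores, skipping the judge from the model's own family.

    judge_scores maps judge family -> score on the 1-to-5 scale.
    Averaging the remaining scores is an assumption, not a documented rule.
    """
    own = family_of(model_name)
    kept = [score for fam, score in judge_scores.items() if fam != own]
    return mean(kept)


# Made-up scores, only to show the mechanics.
scores = {"Anthropic": 5, "OpenAI": 4, "Google": 4, "xAI": 3}
print(leave_one_family_out("claude-opus-4-8", scores))  # Anthropic judge dropped
print(leave_one_family_out("glm-5.2", scores))          # no family match, all four kept
```

In this sketch a model with no judge in its family, such as glm-5.2, is scored by all four judges, while a Claude model is scored by the three non-Anthropic judges only.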

The brief

Implement a Python function `eval_expr(s)` that evaluates an arithmetic expression given as a string and returns its numeric value.

The expression language supports:
- the binary operators `+`, `-`, `*`, `/` with standard precedence (`*` and `/` bind tighter than `+` and `-`) and left associativity,
- parentheses `(` and `)` to override precedence,
- unary `+` and `-` (for example `-5` or `3*-2`),
- integer and decimal number literals (for example `7` or `1.5`),
- arbitrary surrounding and internal whitespace, which is ignored.

Semantics:
- `/` is true division (so `7/2` is 3.5).
- If the result is an exact integer value, return it as an `int` (so `6/2` returns `3`, not `3.0`); otherwise return a `float`.
- Raise `ValueError` for a malformed expression (empty input, a dangling operator, mismatched parentheses, two adjacent numbers, an unknown token) and for division by zero.

Do not use `eval`, `exec`, or any expression-parsing library: write the parser yourself. Use only the Python standard library. Write your solution to `solution.py`.
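For orientation, here is a minimal sketch of one way the brief could be met, using a regex tokenizer and a recursive-descent parser. It is not any model's submission and not the reference solution. The helper names `_tokenize` and `_Parser` are hypothetical, and the sketch assumes that a decimal literal needs digits on both sides of the point (so `.5` and `5.` are rejected), which the brief does not settle. Arithmetic uses ordinary Python `int` and `float` values.

```python
import re

# A number (digits, optionally followed by ".digits") or any single
# non-space character. Whitespace is skipped because nothing matches it.
_TOKEN = re.compile(r"[0-9]+(?:\.[0-9]+)?|\S")


def _tokenize(s):
    """Turn the input into a list of numbers (int/float) and operator strings."""
    tokens = []
    for tok in _TOKEN.findall(s):
        if tok[0] in "0123456789":
            tokens.append(float(tok) if "." in tok else int(tok))
        elif tok in "+-*/()":
            tokens.append(tok)
        else:
            raise ValueError(f"unknown token {tok!r}")
    return tokens


class _Parser:
    """Recursive descent: expr -> term -> unary -> primary."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # Looping (rather than recursing on the right) gives left associativity.
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.unary()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.unary()
            if op == "*":
                value = value * rhs
            else:
                if rhs == 0:
                    raise ValueError("division by zero")
                value = value / rhs  # true division
        return value

    def unary(self):
        tok = self.peek()
        if tok == "+":
            self.take()
            return +self.unary()
        if tok == "-":
            self.take()
            return -self.unary()
        return self.primary()

    def primary(self):
        tok = self.take()
        if tok is None:
            raise ValueError("unexpected end of expression")
        if not isinstance(tok, str):
            return tok  # a number literal
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        raise ValueError(f"unexpected token {tok!r}")


def eval_expr(s):
    parser = _Parser(_tokenize(s))
    value = parser.expr()
    if parser.peek() is not None:
        raise ValueError("unexpected trailing token")
    # Collapse exact-integer floats (such as 6/2 == 3.0) to int.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value
```

Each grammar level maps to one precedence tier, so `2 + 3 * (4 - 1)` is evaluated as `2 + 9`. Leftover tokens after the top-level expression are what catch two adjacent numbers and a stray `)`.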

claude-opus-4-8

High reasoning

Composite 100.0% · Objective 100.0%

claude-opus-4-8

Extra-high reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-4-6

High reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-5

High reasoning

Composite 100.0% · Objective 100.0%

claude-fable-5

High reasoning

Composite 100.0% · Objective 100.0%

glm-5.2

default reasoning

Composite 100.0% · Objective 100.0%

kimi-k2.7-code

default reasoning

Composite 100.0% · Objective 100.0%

gpt-5.5

High reasoning

Composite 100.0% · Objective 100.0%

gpt-5.5-pro

High reasoning

Composite 100.0% · Objective 100.0%

gpt-5.4-mini

High reasoning

Composite 100.0% · Objective 100.0%

gemini-3.1-pro-preview

High reasoning

Composite 100.0% · Objective 100.0%

gemini-3.5-flash

default reasoning

Composite 100.0% · Objective 100.0%

gemini-3.1-flash-lite

default reasoning

Composite 100.0% · Objective 100.0%

grok-4.3

default reasoning

Composite 100.0% · Objective 100.0%

grok-4.20-reasoning

default reasoning

Composite 100.0% · Objective 100.0%

grok-build-0.1

default reasoning

Composite 100.0% · Objective 100.0%

grok-composer-2.5-fast

default reasoning

Composite 100.0% · Objective 100.0%

claude-opus-4-8

High reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-4-6

High reasoning

Composite 100.0% · Objective 100.0%

claude-sonnet-5

High reasoning

Composite 100.0% · Objective 100.0%

claude-fable-5

High reasoning

Composite 100.0% · Objective 100.0%

claude-haiku-4-5

default reasoning

Composite 100.0% · Objective 100.0%

deepseek-v4-pro

default reasoning

Composite 100.0% · Objective 100.0%

deepseek-v4-flash

default reasoning

Composite 100.0% · Objective 100.0%

claude-opus-4-8

Low reasoning

Composite 0.0% · Objective 0.0%

claude-opus-4-8

Medium reasoning

Composite 0.0% · Objective 0.0%

claude-opus-4-8

Max reasoning

Composite 0.0% · Objective 0.0%

claude-haiku-4-5

High reasoning

Composite 0.0% · Objective 0.0%
