Human knife-edge

Benchmark image 04

Sandwich Costume

Human "Sandwich"

A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich that still fails the crucial requirement of being lunch. It is the kind of edge case that makes literalists sound insane and compositionalists sound worse.

Under development: this benchmark and its published results are provisional, not final.

Human

40.9% yes, 59.1% no

Model average

6.0% yes, 94.0% no

Most aligned model

0.8 point gap from humans

allenai/molmo-2-8b

Least aligned model

59.1 point gap from humans

anthropic/claude-sonnet-4.6

At a glance

How this photo split the room

Human distribution

40.9% yes, 59.1% no over 656 explicit votes.

Model average distribution

6.0% yes, 94.0% no across the current model set.

Closest current model

40.0% yes.

allenai/molmo-2-8b

Least aligned model

59.1 point gap.

anthropic/claude-sonnet-4.6

Legacy GPT-4o baseline

0.0% yes with a 40.9 point gap against humans.

Biggest model gap

59.1 percentage points on this image.

Current classification

Human knife-edge

Benchmark context

Models compared

74 current runs

Model spread

How Models Align with Human Responses

This section compares each model's yes rate with the human yes rate to show how closely the model aligns with people.
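
As a rough illustration of this comparison, the human gap listed for each model appears to be the absolute difference, in percentage points, between the model's yes rate and the human yes rate. The sketch below assumes that definition; the name `human_gap` is hypothetical and not taken from the benchmark. The page seems to compute gaps from unrounded rates, so recomputing from the rounded figures can be off by 0.1 (a 40.0% yes rate gives 0.9 here, against a displayed 0.8).

```python
HUMAN_YES = 40.9  # human yes rate for this image, as displayed (already rounded)


def human_gap(model_yes: float, human_yes: float = HUMAN_YES) -> float:
    """Absolute gap in percentage points (assumed definition, not confirmed by the page)."""
    return round(abs(model_yes - human_yes), 1)


# Models that always answer "yes" or always answer "no" on this image.
print(human_gap(100.0))
print(human_gap(0.0))
```

Because the human vote sits near 40% yes, a model that always says "no" lands closer to humans on this image than one that always says "yes".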

amazon/nova-2-lite-v1: 100.0% no, 0.0% yes; human gap 40.9%; rank #60
amazon/nova-lite-v1: 100.0% no, 0.0% yes; human gap 40.9%; rank #51
amazon/nova-pro-v1: 100.0% no, 0.0% yes; human gap 40.9%; rank #73
anthropic/claude-haiku-4.5: 100.0% no, 0.0% yes; human gap 40.9%; rank #52
anthropic/claude-opus-4.5: 100.0% no, 0.0% yes; human gap 40.9%; rank #56
anthropic/claude-opus-4.8: 100.0% no, 0.0% yes; human gap 40.9%; rank #40
baidu/ernie-4.5-vl-28b-a3b: 100.0% no, 0.0% yes; human gap 40.9%; rank #69
bytedance-seed/seed-1.6: 100.0% no, 0.0% yes; human gap 40.9%; rank #41
bytedance-seed/seed-1.6-flash: 100.0% no, 0.0% yes; human gap 40.9%; rank #20
bytedance-seed/seed-2.0-mini: 100.0% no, 0.0% yes; human gap 40.9%; rank #19
google/gemini-2.5-flash: 100.0% no, 0.0% yes; human gap 40.9%; rank #21
google/gemini-2.5-flash-lite: 100.0% no, 0.0% yes; human gap 40.9%; rank #54
google/gemini-3-flash-preview: 100.0% no, 0.0% yes; human gap 40.9%; rank #75
google/gemini-3-pro-image-preview: 100.0% no, 0.0% yes; human gap 40.9%; rank #42
google/gemini-3.1-flash-lite-preview: 100.0% no, 0.0% yes; human gap 40.9%; rank #55
google/gemini-3.1-pro-preview: 100.0% no, 0.0% yes; human gap 40.9%; rank #45
google/gemma-3-12b-it: 100.0% no, 0.0% yes; human gap 40.9%; rank #26
google/gemma-3-27b-it: 100.0% no, 0.0% yes; human gap 40.9%; rank #48
GPT-4o (Spring 2024): 100.0% no, 0.0% yes; human gap 40.9%; rank #4
meta-llama/llama-4-maverick: 100.0% no, 0.0% yes; human gap 40.9%; rank #68
meta-llama/llama-4-scout: 100.0% no, 0.0% yes; human gap 40.9%; rank #33
minimax/minimax-01: 100.0% no, 0.0% yes; human gap 40.9%; rank #72
mistralai/mistral-large-2512: 100.0% no, 0.0% yes; human gap 40.9%; rank #71
openai/gpt-4.1: 100.0% no, 0.0% yes; human gap 40.9%; rank #74
openai/gpt-4.1-mini: 100.0% no, 0.0% yes; human gap 40.9%; rank #57
openai/gpt-4.1-nano: 100.0% no, 0.0% yes; human gap 40.9%; rank #36
openai/gpt-4o: 100.0% no, 0.0% yes; human gap 40.9%; rank #15
openai/gpt-4o-2024-11-20: 100.0% no, 0.0% yes; human gap 40.9%; rank #67
openai/gpt-4o-mini: 100.0% no, 0.0% yes; human gap 40.9%; rank #61
openai/gpt-5.1: 100.0% no, 0.0% yes; human gap 40.9%; rank #49
openai/gpt-5.1-chat: 100.0% no, 0.0% yes; human gap 40.9%; rank #8
openai/gpt-5.1-codex: 100.0% no, 0.0% yes; human gap 40.9%; rank #37
openai/gpt-5.2: 100.0% no, 0.0% yes; human gap 40.9%; rank #43
openai/gpt-5.3-chat: 100.0% no, 0.0% yes; human gap 40.9%; rank #30
openai/gpt-5.3-codex: 100.0% no, 0.0% yes; human gap 40.9%; rank #44
openai/gpt-5.4: 100.0% no, 0.0% yes; human gap 40.9%; rank #59
openai/gpt-5.4-mini: 100.0% no, 0.0% yes; human gap 40.9%; rank #28
openai/gpt-5.4-nano: 100.0% no, 0.0% yes; human gap 40.9%; rank #31
openai/gpt-5.4-pro: 100.0% no, 0.0% yes; human gap 40.9%; rank #65
openai/gpt-5.5: 100.0% no, 0.0% yes; human gap 40.9%; rank #46
openai/o1: 100.0% no, 0.0% yes; human gap 40.9%; rank #2
openai/o1-pro: 100.0% no, 0.0% yes; human gap 40.9%; rank #1
openai/o3: 100.0% no, 0.0% yes; human gap 40.9%; rank #64
openai/o3-pro: 100.0% no, 0.0% yes; human gap 40.9%; rank #53
openrouter/healer-alpha: 100.0% no, 0.0% yes; human gap 40.9%; rank #10
perplexity/sonar-pro-search: 100.0% no, 0.0% yes; human gap 40.9%; rank #32
qwen/qwen2.5-vl-32b-instruct: 100.0% no, 0.0% yes; human gap 40.9%; rank #39
qwen/qwen3-vl-235b-a22b-instruct: 100.0% no, 0.0% yes; human gap 40.9%; rank #47
qwen/qwen3-vl-30b-a3b-instruct: 100.0% no, 0.0% yes; human gap 40.9%; rank #66
qwen/qwen3-vl-30b-a3b-thinking: 100.0% no, 0.0% yes; human gap 40.9%; rank #22
qwen/qwen3.5-122b-a10b: 100.0% no, 0.0% yes; human gap 40.9%; rank #11
qwen/qwen3.5-35b-a3b: 100.0% no, 0.0% yes; human gap 40.9%; rank #23
qwen/qwen3.5-9b: 100.0% no, 0.0% yes; human gap 40.9%; rank #27
qwen/qwen3.5-flash-02-23: 100.0% no, 0.0% yes; human gap 40.9%; rank #9
x-ai/grok-4.1-fast: 100.0% no, 0.0% yes; human gap 40.9%; rank #16
x-ai/grok-4.20-beta: 100.0% no, 0.0% yes; human gap 40.9%; rank #17
z-ai/glm-4.6v: 100.0% no, 0.0% yes; human gap 40.9%; rank #58
moonshotai/kimi-k2.5: 99.0% no, 1.0% yes; human gap 39.9%; rank #13
qwen/qwen3.5-27b: 99.0% no, 1.0% yes; human gap 39.9%; rank #18
qwen/qwen2.5-vl-72b-instruct: 97.4% no, 2.6% yes; human gap 38.3%; rank #70
nvidia/nemotron-nano-12b-v2-vl: 97.0% no, 3.0% yes; human gap 37.9%; rank #7
qwen/qwen-2-vl-72b-instruct: 96.0% no, 4.0% yes; human gap 36.9%; rank #29
qwen/qwen3.5-397b-a17b: 96.0% no, 4.0% yes; human gap 36.9%; rank #34
mistralai/pixtral-large-2411: 94.0% no, 6.0% yes; human gap 34.9%; rank #50
qwen/qwen3.5-plus-02-15: 94.0% no, 6.0% yes; human gap 34.9%; rank #35
x-ai/grok-4-fast: 92.0% no, 8.0% yes; human gap 32.9%; rank #5
google/gemini-2.5-pro: 89.0% no, 11.0% yes; human gap 29.9%; rank #25
x-ai/grok-4: 87.0% no, 13.0% yes; human gap 27.9%; rank #12
google/gemini-3.1-flash-image-preview: 86.0% no, 14.0% yes; human gap 26.9%; rank #24
bytedance-seed/seed-2.0-lite: 71.0% no, 29.0% yes; human gap 11.8%; rank #14
meta-llama/llama-3.2-11b-vision-instruct: 62.0% no, 38.0% yes; human gap 2.9%; rank #3
allenai/molmo-2-8b: 60.0% no, 40.0% yes; human gap 0.8%; rank #6
anthropic/claude-opus-4.7: 36.0% no, 64.0% yes; human gap 23.1%; rank #38
anthropic/claude-opus-4.6: 2.0% no, 98.0% yes; human gap 57.1%; rank #63
anthropic/claude-sonnet-4.6: 0.0% no, 100.0% yes; human gap 59.1%; rank #62
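
The 6.0% model average can be approximately reproduced from the per-model rates listed above. The sketch below assumes the average is a simple unweighted mean over the 74 current runs, with the legacy GPT-4o (Spring 2024) baseline excluded; the page does not state how the average is computed, so both points are assumptions.

```python
# Yes rates (%) of the runs that gave at least one "yes", as listed above.
nonzero_yes = [1.0, 1.0, 2.6, 3.0, 4.0, 4.0, 6.0, 6.0, 8.0,
               11.0, 13.0, 14.0, 29.0, 38.0, 40.0, 64.0, 98.0, 100.0]

# 57 listed runs sit at 0.0% yes; dropping the legacy GPT-4o (Spring 2024)
# baseline leaves 56 current ones (assumption: the baseline is excluded).
ZERO_YES_CURRENT = 56

yes_rates = nonzero_yes + [0.0] * ZERO_YES_CURRENT
mean_yes = sum(yes_rates) / len(yes_rates)  # unweighted mean (assumed)

print(len(yes_rates), round(mean_yes, 1))  # 74 6.0
```

Under these assumptions the result matches both the displayed average and the count of 74 current runs, which suggests the legacy baseline is reported separately from the current model set.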

Selected human comments

No, it is a photo of a human sandwich.
only if they were all smooshed together! but separated? nope, that's ingredients
Is that Ranger Igor wearing the tomato? This is clearly a flock of Rangers, not a sandwich!

allenai/molmo-2-8b comments

While the sandwich-making characters are humorous, a real sandwich requires bread as an essential component. These costumes represent ingredients rather than the complete sandwich.
This image depicts people dressed as sandwich ingredients, but they are not actual food items. However, as they represent the components of a sandwich, the overall concept is still a sandwich.

anthropic/claude-sonnet-4.6 comments

Five people costumed as sandwich components (bread, cheese, meat, pickle, tomato, bread) marching in parade formation. Collectively, they ARE a sandwich. Arguably the most structurally accurate sandwich ever documented.
Five people dressed as sandwich components (bread, cheese, meat, onion, tomato, bread) in a parade. Collectively, they form a sandwich. Technically assembled, this BLT-ish creation qualifies, even if the ingredients are human.
