Human distribution: 40.9% yes, 59.1% no over 656 explicit votes.
Model average distribution: 5.7% yes, 94.3% no across the current model set.
Closest current model: 40.0% yes.
Least aligned model: 59.1 point gap.
Legacy GPT-4o baseline: 0.0% yes, a 40.9 point gap against humans.
Biggest model gap: 59.1 percentage points on this image.
Current classification: Human knife-edge.
Models compared: 67 current runs.

PPL Human knife-edge
Benchmark image 04
Sandwich Costume
Human "Sandwich"
A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich that still fails the crucial requirement of being lunch. It is the kind of edge case that makes literalists sound insane and compositionalists sound worse.
Under development: this benchmark and its published results are provisional, not final.
At a glance
How this photo split the room
allenai/molmo-2-8b
anthropic/claude-sonnet-4.6
Benchmark context
Model spread
How Models Align with Human Responses
This compares each model against human responses to show how closely it aligns with people.
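The page does not spell out how a "point gap" is computed. A minimal sketch, assuming it is the absolute difference between a model's yes rate and the human yes rate, in percentage points. That assumption is consistent with the legacy GPT-4o figures reported for this photo (0.0% yes, 40.9 point gap against a 40.9% human yes rate). The function name `point_gap` and the dictionary labels are illustrative, not taken from the benchmark.

```python
def point_gap(model_yes_pct: float, human_yes_pct: float) -> float:
    """Absolute difference in 'yes' rate, in percentage points.

    Assumption: this is how the benchmark defines its gap; the page
    itself does not state the formula.
    """
    return round(abs(model_yes_pct - human_yes_pct), 1)


# Human yes rate reported for this photo.
HUMAN_YES = 40.9

# Yes rates reported on the page; the keys are illustrative labels.
reported_yes = {
    "legacy GPT-4o baseline": 0.0,
    "model average": 5.7,
    "closest current model": 40.0,
}

gaps = {name: point_gap(yes, HUMAN_YES) for name, yes in reported_yes.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} point gap")
```

Under the same assumption, a 59.1 point gap against a 40.9% human yes rate can only come from a model that answered yes 100% of the time, since a yes rate cannot fall below 0%.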