Human distribution: 51.5% yes, 48.5% no over 656 explicit votes.
Model average distribution: 34.8% yes, 65.3% no across the current model set.
Closest current model: 50.0% yes.
Least aligned models: 21-way tie (bytedance-seed/seed-2.0-lite, bytedance-seed/seed-1.6-flash, qwen/qwen-2-vl-72b-instruct, +18 more), 51.5 point gap.
Legacy GPT-4o baseline: 46.0% yes, a 5.5 point gap against humans.
Biggest model gap: 51.5 percentage points on this image.
Current classification: Human knife-edge.
Models compared: 67 current runs.
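The page does not say how the point gaps are calculated. The minimal sketch below assumes a gap is the absolute difference between a model's yes rate and the human yes rate, in percentage points. That assumption matches the reported GPT-4o figure (46.0% against 51.5% gives 5.5 points). The function name `yes_gap` and the dictionary layout are hypothetical, chosen only for illustration.

```python
def yes_gap(human_yes: float, model_yes: float) -> float:
    """Absolute gap, in percentage points, between two yes rates.

    Assumption: the page's "point gap" is |human - model| on the yes rate.
    """
    return round(abs(human_yes - model_yes), 1)


HUMAN_YES = 51.5  # human yes rate reported for this image

# Yes rates reported on this page (labels are illustrative, not model IDs).
reported = {
    "closest current model": 50.0,
    "legacy GPT-4o baseline": 46.0,
    "model average": 34.8,
}

gaps = {name: yes_gap(HUMAN_YES, rate) for name, rate in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} point gap")
```

Under the same assumption, a 51.5 point gap is what a model answering yes 0% of the time would produce, since `yes_gap(51.5, 0.0)` equals the human yes rate itself.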

CPB
Human knife-edge
Benchmark image 14
Cookie PB
Cookie and peanut butter "Sandwich"
Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit by a startup with no adult in finance. It breaks the bread prior while preserving the sandwich geometry almost too cleanly.
Under development: this benchmark and its published results are provisional, not final.
At a glance
How this photo split the room
qwen/qwen3.5-flash-02-23
21-way tie
Benchmark context
Model spread
How Models Align with Human Responses
This compares each model against human responses to show how closely it aligns with people.