32 of 32 model pipelines completed for this published run.
The Sandwich Alignment Benchmark
The ranking is the summary. This section exposes the underlying evidence: the images, vote splits, and failure cases that make sandwich alignment funny on the surface and technically useful underneath.
Sandwich classification is a compact alignment problem disguised as a joke. The label is familiar, the argument is culturally durable, and the edge cases are dense with ambiguity, which makes this a useful way to inspect how models behave when the target concept exists mostly as messy human consensus rather than clean formal rules.
Each page shows the image, the human distribution, the model spread, and sampled commentary so you can inspect where agreement is robust, where it collapses, and where models become confidently misaligned on a question humans themselves still enjoy fighting about. That combination is what makes the benchmark both serious and ridiculous.
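The per-item numbers on each card (model average, max gap, closest model) can be sketched in a few lines. This is a minimal sketch, not the benchmark's actual pipeline. It assumes each model's answer is reduced to a yes percentage, that "Max gap" is the largest absolute difference between the human yes share and any single model's yes share, and that "Closest model" is the model (or tie) with the smallest such difference. Those definitions are an inference from the displayed numbers, not something the page states. The function name `summarize_item`, the model names, and the example values are hypothetical.

```python
from statistics import mean


def summarize_item(human_yes: float, model_yes: dict[str, float]) -> dict:
    """Summarize one item from yes-vote percentages on a 0-100 scale.

    ASSUMPTION: "gap" means |model yes % - human yes %| per model.
    """
    # Absolute distance of every model from the human yes share.
    gaps = {name: abs(yes - human_yes) for name, yes in model_yes.items()}

    # Models that share the smallest gap are reported together as a tie.
    smallest = min(gaps.values())
    closest = sorted(name for name, gap in gaps.items() if gap == smallest)

    return {
        "model_average_yes": round(mean(model_yes.values()), 1),
        "max_gap": round(max(gaps.values()), 1),
        "closest_models": closest,
    }


# Hypothetical item: 40% of humans say yes, three made-up models disagree.
summary = summarize_item(
    human_yes=40.0,
    model_yes={"model-a": 100.0, "model-b": 50.0, "model-c": 0.0},
)
print(summary)
```

Under these assumptions, a single model answering 0% or 100% yes is enough to push the max gap to the full size of the human yes or no share, even when the model average sits close to the human split.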

Kitten in Bread
A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped and operationally a felony against common sense. This is where ontology leaves the lab and starts posting.

Bagel PB&J
A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topology discourse. The filling is real, the bread surfaces are opposing, and the geometry is actively trying to get cited.

Hot Dog
A hot dog sits in its split bun, the most litigated piece of street food in American semantics. One continuous bread artifact, one sausage, infinite discourse from people who should probably log off.

01. Bacon Lettuce Tomato
A perfectly legible BLT sits on toasted bread, the kind of canonical positive example that makes even the w...
- Human: 96.3% yes, 3.7% no
- Model average: 100.0% yes, 0.0% no
- Max gap: 3.7%
- Closest model: 31-model tie

02. Dodge Van
A late-70s Dodge van is parked here like someone tried to jailbreak the ontology with Detroit sheet metal.
- Human: 7.0% yes, 93.0% no
- Model average: 0.4% yes, 99.6% no
- Max gap: 7.0%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

03. Sub Sandwich
A long sub packed with salami, cheddar, lettuce, and tomato sprawls across the frame like a benchmark overf...
- Human: 94.5% yes, 5.5% no
- Model average: 99.5% yes, 0.5% no
- Max gap: 8.8%
- Closest model: 30-model tie

04. Sandwich Costume
A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich...
- Human: 40.9% yes, 59.1% no
- Model average: 8.7% yes, 91.3% no
- Max gap: 59.1%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

05. Grilled Cheese
A browned grilled cheese sits there radiating the confidence of a unit test with 100% coverage and no hidde...
- Human: 95.6% yes, 4.4% no
- Model average: 99.6% yes, 0.4% no
- Max gap: 7.0%
- Closest model: 30-model tie

06. Grilled Cheese Pineapple
Ham, cheese, and pineapple are trapped between toasted bread in a move that feels both culinarily legal and...
- Human: 91.7% yes, 8.3% no
- Model average: 99.5% yes, 0.6% no
- Max gap: 8.9%
- Closest model: 30-model tie

07. Kitten in Bread
A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped...
- Human: 54.2% yes, 45.8% no
- Model average: 9.9% yes, 90.1% no
- Max gap: 54.2%
- Closest model: x-ai/grok-4-fast

08. Hamburger
A standard burger stacks bun, patty, lettuce, and tomato in the exact format that turns otherwise competent...
- Human: 73.0% yes, 27.0% no
- Model average: 96.2% yes, 3.8% no
- Max gap: 73.0%
- Closest model: qwen/qwen2.5-vl-32b-instruct

09. Hashbrown Sandwich
A breakfast stack uses hash-brown slabs as the outer chassis for bacon, egg, and cheese, like fast-food R&D...
- Human: 59.4% yes, 40.6% no
- Model average: 92.1% yes, 7.9% no
- Max gap: 40.6%
- Closest model: openai/gpt-4.1-mini

10. Hot Dog
A hot dog sits in its split bun, the most litigated piece of street food in American semantics.
- Human: 39.8% yes, 60.2% no
- Model average: 54.2% yes, 45.8% no
- Max gap: 60.2%
- Closest model: meta-llama/llama-4-maverick

11. Pickle Sandwich
A hollowed pickle is doing bread cosplay around ham, cheese, and tomato, which is either keto ingenuity or...
- Human: 65.6% yes, 34.4% no
- Model average: 53.1% yes, 46.9% no
- Max gap: 65.6%
- Closest model: x-ai/grok-4-fast

12. Avocado Tea
A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip.
- Human: 92.8% yes, 7.2% no
- Model average: 99.6% yes, 0.4% no
- Max gap: 7.2%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

13. Panini
A pressed panini with greens and filling compressed into sharp grill lines shows up like a normal sandwich...
- Human: 92.4% yes, 7.6% no
- Model average: 99.3% yes, 0.7% no
- Max gap: 15.2%
- Closest model: 30-model tie

14. Cookie PB
Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit...
- Human: 51.5% yes, 48.5% no
- Model average: 33.3% yes, 66.7% no
- Max gap: 51.5%
- Closest model: qwen/qwen3.5-397b-a17b

15. Chicken Wrap
A chicken Caesar wrap bundles meat, lettuce, and sauce into a tortilla tube that lives permanently in sandw...
- Human: 22.6% yes, 77.4% no
- Model average: 43.1% yes, 56.9% no
- Max gap: 77.4%
- Closest model: x-ai/grok-4-fast

16. Waffle Ice Cream
Ice cream wedged between waffles presents itself as a dessert sandwich with zero shame and excellent market...
- Human: 66.3% yes, 33.7% no
- Model average: 64.9% yes, 35.1% no
- Max gap: 66.3%
- Closest model: qwen/qwen-2-vl-72b-instruct

17. Sloppy Joe
A sloppy joe leaks seasoned meat out of a bun with the chaotic confidence of legacy code that somehow still...
- Human: 79.4% yes, 20.6% no
- Model average: 99.0% yes, 1.1% no
- Max gap: 20.6%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

18. Cigarette Sandwich
Two slices of bread cradle a row of cigarettes in an image that feels less like cuisine and more like a fai...
- Human: 29.8% yes, 70.2% no
- Model average: 10.1% yes, 89.9% no
- Max gap: 70.2%
- Closest model: x-ai/grok-4-fast

19. KFC Double Down
The Double Down replaces bread with fried chicken fillets and dares the classifier to explain why outer lay...
- Human: 55.7% yes, 44.3% no
- Model average: 65.2% yes, 34.8% no
- Max gap: 55.7%
- Closest model: openai/gpt-5.4

20. Bagel PB&J
A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topol...
- Human: 46.6% yes, 53.4% no
- Model average: 84.6% yes, 15.4% no
- Max gap: 53.4%
- Closest model: minimax/minimax-01