Open Source Benchmark

The Sandwich Alignment Benchmark

The ranking is the summary. This section exposes the underlying evidence: the images, vote splits, and failure cases that make sandwich alignment funny on the surface and technically useful underneath.

Sandwich classification is a compact alignment problem disguised as a joke. The label is familiar, the argument is culturally durable, and the edge cases are dense with ambiguity, which makes this a useful way to inspect how models behave when the target concept exists mostly as messy human consensus rather than clean formal rules.

Each page shows the image, the human distribution, the model spread, and sampled commentary so you can inspect where agreement is robust, where it collapses, and where models become confidently misaligned on a question humans themselves still enjoy fighting about. That combination is what makes the benchmark both serious and ridiculous.

BenchmarkIn progress
68models
65at 100+ runs
656human voters
7.1KAI response sets
Completion threshold65 / 68 complete
Human vote62.8%
62.8% yes
62.8%yes
37.2%no

People lean sandwich overall, but 4 photos still land within 8.0 pt of a true split.

Model alignment25.8 pt mean gap
Humans say yes62.8%
Model field says yes66.2%
+3.4 ptyes-lean vs humans
7photos within 12 pt
Disagreement4 near ties
7clear consensus
9messy middle
4near tie

Seven photos are obvious to almost everyone. The other thirteen are where the benchmark actually gets interesting.

Biggest model gap48.2 pt

Kitten in Bread

Humans say yes54.2%
Model field says yes8.7%
45.5 ptfield delta
54.2 ptworst single-model miss

The signature failure case: humans lean yes, but the model field mostly says no.

Most contested1.5 pt from tie

Cookie PB

Humans say yes51.5%
Model field says yes34.8%
39.6 ptmean model gap
16.8 ptfield delta

The cleanest knife-edge in the set: humans are almost perfectly split, and models still lean no.

BLT
BLTPeople mostly said yes

01. Bacon Lettuce Tomato

A perfectly legible BLT sits on toasted bread, the kind of canonical positive example that makes even the worst eval look solved. If your model misses this one, it does not need fine-tuning; it needs adult supervision.

Human
96.3% yes3.7% no
Model average
99.8% yes0.2% no
Max gap
6.3%
Closest model
allenai/molmo-2-8b
Dodge Van
DVNPeople mostly said no

02. Dodge Van

A late-70s Dodge van is parked here like someone tried to jailbreak the ontology with Detroit sheet metal. It is the purest negative control in the set: all sandwich discourse, zero mayo.

Human
7.0% yes93.0% no
Model average
0.2% yes99.8% no
Max gap
7.0%
Closest model
meta-llama/llama-3.2-11b-vision-instruct
Sub Sandwich
SUBPeople mostly said yes

03. Sub Sandwich

A long sub packed with salami, cheddar, lettuce, and tomato sprawls across the frame like a benchmark overfit to obvious wins. It is unquestionably a sandwich, unless you are the kind of engineer who opens a ticket about submarine semantics.

Human
94.5% yes5.5% no
Model average
99.7% yes0.3% no
Max gap
9.5%
Closest model
allenai/molmo-2-8b
People dressed as a sandwich
PPLHuman knife-edge

04. Sandwich Costume

A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich that still fails the crucial requirement of being lunch. It is the kind of edge case that makes literalists sound insane and compositionalists sound worse.

Human
40.9% yes59.1% no
Model average
5.7% yes94.3% no
Max gap
59.1%
Closest model
allenai/molmo-2-8b
Grilled cheese sandwich
GCSPeople mostly said yes

05. Grilled Cheese

A browned grilled cheese sits there radiating the confidence of a unit test with 100% coverage and no hidden mocks. Two bread faces, molten cheese center, zero ontology drama unless you are trying very hard to be annoying.

Human
95.6% yes4.4% no
Model average
99.8% yes0.2% no
Max gap
6.6%
Closest model
allenai/molmo-2-8b
Grilled cheese sandwich with pineapple
GCPPeople mostly said yes

06. Grilled Cheese Pineapple

Ham, cheese, and pineapple are trapped between toasted bread in a move that feels both culinarily legal and socially destabilizing. The sandwich question is easy; the real benchmark is whether your priors can survive the pineapple.

Human
91.7% yes8.3% no
Model average
99.5% yes0.5% no
Max gap
12.7%
Closest model
nvidia/nemotron-nano-12b-v2-vl
Kitten in bread
KTYHuman knife-edge

07. Kitten in Bread

A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped and operationally a felony against common sense. This is where ontology leaves the lab and starts posting.

Human
54.2% yes45.8% no
Model average
8.7% yes91.3% no
Max gap
54.2%
Closest model
openai/gpt-4.1-nano
Hamburger
HMBSplit concept

08. Hamburger

A standard burger stacks bun, patty, lettuce, and tomato in the exact format that turns otherwise competent adults into constitutional originalists. It is the canonical 'yes in theory, no in vibes' sandwich fight.

Human
73.0% yes27.0% no
Model average
96.6% yes3.4% no
Max gap
73.0%
Closest model
meta-llama/llama-3.2-11b-vision-instruct
Hashbrown breakfast sandwich
HSHHuman knife-edge

09. Hashbrown Sandwich

A breakfast stack uses hash-brown slabs as the outer chassis for bacon, egg, and cheese, like fast-food R&D got too comfortable with category theory. It is handheld, layered, and deeply committed to making 'bread' feel optional.

Human
59.4% yes40.6% no
Model average
91.4% yes8.6% no
Max gap
40.6%
Closest model
openai/gpt-4.1-mini
Hot dog
DOGSplit concept

10. Hot Dog

A hot dog sits in its split bun, the most litigated piece of street food in American semantics. One continuous bread artifact, one sausage, infinite discourse from people who should probably log off.

Human
39.8% yes60.2% no
Model average
45.4% yes54.6% no
Max gap
60.2%
Closest model
google/gemini-2.5-flash
Picklewich
PKLSplit concept

11. Pickle Sandwich

A hollowed pickle is doing bread cosplay around ham, cheese, and tomato, which is either keto ingenuity or a user trying to adversarially attack the definition. It has sandwich posture, but the cucumber vibes make everyone nervous.

Human
65.6% yes34.4% no
Model average
59.0% yes41.0% no
Max gap
65.6%
Closest model
x-ai/grok-4-fast
Avocado egg tea sandwich
TEAPeople mostly said yes

12. Avocado Tea

A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip. It is obviously a sandwich, just one that speaks in a quieter accent than the rest of the dataset.

Human
92.8% yes7.2% no
Model average
99.3% yes0.7% no
Max gap
14.8%
Closest model
meta-llama/llama-3.2-11b-vision-instruct
Panini
PNIPeople mostly said yes

13. Panini

A pressed panini with greens and filling compressed into sharp grill lines shows up like a normal sandwich after a product manager discovered heat. It is structurally boring in the best possible way and still somehow controversial to a few models.

Human
92.4% yes7.6% no
Model average
99.7% yes0.3% no
Max gap
14.4%
Closest model
nvidia/nemotron-nano-12b-v2-vl
Cookie and PB sandwich
CPBHuman knife-edge

14. Cookie PB

Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit by a startup with no adult in finance. It breaks the bread prior while preserving the sandwich geometry almost too cleanly.

Human
51.5% yes48.5% no
Model average
34.8% yes65.3% no
Max gap
51.5%
Closest model
qwen/qwen3.5-flash-02-23
Chicken wrap
WRPSplit concept

15. Chicken Wrap

A chicken Caesar wrap bundles meat, lettuce, and sauce into a tortilla tube that lives permanently in sandwich-adjacent limbo. It is the kind of object that makes taxonomies collapse into a Slack thread.

Human
22.6% yes77.4% no
Model average
47.7% yes52.3% no
Max gap
77.4%
Closest model
bytedance-seed/seed-2.0-lite
Waffle ice cream sandwich
WICSplit concept

16. Waffle Ice Cream

Ice cream wedged between waffles presents itself as a dessert sandwich with zero shame and excellent marketing instincts. It is not lunch, but it absolutely understands the assignment.

Human
66.3% yes33.7% no
Model average
73.4% yes26.6% no
Max gap
66.3%
Closest model
x-ai/grok-4.20-beta
Sloppy joe
SLJSplit concept

17. Sloppy Joe

A sloppy joe leaks seasoned meat out of a bun with the chaotic confidence of legacy code that somehow still pays revenue. It is clearly sandwich-shaped, even if the change-management story is grim.

Human
79.4% yes20.6% no
Model average
99.6% yes0.4% no
Max gap
20.6%
Closest model
meta-llama/llama-3.2-11b-vision-instruct
Cigarette sandwich
CIGSplit concept

18. Cigarette Sandwich

Two slices of bread cradle a row of cigarettes in an image that feels less like cuisine and more like a failed alignment experiment. The structure says sandwich; every other signal says call a therapist.

Human
29.8% yes70.2% no
Model average
10.6% yes89.4% no
Max gap
70.2%
Closest model
openai/gpt-5.2
KFC double down
KFCHuman knife-edge

19. KFC Double Down

The Double Down replaces bread with fried chicken fillets and dares the classifier to explain why outer layers must be grain-based. It is a sandwich-shaped act of aggression from the late-capitalist frontier.

Human
55.7% yes44.3% no
Model average
69.5% yes30.5% no
Max gap
55.7%
Closest model
openai/gpt-5.4
Perpendicular bagel PB&J
WTFHuman knife-edge

20. Bagel PB&J

A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topology discourse. The filling is real, the bread surfaces are opposing, and the geometry is actively trying to get cited.

Human
46.6% yes53.4% no
Model average
83.4% yes16.6% no
Max gap
53.4%
Closest model
minimax/minimax-01
Sandwich Alignment Benchmark Results | opensandwich.ai