Open Source Benchmark

The Sandwich Alignment Benchmark

The ranking is the summary. This section exposes the underlying evidence: the images, vote splits, and failure cases that make sandwich alignment funny on the surface and technically useful underneath.

Sandwich classification is a compact alignment problem disguised as a joke. The label is familiar, the argument is culturally durable, and the edge cases are dense with ambiguity. That makes it a useful way to inspect how models behave when the target concept exists mostly as messy human consensus rather than clean formal rules.

Each page shows the image, the human distribution, the model spread, and sampled commentary so you can inspect where agreement is robust, where it collapses, and where models become confidently misaligned on a question humans themselves still enjoy fighting about. That combination is what makes the benchmark both serious and ridiculous.

Benchmark: Live
20 photos
32 models
155 runs / model

32 of 32 model pipelines completed for this published run.

Human vote: 62.8% yes, 37.2% no

Average human vote across all photos.

Model vote: 65.6%
Average yes rate: 65.6%
Average no rate: 34.4%

Average model vote across the same benchmark.
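As a sanity check, the two headline averages can be recomputed from the per-photo yes rates listed on the photo cards. The sketch below assumes each headline figure is a plain unweighted mean across the 20 photos; the page does not state the aggregation rule, and the function name `average_yes_rate` is only illustrative.

```python
from statistics import mean

# Per-photo yes rates (percent), copied from photo cards 01-20 on this page.
HUMAN_YES = [96.3, 7.0, 94.5, 40.9, 95.6, 91.7, 54.2, 73.0, 59.4, 39.8,
             65.6, 92.8, 92.4, 51.5, 22.6, 66.3, 79.4, 29.8, 55.7, 46.6]
MODEL_YES = [100.0, 0.4, 99.5, 8.7, 99.6, 99.5, 9.9, 96.2, 92.1, 54.2,
             53.1, 99.6, 99.3, 33.3, 43.1, 64.9, 99.0, 10.1, 65.2, 84.6]


def average_yes_rate(per_photo_yes):
    """Unweighted mean across photos (an assumption), rounded to one decimal."""
    return round(mean(per_photo_yes), 1)


human_avg = average_yes_rate(HUMAN_YES)
model_avg = average_yes_rate(MODEL_YES)
print(f"Human: {human_avg}% yes, {round(100 - human_avg, 1)}% no")
print(f"Model: {model_avg}% yes, {round(100 - model_avg, 1)}% no")
```

Under that assumption the card values reproduce the 62.8% and 65.6% figures shown above, which suggests that every photo counts equally regardless of how many votes it received.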

Disagreement: Dense
7 consensus
4 near tie

7 images stay relatively easy for the model field.
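The status labels on the photo cards ("People mostly said yes", "Human knife-edge", "Split concept", and so on) look like simple buckets over the human yes rate. The sketch below is one guess at such a rule: the 40-60 knife-edge band and the 90% consensus cutoff are assumptions chosen because they are compatible with the labels shown on this page, not thresholds the benchmark documents.

```python
from collections import Counter

# Human yes rates (percent), copied from photo cards 01-20 on this page.
HUMAN_YES = [96.3, 7.0, 94.5, 40.9, 95.6, 91.7, 54.2, 73.0, 59.4, 39.8,
             65.6, 92.8, 92.4, 51.5, 22.6, 66.3, 79.4, 29.8, 55.7, 46.6]


def human_label(yes_pct, knife_edge=(40.0, 60.0), consensus=90.0):
    """Bucket a photo by its human yes rate. All thresholds are assumed."""
    if yes_pct >= consensus:
        return "People mostly said yes"
    if yes_pct <= 100.0 - consensus:
        return "People mostly said no"
    if knife_edge[0] <= yes_pct <= knife_edge[1]:
        return "Human knife-edge"
    return "Split concept"


label_counts = Counter(human_label(pct) for pct in HUMAN_YES)
for label, count in label_counts.items():
    print(f"{count:2d}  {label}")
```

Other cutoffs would fit the same 20 data points equally well (any consensus threshold between 79.4 and 91.7, for instance), so treat the numbers as placeholders.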

Biggest model gap: 48.7%

Kitten in Bread

Human yes: 54.2%
Gap: 48.7%

The widest average model disagreement on any single image.
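The gap figures can be read as distances between each model's yes rate and the human yes rate for the same image. The sketch below assumes that reading: "max gap" as the largest absolute difference across models (which matches the card values, for example 3.7% on a photo where humans said 96.3% yes), "mean gap" as the average absolute difference, and "closest model" as the model or models with the smallest difference. The function name and the per-model rates in the example are hypothetical, since the page does not list individual model rates.

```python
def gap_metrics(human_yes, model_yes):
    """Summarize model-vs-human gaps for one image.

    human_yes: human yes rate in percent.
    model_yes: dict mapping model name -> that model's yes rate in percent.
    """
    gaps = {name: abs(rate - human_yes) for name, rate in model_yes.items()}
    smallest = min(gaps.values())
    return {
        "mean_gap": round(sum(gaps.values()) / len(gaps), 1),
        "max_gap": round(max(gaps.values()), 1),
        # Several models can share the smallest gap ("N-model tie" on the cards).
        "closest": sorted(name for name, gap in gaps.items() if gap == smallest),
    }


# Hypothetical per-model rates; only the human rate (54.2) comes from the page.
example = gap_metrics(54.2, {"model-a": 0.0, "model-b": 10.0, "model-c": 100.0})
print(example)
```

A model that always answers no on an image where humans voted 54.2% yes contributes a gap of 54.2, which is how a knife-edge photo can produce a large maximum gap even when the model average looks decisive.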

Most contested: 51.5%

Cookie PB

26.5% mean model gap
1 fully one-sided image

The image that most clearly splits people down the middle.

BLT
BLT: People mostly said yes

01. Bacon Lettuce Tomato

A perfectly legible BLT sits on toasted bread, the kind of canonical positive example that makes even the w...

H: 96.3% yes, 3.7% no
Model average: 100.0% yes, 0.0% no
Max gap: 3.7%
Closest model: 31-model tie

Dodge Van
DVN: People mostly said no

02. Dodge Van

A late-70s Dodge van is parked here like someone tried to jailbreak the ontology with Detroit sheet metal.

H: 7.0% yes, 93.0% no
Model average: 0.4% yes, 99.6% no
Max gap: 7.0%
Closest model: meta-llama/llama-3.2-11b-vision-instruct

Sub Sandwich
SUB: People mostly said yes

03. Sub Sandwich

A long sub packed with salami, cheddar, lettuce, and tomato sprawls across the frame like a benchmark overf...

H: 94.5% yes, 5.5% no
Model average: 99.5% yes, 0.5% no
Max gap: 8.8%
Closest model: 30-model tie

People dressed as a sandwich
PPL: Human knife-edge

04. Sandwich Costume

A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich...

H: 40.9% yes, 59.1% no
Model average: 8.7% yes, 91.3% no
Max gap: 59.1%
Closest model: meta-llama/llama-3.2-11b-vision-instruct

Grilled cheese sandwich
GCS: People mostly said yes

05. Grilled Cheese

A browned grilled cheese sits there radiating the confidence of a unit test with 100% coverage and no hidde...

H: 95.6% yes, 4.4% no
Model average: 99.6% yes, 0.4% no
Max gap: 7.0%
Closest model: 30-model tie

Grilled cheese sandwich with pineapple
GCP: People mostly said yes

06. Grilled Cheese Pineapple

Ham, cheese, and pineapple are trapped between toasted bread in a move that feels both culinarily legal and...

H: 91.7% yes, 8.3% no
Model average: 99.5% yes, 0.6% no
Max gap: 8.9%
Closest model: 30-model tie

Kitten in bread
KTY: Human knife-edge

07. Kitten in Bread

A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped...

H: 54.2% yes, 45.8% no
Model average: 9.9% yes, 90.1% no
Max gap: 54.2%
Closest model: x-ai/grok-4-fast

Hamburger
HMB: Split concept

08. Hamburger

A standard burger stacks bun, patty, lettuce, and tomato in the exact format that turns otherwise competent...

H: 73.0% yes, 27.0% no
Model average: 96.2% yes, 3.8% no
Max gap: 73.0%
Closest model: qwen/qwen2.5-vl-32b-instruct

Hashbrown breakfast sandwich
HSH: Human knife-edge

09. Hashbrown Sandwich

A breakfast stack uses hash-brown slabs as the outer chassis for bacon, egg, and cheese, like fast-food R&D...

H: 59.4% yes, 40.6% no
Model average: 92.1% yes, 7.9% no
Max gap: 40.6%
Closest model: openai/gpt-4.1-mini

Hot dog
DOG: Split concept

10. Hot Dog

A hot dog sits in its split bun, the most litigated piece of street food in American semantics.

H: 39.8% yes, 60.2% no
Model average: 54.2% yes, 45.8% no
Max gap: 60.2%
Closest model: meta-llama/llama-4-maverick

Picklewich
PKL: Split concept

11. Pickle Sandwich

A hollowed pickle is doing bread cosplay around ham, cheese, and tomato, which is either keto ingenuity or...

H: 65.6% yes, 34.4% no
Model average: 53.1% yes, 46.9% no
Max gap: 65.6%
Closest model: x-ai/grok-4-fast

Avocado egg tea sandwich
TEA: People mostly said yes

12. Avocado Tea

A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip.

H: 92.8% yes, 7.2% no
Model average: 99.6% yes, 0.4% no
Max gap: 7.2%
Closest model: meta-llama/llama-3.2-11b-vision-instruct

Panini
PNI: People mostly said yes

13. Panini

A pressed panini with greens and filling compressed into sharp grill lines shows up like a normal sandwich...

H: 92.4% yes, 7.6% no
Model average: 99.3% yes, 0.7% no
Max gap: 15.2%
Closest model: 30-model tie

Cookie and PB sandwich
CPB: Human knife-edge

14. Cookie PB

Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit...

H: 51.5% yes, 48.5% no
Model average: 33.3% yes, 66.7% no
Max gap: 51.5%
Closest model: qwen/qwen3.5-397b-a17b

Chicken wrap
WRP: Split concept

15. Chicken Wrap

A chicken Caesar wrap bundles meat, lettuce, and sauce into a tortilla tube that lives permanently in sandw...

H: 22.6% yes, 77.4% no
Model average: 43.1% yes, 56.9% no
Max gap: 77.4%
Closest model: x-ai/grok-4-fast

Waffle ice cream sandwich
WIC: Split concept

16. Waffle Ice Cream

Ice cream wedged between waffles presents itself as a dessert sandwich with zero shame and excellent market...

H: 66.3% yes, 33.7% no
Model average: 64.9% yes, 35.1% no
Max gap: 66.3%
Closest model: qwen/qwen-2-vl-72b-instruct

Sloppy joe
SLJ: Split concept

17. Sloppy Joe

A sloppy joe leaks seasoned meat out of a bun with the chaotic confidence of legacy code that somehow still...

H: 79.4% yes, 20.6% no
Model average: 99.0% yes, 1.1% no
Max gap: 20.6%
Closest model: meta-llama/llama-3.2-11b-vision-instruct

Cigarette sandwich
CIG: Split concept

18. Cigarette Sandwich

Two slices of bread cradle a row of cigarettes in an image that feels less like cuisine and more like a fai...

H: 29.8% yes, 70.2% no
Model average: 10.1% yes, 89.9% no
Max gap: 70.2%
Closest model: x-ai/grok-4-fast

KFC double down
KFC: Human knife-edge

19. KFC Double Down

The Double Down replaces bread with fried chicken fillets and dares the classifier to explain why outer lay...

H: 55.7% yes, 44.3% no
Model average: 65.2% yes, 34.8% no
Max gap: 55.7%
Closest model: openai/gpt-5.4

Perpendicular bagel PB&J
WTF: Human knife-edge

20. Bagel PB&J

A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topol...

H: 46.6% yes, 53.4% no
Model average: 84.6% yes, 15.4% no
Max gap: 53.4%
Closest model: minimax/minimax-01