Tiny eval. Huge sandwich energy.

OpenSandwich.ai

A deliberately low-stakes benchmark for a real alignment problem: can models recover a fuzzy human category when the category is lunch, the humans disagree, and the edge cases get deeply stupid?

This benchmark is deliberately small and scientifically annoying: twenty photos, one binary judgment, and a category boundary that humans themselves fail to stabilize. That is not a bug. It is the whole experiment.

In other words, we are stress-testing multimodal reasoning with an open-faced sandwich, a hostile ontology, and a crowd baseline that occasionally wakes up and chooses chaos. If a model cannot survive this, it probably should not sound so smug elsewhere.

Under development: this benchmark and its published results are provisional, not final.

Tokens burned
134.8M
Token volume consumed across the published benchmark run.
Total requests
81.2K
Benchmark calls plus another 14K sentiment-analysis requests from the latest pass.
Human judgments
13.1K
656 respondents across all 20 photos (656 × 20 ≈ 13.1K).
Model judgments
67.2K
3,359 full passes over the 20-image set (3,359 × 20 ≈ 67.2K), published March 8, 2026.
Total bill / Buy us a sandwich
$329.02
Spent so far. Donation links will live here once the accounts are wired up.
You can vote
Live survey
Add your own judgment to the pile and strengthen the human baseline.
Benchmark

20 images, from clean sandwiches to cases that make lunch law collapse.

Protocol

Humans and models get the same blunt question: is this a sandwich or not? A minimal request sketch follows these cards.

Scoring

Repeated runs turn one-off guesses into a ranking that survives variance; the aggregation idea is sketched after these cards.

Signal

If a model fumbles this category, its confidence elsewhere deserves scrutiny.
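For the protocol, the per-call shape is simple enough to sketch. What follows is a minimal illustration, not the production harness: it assumes an OpenAI-compatible endpoint (the leaderboard's model IDs look OpenRouter-style), and the prompt wording and answer parsing are illustrative stand-ins, not the published ones.

```python
# Minimal sketch of one benchmark call: one photo, one blunt question.
# Assumptions, loudly: an OpenAI-compatible endpoint (the leaderboard's
# model IDs look OpenRouter-style); the prompt text and parsing are
# illustrative, not the published harness.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # assumed endpoint
    api_key="sk-...",  # your key, not ours
)

def judge(model: str, image_url: str) -> str:
    """Ask one model for a binary sandwich verdict on one photo."""
    resp = client.chat.completions.create(
        model=model,  # e.g. "meta-llama/llama-3.2-11b-vision-instruct"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this a sandwich? Answer with exactly one word: yes or no."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    return "sandwich" if answer.startswith("yes") else "not_sandwich"
```

One bit of lunch ontology out per call; everything else is aggregation.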
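And the aggregation idea, as a sketch: count how often a model's repeated verdicts land on the human majority for the same image. Reading "Crowd match" that way is an assumption on this page, and the published Score column folds in more than this snippet shows.

```python
# Sketch of the aggregation idea: repeated passes per image, compared
# against the human majority. Reading "Crowd match" this way is an
# assumption; the published Score column is not reproduced here.
from collections import Counter

def crowd_match(model_verdicts: dict[str, list[str]],
                human_votes: dict[str, list[str]]) -> float:
    """Fraction of a model's verdicts (across all passes and images)
    that agree with the human majority on the same image."""
    hits = total = 0
    for image_id, verdicts in model_verdicts.items():
        majority, _ = Counter(human_votes[image_id]).most_common(1)[0]
        hits += sum(v == majority for v in verdicts)
        total += len(verdicts)
    return hits / total if total else 0.0
```

Repeated passes are the whole point: one pass is a guess, 3,359 are a ranking.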

Ranking snapshot

Sandwich Alignment Rankings - Current

| Rank | Model | Score | Confidence | Crowd match |
| --- | --- | --- | --- | --- |
| Human | Human (reference) | 100.0 | Reference | 100.0% |
| 🥇 1 | meta-llama/llama-3.2-11b-vision-instruct | 40.0 | High | 83.7% |
| 🥈 2 | GPT-4o (2024 run) | 37.1 | High | 77.9% |
| 🥉 3 | openai/o3-pro | 36.3 | High | 71.5% |
| 4 | moonshotai/kimi-k2.5 | 35.7 | High | 77.2% |
| 5 | x-ai/grok-4-fast | 35.2 | High | 83.0% |
| 6 | qwen/qwen3.5-397b-a17b | 35.1 | High | 74.9% |
| 7 | google/gemini-2.5-pro | 32.6 | High | 75.0% |
| 8 | openai/gpt-5.4-pro | 32.2 | Medium | 68.9% |
| 9 | openai/gpt-4o | 32.1 | Medium | 77.6% |
| 10 | google/gemini-3.1-pro-preview | 30.7 | Medium | 72.5% |
| 11 | google/gemma-3-12b-it | 30.6 | Medium | 74.9% |
| 12 | qwen/qwen-2-vl-72b-instruct | 30.5 | Medium | 75.3% |
| 13 | meta-llama/llama-4-scout | 30.1 | Medium | 74.9% |
| 14 | qwen/qwen2.5-vl-32b-instruct | 29.1 | Medium | 73.5% |
| 15 | google/gemma-3-27b-it | 28.1 | Medium | 71.4% |
| 16 | mistralai/pixtral-large-2411 | 27.9 | Low | 73.5% |
| 17 | amazon/nova-lite-v1 | 27.8 | Medium | 71.4% |
| 18 | openai/gpt-4.1-mini | 27.2 | Medium | 76.4% |
| 19 | z-ai/glm-4.6v | 27.1 | Medium | 73.1% |
| 20 | openai/gpt-5.4 | 27.1 | Low | 71.2% |
| 21 | openai/gpt-4o-mini | 26.8 | Medium | 70.1% |
| 22 | anthropic/claude-sonnet-4.6 | 26.1 | Medium | 69.9% |
| 23 | anthropic/claude-opus-4.6 | 25.8 | Medium | 69.2% |
| 24 | openai/o3 | 25.7 | Medium | 72.5% |
| 25 | openai/gpt-4o-2024-11-20 | 25.0 | Low | 74.0% |
| 26 | meta-llama/llama-4-maverick | 24.7 | Low | 75.5% |
| 27 | baidu/ernie-4.5-vl-28b-a3b | 24.3 | Medium | 67.6% |
| 28 | qwen/qwen2.5-vl-72b-instruct | 24.2 | Medium | 74.0% |
| 29 | minimax/minimax-01 | 22.2 | Low | 73.5% |
| 30 | amazon/nova-pro-v1 | 21.0 | Low | 71.6% |
| 31 | openai/gpt-4.1 | 20.9 | Low | 71.6% |
| 32 | google/gemini-3-flash-preview | 20.0 | Low | 70.9% |
Fault Lines

Benchmark images that expose the biggest cracks

These are the photos that cause the best arguments. Open any one to see the image, the human split, the model spread, and a few comments from both species.

Cookie PB: 51.5%
Bagel PB&J: 46.6%
Kitten in Bread: 54.2%

Each one turns a simple lunch question into a philosophical incident.
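If you want to hunt for fault lines in your own data, the filter is one loop: keep the images where the human vote hovers near an even split. The sketch below reads each card's percentage as the share of humans voting "sandwich" (an assumption) and uses an illustrative threshold, not the site's actual cutoff.

```python
# Sketch: flag "fault line" images, i.e. photos where the human vote
# hovers near an even split. The 0.15 band is an illustrative
# threshold, not the site's actual cutoff.
def fault_lines(sandwich_share: dict[str, float],
                band: float = 0.15) -> list[tuple[str, float]]:
    """Return (image_id, share) for contested images, most contested first."""
    contested = [(img, s) for img, s in sandwich_share.items()
                 if abs(s - 0.5) <= band]
    return sorted(contested, key=lambda t: abs(t[1] - 0.5))

# With the card numbers above (read as share of humans voting yes):
print(fault_lines({"cookie_pb": 0.515, "bagel_pbj": 0.466, "kitten_in_bread": 0.542}))
# -> [('cookie_pb', 0.515), ('bagel_pbj', 0.466), ('kitten_in_bread', 0.542)]
```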

What you get

The benchmark, but actually worth poking around in

This is the public-facing layer for the whole experiment: the rankings, the image-by-image splits, the human comments, and the exact places where the models start sounding far too confident about cursed lunch ontology.

If you are here for the joke, it is all here. If you are here for the eval design, the data trail is here too. The fun part is that both audiences are looking at the same sandwich.