Models
Model-by-model sandwich analytics
This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.
The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.
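
As a rough illustration of what a per-image "gap" could look like, the sketch below assumes each image has a crowd "yes, it is a sandwich" share and a model "yes" share, both in percent, and takes the gap as their absolute difference in percentage points. The function names and every number in it are hypothetical, and the benchmark's actual scoring formula is not stated in this section.

```python
from statistics import mean


def image_gaps(crowd_yes, model_yes):
    """Absolute gap, in percentage points, for each image both sides rated."""
    return {
        image: abs(model_yes[image] - crowd_yes[image])
        for image in crowd_yes
        if image in model_yes
    }


def summarize(gaps):
    """Mean gap plus the images with the smallest and largest gap."""
    return {
        "mean_gap": mean(gaps.values()),
        "best_fit": min(gaps, key=gaps.get),
        "worst_miss": max(gaps, key=gaps.get),
    }


# Hypothetical shares (percent answering "sandwich"); not benchmark data.
crowd = {"Hot Dog": 40.0, "Chicken Wrap": 25.0, "Bacon Lettuce Tomato": 95.0}
model = {"Hot Dog": 100.0, "Chicken Wrap": 100.0, "Bacon Lettuce Tomato": 100.0}

gaps = image_gaps(crowd, model)
summary = summarize(gaps)
print(summary)
```

Under this assumed definition, a model that answers "sandwich" for everything looks best on images the crowd also accepts and worst on the ones the crowd rejects.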
Under development: this benchmark and its published results are provisional, not final.
| Highlight | Model | Headline value | Other stats |
|---|---|---|---|
| Best gap | meta-llama/llama-3.2-11b-vision-instruct | 16.3% mean crowd gap | Rank #1, score -53.6 |
| Widest miss | baidu/ernie-4.5-vl-28b-a3b | 32.4% crowd gap | Rank #27, score -1227.3, cost $0.30 |
| Cheapest | GPT-4o (2024 run) | $0.00 spend | Prompt share n/a, 0 tokens, 0 retries |
| Highest spend | google/gemini-3.1-pro-preview | $17.52 spend | Rank #10, 3.1M tokens, 11 retries |
| Most tokens | openai/gpt-4o-mini | 45.8M token load | Prompt share 99.8%, cost $3.41, 0 retries |
| Most retries | google/gemini-2.5-pro | 227 retry load | Completion 100.0%, spend $16.85, 4.3M tokens |

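
The spend and token figures above can be combined into a cost per million tokens, which makes runs of different sizes easier to compare. The sketch below is a minimal illustration: it pairs the spend values from the highlights with the token counts from the leaderboard table, and it assumes that "spend" and "cost" both mean total USD for the whole run. The helper name `cost_per_million` is hypothetical.

```python
def cost_per_million(spend_usd, tokens):
    """USD per one million tokens; None when no tokens were recorded."""
    if tokens == 0:
        return None
    return spend_usd / tokens * 1_000_000


# (spend in USD, total tokens) as reported on this page.
runs = {
    "google/gemini-3.1-pro-preview": (17.52, 3_090_040),
    "google/gemini-2.5-pro": (16.85, 4_307_134),
    "openai/gpt-4o-mini": (3.41, 45_761_300),
    "baidu/ernie-4.5-vl-28b-a3b": (0.30, 1_888_896),
    "GPT-4o (2024 run)": (0.00, 0),
}

rates = {name: cost_per_million(spend, tokens) for name, (spend, tokens) in runs.items()}

for name, rate in rates.items():
    label = "n/a" if rate is None else f"${rate:.2f} per 1M tokens"
    print(f"{name}: {label}")
```

The zero-token GPT-4o (2024 run) entry is reported as n/a because a rate is undefined when no tokens were recorded.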
Leaderboard
Alignment, total workload, and rank in one table
| Rank | Model | Alignment (score, mean gap) | Tokens | Images Evaluated | Total Eval Runs | Best Fit | Worst Miss |
|---|---|---|---|---|---|---|---|
| 1 | meta-llama/llama-3.2-11b-vision-instruct | -53.6 (16.3% mean gap) | 3,177,118 | 700 | 35 | 1.4 pt (Avocado Tea) | 45.8 pt (Cookie PB) |
| 2 | GPT-4o (2024 run) | -269.3 (22.1% mean gap) | 0 | 1,000 | 50 | n/a | n/a (no image gap) |
| 3 | openai/o3-pro | -330.8 (28.5% mean gap) | 757,641 | 720 | 36 | 3.7 pt (Bacon Lettuce Tomato) | 65.6 pt (Pickle Sandwich) |
| 4 | moonshotai/kimi-k2.5 | -373.9 (22.8% mean gap) | 2,216,641 | 1,220 | 61 | 3.7 pt (Bacon Lettuce Tomato) | 54.2 pt (Kitten in Bread) |
| 5 | x-ai/grok-4-fast | -408.3 (17.0% mean gap) | 1,599,381 | 1,980 | 99 | 0.1 pt (Pickle Sandwich) | 53.4 pt (Bagel PB&J) |
| 6 | qwen/qwen3.5-397b-a17b | -422.5 (25.1% mean gap) | 1,723,128 | 1,100 | 55 | 3.7 pt (Bacon Lettuce Tomato) | 75.6 pt (Chicken Wrap) |
| 7 | google/gemini-2.5-pro | -604.3 (25.0% mean gap) | 4,307,134 | 1,620 | 81 | 3.7 pt (Bacon Lettuce Tomato) | 54.2 pt (Kitten in Bread) |
| 8 | openai/gpt-5.4-pro | -632.0 (31.1% mean gap) | 1,146,134 | 1,120 | 56 | 3.7 pt (Bacon Lettuce Tomato) | 75.6 pt (Chicken Wrap) |
| 9 | openai/gpt-4o | -641.1 (22.4% mean gap) | 1,677,512 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 60.2 pt (Hot Dog) |
| 10 | google/gemini-3.1-pro-preview | -747.3 (27.5% mean gap) | 3,090,040 | 1,600 | 80 | 3.7 pt (Bacon Lettuce Tomato) | 64.3 pt (Pickle Sandwich) |
| 11 | google/gemma-3-12b-it | -752.3 (25.1% mean gap) | 874,701 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 54.2 pt (Kitten in Bread) |
| 12 | qwen/qwen-2-vl-72b-instruct | -760.0 (24.7% mean gap) | 1,888,854 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 54.2 pt (Kitten in Bread) |
| 13 | meta-llama/llama-4-scout | -789.8 (25.1% mean gap) | 2,654,272 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 53.4 pt (Bagel PB&J) |
| 14 | qwen/qwen2.5-vl-32b-instruct | -866.1 (26.5% mean gap) | 1,905,598 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 66.3 pt (Waffle Ice Cream) |
| 15 | google/gemma-3-27b-it | -940.6 (28.6% mean gap) | 863,650 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 66.3 pt (Waffle Ice Cream) |
| 16 | mistralai/pixtral-large-2411 | -954.9 (26.5% mean gap) | 3,964,789 | 2,000 | 100 | 1.4 pt (Pickle Sandwich) | 77.4 pt (Chicken Wrap) |
| 17 | amazon/nova-lite-v1 | -962.4 (28.6% mean gap) | 3,447,484 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 77.4 pt (Chicken Wrap) |
| 18 | openai/gpt-4.1-mini | -1006.7 (23.6% mean gap) | 3,548,876 | 3,100 | 155 | 3.7 pt (Bacon Lettuce Tomato) | 55.9 pt (Waffle Ice Cream) |
| 19 | z-ai/glm-4.6v | -1014.7 (26.9% mean gap) | 2,712,263 | 2,360 | 118 | 3.7 pt (Bacon Lettuce Tomato) | 61.3 pt (Chicken Wrap) |
| 20 | openai/gpt-5.4 | -1018.7 (28.8% mean gap) | 1,767,810 | 2,000 | 100 | 1.7 pt (KFC Double Down) | 77.4 pt (Chicken Wrap) |
| 21 | openai/gpt-4o-mini | -1037.9 (29.9% mean gap) | 45,761,300 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 72.4 pt (Chicken Wrap) |
| 22 | anthropic/claude-sonnet-4.6 | -1091.5 (30.1% mean gap) | 1,866,927 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 65.6 pt (Pickle Sandwich) |
| 23 | anthropic/claude-opus-4.6 | -1111.1 (30.8% mean gap) | 1,869,241 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 70.2 pt (Cigarette Sandwich) |
| 24 | openai/o3 | -1121.1 (27.5% mean gap) | 2,931,696 | 3,080 | 154 | 3.7 pt (Bacon Lettuce Tomato) | 51.4 pt (Bagel PB&J) |
| 25 | openai/gpt-4o-2024-11-20 | -1174.4 (26.0% mean gap) | 2,591,828 | 3,080 | 154 | 3.7 pt (Bacon Lettuce Tomato) | 65.6 pt (Pickle Sandwich) |
| 26 | meta-llama/llama-4-maverick | -1193.1 (24.5% mean gap) | 4,064,336 | 3,080 | 154 | 2.4 pt (Hot Dog) | 77.4 pt (Chicken Wrap) |
| 27 | baidu/ernie-4.5-vl-28b-a3b | -1227.3 (32.4% mean gap) | 1,888,896 | 2,000 | 100 | 3.7 pt (Bacon Lettuce Tomato) | 73.0 pt (Hamburger) |
| 28 | qwen/qwen2.5-vl-72b-instruct | -1229.8 (26.0% mean gap) | 2,903,489 | 3,080 | 154 | 3.7 pt (Bacon Lettuce Tomato) | 54.2 pt (Kitten in Bread) |
| 29 | minimax/minimax-01 | -1384.0 (26.5% mean gap) | 15,741,394 | 3,080 | 154 | 0.1 pt (Bagel PB&J) | 77.4 pt (Chicken Wrap) |
| 30 | amazon/nova-pro-v1 | -1468.6 (28.4% mean gap) | 5,300,428 | 3,080 | 154 | 1.9 pt (Pickle Sandwich) | 75.5 pt (Chicken Wrap) |
| 31 | openai/gpt-4.1 | -1477.4 (28.4% mean gap) | 2,604,300 | 3,080 | 154 | 3.7 pt (Bacon Lettuce Tomato) | 65.6 pt (Pickle Sandwich) |
| 32 | google/gemini-3-flash-preview | -1545.0 (29.1% mean gap) | 3,915,544 | 3,100 | 155 | 3.7 pt (Bacon Lettuce Tomato) | 65.6 pt (Pickle Sandwich) |

Each iteration (one entry in the Total Eval Runs column) covers 20 images. Partial runs are shown as fractional iterations when a model did not finish the full workload.
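
The relationship between the Images Evaluated and Total Eval Runs columns can be checked with a few lines of arithmetic. The sketch below is a minimal illustration; the helper name `iterations` and the 1,230-image partial run are hypothetical, while the other image counts come from the leaderboard.

```python
IMAGES_PER_ITERATION = 20  # stated in the note under the leaderboard


def iterations(images_evaluated):
    """Iterations completed, fractional when the last one is partial."""
    return images_evaluated / IMAGES_PER_ITERATION


# Image counts taken from the leaderboard rows.
full_runs = {700: 35, 1_000: 50, 2_000: 100, 3_080: 154, 3_100: 155}
for images, expected_runs in full_runs.items():
    assert iterations(images) == expected_runs

# Hypothetical partial run: 61 full iterations plus 10 of 20 images.
partial = iterations(1_230)
print(partial)
```

Every row in the table divides evenly by 20, so no fractional iterations appear in this snapshot.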
Failure atlas
Which photos broke the models hardest
mistralai/pixtral-large-2411: 77.4 pt
amazon/nova-lite-v1: 77.4 pt
openai/gpt-5.4: 77.4 pt
meta-llama/llama-4-maverick: 77.4 pt
minimax/minimax-01: 77.4 pt
openai/gpt-5.4-pro: 75.6 pt