
# Model-by-model sandwich analytics

This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.

The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.
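The page does not formally specify the alignment metric, but the "mean gap" figures suggest a per-image comparison between the model's verdict and the crowd's. A minimal sketch, assuming the gap is the absolute difference between the model's "sandwich" probability and the crowd's "sandwich" share (the function name and sample values are illustrative, not taken from the benchmark):

```python
def mean_crowd_gap(model_probs, crowd_shares):
    """Mean absolute gap, in percentage points, between a model's
    'is a sandwich' probability and the crowd's share, per image."""
    assert len(model_probs) == len(crowd_shares)
    gaps = [abs(m - c) for m, c in zip(model_probs, crowd_shares)]
    return 100 * sum(gaps) / len(gaps)

# Hypothetical per-image values (1.0 = "definitely a sandwich")
model = [0.90, 0.20, 0.75, 0.50]
crowd = [0.70, 0.10, 0.95, 0.40]
print(round(mean_crowd_gap(model, crowd), 1))  # 15.0
```

Under this reading, a 16.3% mean gap means the model's confidence sat about 16 percentage points away from the crowd on a typical image.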

Under development: this benchmark and its published results are provisional, not final.

- **Best alignment (rank #1):** meta-llama/llama-3.2-11b-vision-instruct (Meta / Llama). Mean crowd gap 16.3%, score -53.6.
- **Widest miss (32.4%):** baidu/ernie-4.5-vl-28b-a3b (Baidu / ERNIE). Crowd gap 32.4%, rank #27, score -1227.3, $0.30 cost.
- **Cheapest run ($0.00):** GPT-4o (2024 run) (GPT / OpenAI). $0.00 spend, 0 tokens, 0 retries, prompt share n/a.
- **Highest spend ($17.52):** google/gemini-3.1-pro-preview (Gemini). Rank #10, 3.1M tokens, 11 retries.
- **Most tokens (45.8M):** openai/gpt-4o-mini (GPT / OpenAI). 99.8% prompt share, $3.41 cost, 0 retries.
- **Most retries (227):** google/gemini-2.5-pro (Gemini). 100.0% completion, $16.85 spend, 4.3M tokens.
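The highlight cards above are simple extrema over the leaderboard rows. A minimal sketch of that derivation, with field names assumed and sample values taken from the leaderboard below:

```python
# Illustrative subset of leaderboard rows; field names are assumptions.
rows = [
    {"model": "meta-llama/llama-3.2-11b-vision-instruct", "gap": 16.3, "tokens": 3_177_118},
    {"model": "baidu/ernie-4.5-vl-28b-a3b", "gap": 32.4, "tokens": 1_888_896},
    {"model": "openai/gpt-4o-mini", "gap": 29.9, "tokens": 45_761_300},
]

best = min(rows, key=lambda r: r["gap"])         # best alignment card
widest = max(rows, key=lambda r: r["gap"])       # widest-miss card
heaviest = max(rows, key=lambda r: r["tokens"])  # most-tokens card

print(best["model"], widest["model"], heaviest["model"])
```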
## Leaderboard

Alignment, total workload, and rank in one table.

| Rank | Vendor | Model | Score | Mean gap | Tokens | Images evaluated | Eval runs | Best fit / worst miss |
|---|---|---|---|---|---|---|---|---|
| 1 | Meta / Llama | meta-llama/llama-3.2-11b-vision-instruct | -53.6 | 16.3% | 3,177,118 | 700 | 35 | 45.8 pt (Cookie PB) |
| 2 | GPT / OpenAI | GPT-4o (2024 run) | -269.3 | 22.1% | 0 | 1,000 | 50 | n/a (no image gap) |
| 3 | GPT / OpenAI | openai/o3-pro | -330.8 | 28.5% | 757,641 | 720 | 36 | |
| 4 | Kimi / Moonshot | moonshotai/kimi-k2.5 | -373.9 | 22.8% | 2,216,641 | 1,220 | 61 | |
| 5 | Grok / xAI | x-ai/grok-4-fast | -408.3 | 17.0% | 1,599,381 | 1,980 | 99 | 53.4 pt (Bagel PB&J) |
| 6 | Qwen | qwen/qwen3.5-397b-a17b | -422.5 | 25.1% | 1,723,128 | 1,100 | 55 | |
| 7 | Gemini | google/gemini-2.5-pro | -604.3 | 25.0% | 4,307,134 | 1,620 | 81 | |
| 8 | GPT / OpenAI | openai/gpt-5.4-pro | -632.0 | 31.1% | 1,146,134 | 1,120 | 56 | |
| 9 | GPT / OpenAI | openai/gpt-4o | -641.1 | 22.4% | 1,677,512 | 2,000 | 100 | 60.2 pt (Hot Dog) |
| 10 | Gemini | google/gemini-3.1-pro-preview | -747.3 | 27.5% | 3,090,040 | 1,600 | 80 | |
| 11 | Google | google/gemma-3-12b-it | -752.3 | 25.1% | 874,701 | 2,000 | 100 | |
| 12 | Qwen | qwen/qwen-2-vl-72b-instruct | -760.0 | 24.7% | 1,888,854 | 2,000 | 100 | |
| 13 | Meta / Llama | meta-llama/llama-4-scout | -789.8 | 25.1% | 2,654,272 | 2,000 | 100 | 53.4 pt (Bagel PB&J) |
| 14 | Qwen | qwen/qwen2.5-vl-32b-instruct | -866.1 | 26.5% | 1,905,598 | 2,000 | 100 | |
| 15 | Google | google/gemma-3-27b-it | -940.6 | 28.6% | 863,650 | 2,000 | 100 | |
| 16 | Pixtral / Mistral | mistralai/pixtral-large-2411 | -954.9 | 26.5% | 3,964,789 | 2,000 | 100 | |
| 17 | Amazon | amazon/nova-lite-v1 | -962.4 | 28.6% | 3,447,484 | 2,000 | 100 | |
| 18 | GPT / OpenAI | openai/gpt-4.1-mini | -1006.7 | 23.6% | 3,548,876 | 3,100 | 155 | |
| 19 | Z.AI / GLM | z-ai/glm-4.6v | -1014.7 | 26.9% | 2,712,263 | 2,360 | 118 | |
| 20 | GPT / OpenAI | openai/gpt-5.4 | -1018.7 | 28.8% | 1,767,810 | 2,000 | 100 | |
| 21 | GPT / OpenAI | openai/gpt-4o-mini | -1037.9 | 29.9% | 45,761,300 | 2,000 | 100 | |
| 22 | Claude | anthropic/claude-sonnet-4.6 | -1091.5 | 30.1% | 1,866,927 | 2,000 | 100 | |
| 23 | Claude | anthropic/claude-opus-4.6 | -1111.1 | 30.8% | 1,869,241 | 2,000 | 100 | |
| 24 | GPT / OpenAI | openai/o3 | -1121.1 | 27.5% | 2,931,696 | 3,080 | 154 | 51.4 pt (Bagel PB&J) |
| 25 | GPT / OpenAI | openai/gpt-4o-2024-11-20 | -1174.4 | 26.0% | 2,591,828 | 3,080 | 154 | |
| 26 | Meta / Llama | meta-llama/llama-4-maverick | -1193.1 | 24.5% | 4,064,336 | 3,080 | 154 | 2.4 pt (Hot Dog) |
| 27 | Baidu / ERNIE | baidu/ernie-4.5-vl-28b-a3b | -1227.3 | 32.4% | 1,888,896 | 2,000 | 100 | 73.0 pt (Hamburger) |
| 28 | Qwen | qwen/qwen2.5-vl-72b-instruct | -1229.8 | 26.0% | 2,903,489 | 3,080 | 154 | |
| 29 | MiniMax | minimax/minimax-01 | -1384.0 | 26.5% | 15,741,394 | 3,080 | 154 | |
| 30 | Amazon | amazon/nova-pro-v1 | -1468.6 | 28.4% | 5,300,428 | 3,080 | 154 | |
| 31 | GPT / OpenAI | openai/gpt-4.1 | -1477.4 | 28.4% | 2,604,300 | 3,080 | 154 | |
| 32 | Gemini | google/gemini-3-flash-preview | -1545.0 | 29.1% | 3,915,544 | 3,100 | 155 | |

Each eval run covers 20 images. When a model did not finish the full workload, partial runs appear as fractional run counts.
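This run-to-image relationship can be checked directly against the leaderboard: runs = images / 20, fractional for partial runs. A minimal sketch using values from the table above:

```python
IMAGES_PER_RUN = 20

def runs_from_images(images_evaluated):
    """Eval runs implied by an image count; fractional when a
    model stopped partway through a run."""
    return images_evaluated / IMAGES_PER_RUN

# Values from the leaderboard above
print(runs_from_images(700))   # 35.0 (llama-3.2-11b-vision-instruct)
print(runs_from_images(1980))  # 99.0 (grok-4-fast)
print(runs_from_images(2000))  # 100.0
```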