Models

Model-by-model sandwich analytics

This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.

The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.

The primary view on this page is now the percent forecast benchmark. The older binary posterior view is still here, but it lives behind the second tab so the two benchmarks can coexist without being blended into one score.
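
The page does not define how a percent forecast is compared with the crowd, so the following is only a minimal sketch under stated assumptions. The summary figures pair a crowd gap with a crowd match that sum to 100 (9.4% with 90.6%, 10.0% with 90.0%), which suggests that crowd match may be 100 minus the mean absolute gap. "Brier v2" is named as the primary method but is not specified here; the `brier_score` function below is the standard mean squared error against the crowd share, not the site's actual scoring. All function names and sample numbers are hypothetical.

```python
from statistics import mean


def crowd_gap(model_pct, crowd_pct):
    """Mean absolute difference, in percentage points, between a model's
    percent forecasts and the crowd's percent of 'sandwich' votes.
    Assumed definition; the page does not state one."""
    if not model_pct or len(model_pct) != len(crowd_pct):
        raise ValueError("need two non-empty lists of equal length")
    return mean(abs(m - c) for m, c in zip(model_pct, crowd_pct))


def crowd_match(model_pct, crowd_pct):
    """Assumed complement of the crowd gap: 100 minus the mean gap."""
    return 100.0 - crowd_gap(model_pct, crowd_pct)


def brier_score(model_pct, crowd_pct):
    """Standard Brier-style mean squared error on a 0..1 scale.
    Lower is better. This is NOT the page's 'Brier v2', which is undefined here."""
    if not model_pct or len(model_pct) != len(crowd_pct):
        raise ValueError("need two non-empty lists of equal length")
    return mean(((m - c) / 100.0) ** 2 for m, c in zip(model_pct, crowd_pct))


# Hypothetical data: three images, model forecast vs. crowd vote share (percent).
model_forecasts = [90.0, 20.0, 55.0]
crowd_shares = [80.0, 10.0, 60.0]

gap = crowd_gap(model_forecasts, crowd_shares)
match = crowd_match(model_forecasts, crowd_shares)
brier = brier_score(model_forecasts, crowd_shares)
print(f"crowd gap {gap:.1f}%, crowd match {match:.1f}%, Brier {brier:.4f}")
```

Keeping the gap and the squared-error score as separate functions mirrors the page's choice to report crowd match alongside the score rather than blending them into one number.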

Under development: this benchmark and its published results are provisional, not final.

Percent leader: #1
Probabilistic score: 72.8
Crowd gap: 9.4%
72.8 primary score
90.6% crowd match

Tightest confidence: 52.6 pt
Probabilistic score: 72.7
Crowd gap: 10.0%
35.5 CI low
88.1 CI high

Official cohort: 67 rated
Pending official rating: 5
Total tracked models: 72
25 runs for official
Brier v2 primary method

Highest spend: $15.10
Benchmark score: 59.5
Official: Pending
$15.10 total cost
354.2K tokens

Ranking snapshot

Percent Forecast Benchmark Ratings

| Rank | Model | Score | Confidence | Crowd Match | Official | Total Eval Runs | Tokens | Total Cost |
|---|---|---|---|---|---|---|---|---|
| Human | Human | 100.0 | Reference | 100.0% | Human | 0 | 0 | n/a |
| 🥇 1 | openai/o3 (GPT / OpenAI) | 72.8 | High | 90.6% | Pending | 10 | 171,761 | $0.57 |
| 🥈 2 | openai/gpt-5.5 | 72.7 | High | 90.0% | #9 | 100 | 1,766,736 | $11.28 |
| 🥉 3 | openai/gpt-5.1 (GPT / OpenAI) | 71.7 | High | 90.6% | #12 | 67.3 | 367,711 | $0.69 |
| 4 | anthropic/claude-opus-4.5 (Claude) | 70.7 | High | 89.5% | #29 | 67.4 | 481,888 | $3.14 |
| 5 | anthropic/claude-opus-4.8 | 66.0 | High | 87.9% | #39 | 100 | 1,969,891 | $13.03 |
| 6 | openai/gpt-5.4-pro (GPT / OpenAI) | 65.6 | High | 88.0% | #14 | 66.2 | 494,054 | $0.14 |
| 7 | openai/gpt-4.1 (GPT / OpenAI) | 64.0 | High | 87.6% | #17 | 67.4 | 431,738 | $1.00 |
| 8 | anthropic/claude-opus-4.6 (Claude) | 63.9 | High | 87.1% | #31 | 67.1 | 479,703 | $3.18 |
| 9 | openai/gpt-5.1-chat (GPT / OpenAI) | 63.9 | High | 89.7% | #6 | 66.9 | 366,366 | $0.72 |
| 10 | openai/gpt-5.1-codex (GPT / OpenAI) | 63.3 | High | 90.4% | #4 | 67.2 | 366,915 | $0.69 |
| 11 | x-ai/grok-4 (Grok / xAI) | 62.1 | High | 88.3% | #21 | 31.8 | 621,741 | $3.40 |
| 12 | anthropic/claude-opus-4.7 | 61.9 | High | 86.7% | #41 | 100 | 1,984,271 | $13.18 |
| 13 | openai/gpt-4o (GPT / OpenAI) | 60.1 | High | 88.1% | #25 | 67 | 421,370 | $1.19 |
| 14 | openai/o1 (GPT / OpenAI) | 59.5 | High | 87.4% | Pending | 10 | 354,187 | $15.10 |
| 15 | openai/gpt-4o-2024-11-20 (GPT / OpenAI) | 58.3 | High | 88.2% | #24 | 67.3 | 428,023 | $1.23 |
| 16 | qwen/qwen3.5-122b-a10b (Qwen) | 57.5 | High | 87.5% | #5 | 67 | 824,124 | $1.15 |
| 17 | anthropic/claude-haiku-4.5 (Claude) | 56.2 | High | 86.2% | #47 | 67.4 | 479,031 | $0.61 |
| 18 | anthropic/claude-sonnet-4.6 (Claude) | 54.5 | High | 86.0% | #44 | 101 | 1,767,562 | $6.99 |
| 19 | openai/gpt-5.3-chat | 54.2 | High | 86.6% | #7 | 100 | 1,743,254 | $4.15 |
| 20 | google/gemini-3.1-pro-preview (Gemini) | 52.0 | High | 85.0% | Pending | 11 | 362,486 | $2.03 |
| 21 | openrouter/healer-alpha (OpenRouter) | 51.4 | High | 88.3% | #2 | 62.1 | 1,397,734 | $0.00 |
| 22 | bytedance-seed/seed-2.0-mini (ByteDance Seed) | 51.2 | High | 87.3% | #22 | 67.2 | 911,867 | $0.14 |
| 23 | openai/gpt-5.3-codex | 50.3 | High | 85.6% | #27 | 100 | 1,800,956 | $4.89 |
| 24 | nvidia/nemotron-nano-12b-v2-vl | 48.3 | High | 88.2% | #1 | 67.4 | 1,172,652 | $0.32 |
| 25 | google/gemini-3-flash-preview (Gemini) | 46.2 | High | 84.6% | #53 | 33.9 | 536,822 | $0.33 |
| 26 | moonshotai/kimi-k2.5 (Kimi / Moonshot) | 46.2 | High | 85.8% | Pending | 28.8 | 601,658 | $0.85 |
| 27 | google/gemma-3-27b-it (Google) | 40.7 | High | 83.9% | #49 | 67.5 | 209,196 | $0.01 |
| 28 | openai/gpt-5.2 (GPT / OpenAI) | 39.0 | High | 84.7% | #23 | 66.9 | 466,548 | $1.18 |
| 29 | qwen/qwen3.5-27b (Qwen) | 38.2 | High | 85.3% | #8 | 30.7 | 660,238 | $0.73 |
| 30 | google/gemini-2.5-pro (Gemini) | 34.4 | High | 84.0% | #50 | 44.6 | 1,158,206 | $4.60 |
| 31 | perplexity/sonar-pro-search (Perplexity / Sonar) | 32.6 | High | 87.0% | Pending | 10 | 24,153 | $2.19 |
| 32 | mistralai/pixtral-large-2411 (Pixtral / Mistral) | 32.5 | Medium | 85.1% | #28 | 67.3 | 1,053,239 | $1.97 |
| 33 | bytedance-seed/seed-2.0-lite (ByteDance Seed) | 31.6 | High | 82.7% | #35 | 66.9 | 987,375 | $0.65 |
| 34 | qwen/qwen3.5-plus-02-15 (Qwen) | 30.2 | Medium | 83.4% | #26 | 67.2 | 843,812 | $0.84 |
| 35 | google/gemini-3.1-flash-lite-preview (Gemini) | 28.4 | High | 82.2% | #59 | 34.1 | 532,366 | $0.16 |
| 36 | qwen/qwen3.5-35b-a3b (Qwen) | 28.2 | Medium | 83.4% | #30 | 31.1 | 591,082 | $0.42 |
| 37 | qwen/qwen3-vl-30b-a3b-thinking (Qwen) | 25.7 | Medium | 84.2% | #16 | 67.2 | 484,242 | $0.24 |
| 38 | qwen/qwen3.5-flash-02-23 (Qwen) | 24.8 | Medium | 81.8% | #34 | 67.4 | 772,393 | $0.19 |
| 39 | google/gemini-3-pro-image-preview (Gemini) | 24.5 | Medium | 83.9% | #37 | 66.9 | 462,890 | $3.78 |
| 40 | openai/gpt-4.1-mini (GPT / OpenAI) | 23.0 | Medium | 82.2% | #36 | 67 | 588,459 | $0.25 |
| 41 | x-ai/grok-4.20-beta (Grok / xAI) | 20.8 | Medium | 81.5% | #56 | 66.9 | 211,620 | $0.35 |
| 42 | bytedance-seed/seed-1.6-flash (ByteDance Seed) | 20.7 | Medium | 82.0% | #19 | 67.4 | 549,417 | $0.06 |
| 43 | z-ai/glm-4.6v (Z.AI / GLM) | 20.1 | Medium | 81.8% | #60 | 67.3 | 569,154 | $0.24 |
| 44 | openai/gpt-5.4 (GPT / OpenAI) | 18.4 | Medium | 81.3% | #43 | 67.5 | 460,948 | $1.21 |
| 45 | qwen/qwen3.5-397b-a17b (Qwen) | 18.1 | Medium | 83.0% | #20 | 66.1 | 662,122 | $1.24 |
| 46 | allenai/molmo-2-8b (AllenAI / Molmo) | 18.1 | High | 83.5% | #3 | 62.5 | 475,028 | $0.09 |
| 47 | meta-llama/llama-4-scout (Meta / Llama) | 17.1 | Medium | 79.4% | #66 | 67.3 | 691,315 | $0.08 |
| 48 | google/gemini-3.1-flash-image-preview (Gemini) | 16.5 | Medium | 80.7% | #40 | 67.3 | 288,671 | $0.22 |
| 49 | qwen/qwen2.5-vl-72b-instruct (Qwen) | 14.8 | Medium | 80.9% | #15 | 67.4 | 488,409 | $0.39 |
| 50 | qwen/qwen-2-vl-72b-instruct (Qwen) | 11.4 | Medium | 80.7% | #13 | 67 | 479,887 | $0.38 |
| 51 | google/gemini-2.5-flash (Gemini) | 11.2 | Medium | 79.1% | #51 | 34.5 | 788,489 | $0.30 |
| 52 | openai/gpt-4o-mini (GPT / OpenAI) | 11.0 | Medium | 79.1% | #61 | 67.2 | 12,330,830 | $1.84 |
| 53 | mistralai/mistral-large-2512 (Mistral) | 10.5 | Medium | 78.6% | #55 | 67.6 | 500,682 | $0.26 |
| 54 | x-ai/grok-4-fast (Grok / xAI) | 10.0 | Medium | 84.1% | #32 | 66.9 | 402,674 | $0.11 |
| 55 | bytedance-seed/seed-1.6 (ByteDance Seed) | 9.8 | Medium | 83.1% | #64 | 67.3 | 547,948 | $0.31 |
| 56 | openai/gpt-5.4-mini | 9.4 | Medium | 82.0% | #33 | 100 | 1,669,425 | $1.25 |
| 57 | meta-llama/llama-4-maverick (Meta / Llama) | 5.6 | Medium | 79.2% | #52 | 67.2 | 696,525 | $0.16 |
| 58 | x-ai/grok-4.1-fast (Grok / xAI) | 1.7 | Medium | 81.2% | #48 | 67.2 | 490,245 | $0.15 |
| 59 | google/gemma-3-12b-it (Google) | -0.1 | Medium | 79.9% | #45 | 66.9 | 203,665 | $0.08 |
| 60 | minimax/minimax-01 (MiniMax) | -4.2 | Medium | 78.6% | #38 | 66.8 | 2,723,071 | $0.56 |
| 61 | qwen/qwen3-vl-235b-a22b-instruct (Qwen) | -5.1 | Low | 78.9% | #65 | 67.2 | 382,199 | $0.11 |
| 62 | qwen/qwen2.5-vl-32b-instruct (Qwen) | -6.0 | Medium | 81.4% | #18 | 67.1 | 480,639 | $0.10 |
| 63 | amazon/nova-pro-v1 (Amazon) | -12.2 | Medium | 76.7% | #11 | 67.5 | 918,094 | $0.77 |
| 64 | qwen/qwen3-vl-30b-a3b-instruct (Qwen) | -13.0 | Low | 77.7% | #62 | 67.4 | 384,748 | $0.07 |
| 65 | qwen/qwen3.5-9b (Qwen) | -14.7 | Medium | 78.8% | #54 | 40.9 | 529,777 | $0.06 |
| 66 | google/gemini-2.5-flash-lite (Gemini) | -28.5 | Low | 74.5% | #67 | 34.5 | 774,167 | $0.08 |
| 67 | openai/gpt-5.4-nano | -30.1 | Low | 76.5% | #63 | 100 | 1,688,675 | $0.43 |
| 68 | amazon/nova-lite-v1 (Amazon) | -35.2 | Low | 75.1% | #42 | 67 | 897,847 | $0.06 |
| 69 | amazon/nova-2-lite-v1 (Amazon) | -57.5 | Low | 72.3% | #58 | 67.5 | 521,311 | $0.20 |
| 70 | baidu/ernie-4.5-vl-28b-a3b (Baidu / ERNIE) | -69.8 | Low | 69.6% | #57 | 67.3 | 487,505 | $0.08 |
| 71 | openai/gpt-4.1-nano (GPT / OpenAI) | -87.4 | Low | 71.1% | #46 | 67.3 | 869,766 | $0.09 |
| 72 | meta-llama/llama-3.2-11b-vision-instruct (Meta / Llama) | -170.9 | Low | 70.4% | #10 | 65.8 | 2,381,333 | $0.12 |

AI Model Sandwich Benchmark Rankings | opensandwich.ai