Models

Model-by-model sandwich analytics

This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.

The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.

The primary view on this page is now the percent forecast benchmark. The older binary posterior view is still here, but it lives behind the second tab so the two benchmarks can coexist without being blended into one score.
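
The page does not spell out how a percent forecast is scored, although the summary cards list "Brier v2" as the primary method. As a rough illustration only, the sketch below computes the standard Brier score: the mean squared error between forecast probabilities and 0/1 outcomes. The function name `brier_score` and the sample values are hypothetical. The "v2" variant and the scaling that produces the published scores are not documented on this page, so this should not be read as the benchmark's actual method.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Textbook Brier score: mean squared error of probabilities vs 0/1 outcomes.

    Lower is better and 0.0 is a perfect forecast. This is the standard form,
    not necessarily the "Brier v2" method named on the page.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: the model says 90% "sandwich" for an image a voter
# accepted (1) and 20% for an image a voter rejected (0).
example = brier_score([0.9, 0.2], [1, 0])
```

A percent forecast is penalized in proportion to how far it sits from each vote, which is why a confident wrong answer costs more than a hedged one.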

Under development: this benchmark and its published results are provisional, not final.

Percent leader: #1
Probabilistic score (primary score): 72.8
Crowd gap: 9.4%
Crowd match: 90.6%

Tightest confidence: 66.4 pt
Probabilistic score: 72.8
Crowd gap: 9.4%
CI low: 24.6
CI high: 91.0

Official cohort: 60 rated
Pending official rating: 5
Total tracked models: 65
Runs for official: 25
Primary method: Brier v2

Highest spend: $15.10
Benchmark score: 59.5
Official / Pending
Total cost: $15.10
Tokens: 354.2K

Best gap: #1
Crowd gap (mean gap): 24.7%
Score: -42.6

Highest spend: $167.55
Rank: #1
Tokens: 407.3K
Cost / run: $16.76
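
Several of the summary figures above appear to be derived from one another. The sketch below reproduces three of them under stated assumptions: that crowd match is the complement of crowd gap (9.4% and 90.6%), that "tightest confidence" is the width of the confidence interval (24.6 to 91.0), and that cost per run is total spend divided by the 10 eval runs listed for the $167.55 model in the leaderboard. The function names are hypothetical and the page itself does not confirm these formulas.

```python
from decimal import Decimal, ROUND_HALF_UP


def crowd_match(crowd_gap_pct: str) -> Decimal:
    # Assumption: crowd match is simply 100 minus the crowd gap.
    return Decimal("100") - Decimal(crowd_gap_pct)


def ci_width(ci_low: str, ci_high: str) -> Decimal:
    # Assumption: "tightest confidence" reports the interval width in points.
    return Decimal(ci_high) - Decimal(ci_low)


def cost_per_run(total_cost: str, runs: int) -> Decimal:
    # Total spend divided by run count, rounded half-up to whole cents.
    per_run = Decimal(total_cost) / Decimal(runs)
    return per_run.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


match = crowd_match("9.4")             # Decimal("90.6")
width = ci_width("24.6", "91.0")       # Decimal("66.4")
per_run = cost_per_run("167.55", 10)   # Decimal("16.76")
```

`Decimal` is used instead of floats so that the rounding of 16.755 to 16.76 does not depend on binary floating-point representation.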
Leaderboard

Alignment, total workload, and rank in one table

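| Rank | Model | Alignment | Tokens | Images Evaluated | Total Eval Runs | Best Fit | Worst Miss | Total Cost |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | | -42.6 | 407,342 | 200 | 10 | | | $167.55 |
| 🥈 2 | | -44.2 | 481,530 | 220 | 11 | | | $21.01 |
| 🥉 3 | | -234.5 | 9,345,504 | 2,000 | 100 | | 44.5 pt (Cookie PB) | $0.50 |
| 4 | | -269.3 | ~838,756 | 1,000 | 50 | | | $35.00 (estimate) |
| 5 | | -414.0 | 1,600,190 | 2,000 | 100 | | 53.4 pt (Bagel PB&J) | $0.53 |
| 6 | | -441.5 | 2,212,827 | 2,000 | 100 | | | $0.44 |
| 7 | | -494.7 | 4,293,988 | 2,000 | 100 | | 52.4 pt (Bagel PB&J) | $1.14 |
| 8 | | -537.6 | 1,482,379 | 2,000 | 100 | | | $3.00 |
| 9 | | -568.4 | 3,009,358 | 2,000 | 100 | 1.5 pt (Cookie PB) | 53.4 pt (Bagel PB&J) | $0.77 |
| 10 | | -576.6 | 6,578,435 | 2,000 | 100 | | | $0.00 |
| 11 | | -593.6 | 2,885,005 | 2,000 | 100 | | 53.4 pt (Bagel PB&J) | $3.41 |
| 12 | | -601.1 | 2,950,739 | 2,000 | 100 | | | $15.67 |
| 13 | | -618.1 | 3,628,492 | 2,000 | 100 | | | $10.46 |
| 14 | | -618.8 | 3,760,860 | 2,000 | 100 | | | $2.44 |
| 15 | | -641.1 | 1,677,512 | 2,000 | 100 | | 60.2 pt (Hot Dog) | $4.90 |
| 16 | | -677.0 | 1,745,414 | 2,000 | 100 | | | $0.50 |
| 17 | | -679.0 | 912,414 | 2,000 | 100 | | | $1.47 |
| 18 | | -680.1 | 3,036,535 | 2,000 | 100 | 0.8 pt (Hot Dog) | | $3.75 |
| 19 | | -685.5 | 3,458,125 | 2,000 | 100 | 1.8 pt (Hot Dog) | | $0.50 |
| 20 | | -690.0 | 2,147,071 | 2,000 | 100 | | | $0.25 |
| 21 | | -691.3 | 1,636,294 | 2,000 | 100 | 0.2 pt (Hot Dog) | | $2.47 |
| 22 | | -710.9 | 1,971,808 | 2,000 | 100 | | | $1.03 |
| 23 | | -715.0 | 2,559,505 | 2,000 | 100 | | 53.4 pt (Bagel PB&J) | $1.97 |
| 24 | | -723.1 | 1,652,963 | 2,000 | 100 | | | $1.12 |
| 25 | | -738.7 | 5,266,087 | 2,000 | 100 | | | $38.80 |
| 26 | | -752.3 | 874,701 | 2,000 | 100 | | | $0.20 |
| 27 | | -752.4 | 2,422,927 | 2,000 | 100 | | | $0.29 |
| 28 | | -760.0 | 1,888,854 | 2,000 | 100 | | | $1.51 |
| 30 | | -789.8 | 2,654,272 | 2,000 | 100 | | 53.4 pt (Bagel PB&J) | $0.31 |
| 31 | | -816.7 | 3,027,614 | 2,000 | 100 | | | $6.98 |
| 32 | | -831.8 | 2,976,068 | 2,000 | 100 | | | $2.75 |
| 33 | | -834.3 | 3,295,996 | 2,000 | 100 | | | $0.33 |
| 34 | | -838.6 | 1,476,483 | 2,000 | 100 | | | $2.94 |
| 35 | | -866.1 | 1,905,598 | 2,000 | 100 | | | $0.42 |
| 36 | | -883.6 | 2,231,230 | 2,000 | 100 | | | $1.38 |
| 37 | | -883.8 | 1,969,548 | 2,000 | 100 | | | $16.09 |
| 38 | | -886.9 | 1,839,494 | 2,000 | 100 | | | $4.71 |
| 39 | | -928.1 | 3,822,589 | 2,000 | 100 | 3.5 pt (Cookie PB) | | $28.64 |
| 40 | | -939.9 | 1,521,123 | 2,000 | 100 | | | $0.48 |
| 41 | | -940.6 | 863,650 | 2,000 | 100 | | | $0.14 |
| 42 | | -953.5 | 1,461,307 | 2,000 | 100 | | | $2.79 |
| 43 | | -954.9 | 3,964,789 | 2,000 | 100 | | | $10.07 |
| 44 | | -962.4 | 3,447,484 | 2,000 | 100 | | | $0.22 |
| 45 | | -963.7 | 1,846,916 | 2,000 | 100 | | | $2.37 |
| 46 | | -984.4 | 2,066,153 | 2,000 | 100 | | | $85.07 |
| 47 | | -985.7 | 840,428 | 2,000 | 100 | | | $0.12 |
| 48 | | -989.4 | 2,520,037 | 2,000 | 100 | 0.8 pt (Hot Dog) | | $0.78 |
| 49 | | -989.7 | 1,853,338 | 2,000 | 100 | | | $12.03 |
| 50 | | -1006.7 | 3,548,876 | 3,100 | 155 | | | $1.45 |
| 51 | | -1014.7 | 2,712,263 | 2,360 | 118 | | | $1.04 |
| 52 | | -1018.7 | 1,767,810 | 2,000 | 100 | | | $5.54 |
| 53 | | -1029.5 | 1,980,100 | 2,000 | 100 | | | $0.80 |
| 54 | | -1037.9 | 45,761,300 | 2,000 | 100 | | | $7.00 |
| 55 | | -1091.5 | 1,866,927 | 2,000 | 100 | | | $9.05 |
| 56 | | -1111.1 | 1,869,241 | 2,000 | 100 | | | $15.45 |
| 57 | | -1121.1 | 2,931,696 | 3,080 | 154 | | 51.4 pt (Bagel PB&J) | $10.16 |
| 58 | | -1133.1 | 2,040,637 | 2,000 | 100 | | | $124.93 |
| 59 | | -1169.9 | 1,518,179 | 2,000 | 100 | | | $0.27 |
| 60 | | -1174.4 | 2,591,828 | 3,080 | 154 | | | $7.48 |
| 61 | | -1193.1 | 4,064,336 | 3,080 | 154 | 2.4 pt (Hot Dog) | | $1.00 |
| 62 | | -1227.3 | 1,888,896 | 2,000 | 100 | | 73.0 pt (Hamburger) | $0.30 |
| 63 | | -1229.8 | 2,903,489 | 3,080 | 154 | | | $3.07 |
| 64 | | -1278.7 | 1,912,839 | 2,000 | 100 | | | $1.04 |
| 65 | | -1384.0 | 15,741,394 | 3,080 | 154 | | | $3.24 |
| 66 | | -1468.6 | 5,300,428 | 3,080 | 154 | | | $4.49 |
| 67 | | -1477.4 | 2,604,300 | 3,080 | 154 | | | $6.08 |
| 68 | | -1545.0 | 3,915,544 | 3,100 | 155 | | | $2.45 |
| - | Total (All models) | - | 227,565,630 | 141,540 | 7,077 | - | - | $726.34 |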

Each iteration is 20 images. Partial runs are shown as fractional iterations when a model did not finish the full workload.
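
The relationship between images and iterations can be checked with a few lines of arithmetic. This is a minimal sketch: the constant and function names are hypothetical, and the 210-image case is an invented example of a partial run rather than a row from the table.

```python
IMAGES_PER_ITERATION = 20  # stated on the page: each iteration is 20 images


def iterations(images_evaluated: int) -> float:
    """Convert an image count into iterations; partial runs come out fractional."""
    return images_evaluated / IMAGES_PER_ITERATION


full_run = iterations(2_000)       # 100.0, matching most leaderboard rows
all_models = iterations(141_540)   # 7077.0, matching the totals row
partial_run = iterations(210)      # 10.5, a hypothetical unfinished workload
```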

AI Model Sandwich Benchmark Rankings | opensandwich.ai