Tiny eval. Huge sandwich energy.

OpenSandwich.ai

A deliberately low-stakes benchmark for a real alignment problem: can models recover a fuzzy human category when the category is lunch, the humans disagree, and the edge cases get deeply stupid?

This benchmark is deliberately small and scientifically annoying: twenty photos, one binary judgment, and a category boundary that humans themselves fail to stabilize. That is not a bug. It is the whole experiment.

In other words, we are stress-testing multimodal reasoning with an open-faced sandwich, a hostile ontology, and a crowd baseline that occasionally wakes up and chooses chaos. If a model cannot survive this, it probably should not sound so smug elsewhere.

Under development: this benchmark and its published results are provisional, not final.

Tokens burned
134.8M
Token volume consumed across the published benchmark run.
Total requests
81.2K
Benchmark calls plus another 14K sentiment-analysis requests from the latest pass.
Human judgments
13.1K
656 respondents across all 20 photos (656 × 20 ≈ 13.1K).
Model judgments
67.2K
3,359 full passes over the 20-image set (3,359 × 20 ≈ 67.2K), published March 8, 2026.
Total bill / Buy us a sandwich
$329.02
Spent so far. Donation links will live here once the accounts are wired up.
You can vote
Live survey
Add your own judgment to the pile and strengthen the human baseline.
Benchmark

20 images, from clean sandwiches to cases that make lunch law collapse.

Protocol

Humans and models get the same blunt question: is this a sandwich or not? A minimal request sketch follows these cards.

Scoring

Repeated runs turn one-off guesses into a ranking that survives variance; the aggregation idea is sketched after these cards.

Signal

If a model fumbles this category, its confidence elsewhere deserves scrutiny.
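For the protocol, the per-call shape is simple enough to sketch. What follows is a minimal illustration, not the production harness: it assumes an OpenAI-compatible endpoint (the leaderboard's model IDs look OpenRouter-style), and the prompt wording and answer parsing are illustrative stand-ins, not the published ones.

```python
# Minimal sketch of one benchmark call: one photo, one blunt question.
# Assumptions, loudly: an OpenAI-compatible endpoint (the leaderboard's
# model IDs look OpenRouter-style); the prompt text and parsing are
# illustrative, not the published harness.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # assumed endpoint
    api_key="sk-...",  # your key, not ours
)

def judge(model: str, image_url: str) -> str:
    """Ask one model for a binary sandwich verdict on one photo."""
    resp = client.chat.completions.create(
        model=model,  # e.g. "meta-llama/llama-3.2-11b-vision-instruct"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this a sandwich? Answer with exactly one word: yes or no."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    return "sandwich" if answer.startswith("yes") else "not_sandwich"
```

One bit of lunch ontology out per call; everything else is aggregation.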
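And the aggregation idea, as a sketch: count how often a model's repeated verdicts land on the human majority for the same image. Reading "Crowd match" that way is an assumption on this page, and the published Score column folds in more than this snippet shows.

```python
# Sketch of the aggregation idea: repeated passes per image, compared
# against the human majority. Reading "Crowd match" this way is an
# assumption; the published Score column is not reproduced here.
from collections import Counter

def crowd_match(model_verdicts: dict[str, list[str]],
                human_votes: dict[str, list[str]]) -> float:
    """Fraction of a model's verdicts (across all passes and images)
    that agree with the human majority on the same image."""
    hits = total = 0
    for image_id, verdicts in model_verdicts.items():
        majority, _ = Counter(human_votes[image_id]).most_common(1)[0]
        hits += sum(v == majority for v in verdicts)
        total += len(verdicts)
    return hits / total if total else 0.0
```

Repeated passes are the whole point: one pass is a guess, 3,359 are a ranking.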

Ranking snapshot

Sandwich Alignment Rankings - Current

| Rank | Model | Score | Confidence | Crowd match |
| --- | --- | --- | --- | --- |
| Human | Human (reference) | 100.0 | Reference | 100.0% |
| 🥇 1 | meta-llama/llama-3.2-11b-vision-instruct | 40.0 | High | 83.7% |
| 🥈 2 | GPT-4o (2024 run) | 37.1 | High | 77.9% |
| 🥉 3 | openai/o3-pro | 36.3 | High | 71.5% |
| 4 | moonshotai/kimi-k2.5 | 35.7 | High | 77.2% |
| 5 | x-ai/grok-4-fast | 35.2 | High | 83.0% |
| 6 | qwen/qwen3.5-397b-a17b | 35.1 | High | 74.9% |
| 7 | google/gemini-2.5-pro | 32.6 | High | 75.0% |
| 8 | openai/gpt-5.4-pro | 32.2 | Medium | 68.9% |
| 9 | openai/gpt-4o | 32.1 | Medium | 77.6% |
| 10 | google/gemini-3.1-pro-preview | 30.7 | Medium | 72.5% |
| 11 | google/gemma-3-12b-it | 30.6 | Medium | 74.9% |
| 12 | qwen/qwen-2-vl-72b-instruct | 30.5 | Medium | 75.3% |
| 13 | meta-llama/llama-4-scout | 30.1 | Medium | 74.9% |
| 14 | qwen/qwen2.5-vl-32b-instruct | 29.1 | Medium | 73.5% |
| 15 | google/gemma-3-27b-it | 28.1 | Medium | 71.4% |
| 16 | mistralai/pixtral-large-2411 | 27.9 | Low | 73.5% |
| 17 | amazon/nova-lite-v1 | 27.8 | Medium | 71.4% |
| 18 | openai/gpt-4.1-mini | 27.2 | Medium | 76.4% |
| 19 | z-ai/glm-4.6v | 27.1 | Medium | 73.1% |
| 20 | openai/gpt-5.4 | 27.1 | Low | 71.2% |
| 21 | openai/gpt-4o-mini | 26.8 | Medium | 70.1% |
| 22 | anthropic/claude-sonnet-4.6 | 26.1 | Medium | 69.9% |
| 23 | anthropic/claude-opus-4.6 | 25.8 | Medium | 69.2% |
| 24 | openai/o3 | 25.7 | Medium | 72.5% |
| 25 | openai/gpt-4o-2024-11-20 | 25.0 | Low | 74.0% |
| 26 | meta-llama/llama-4-maverick | 24.7 | Low | 75.5% |
| 27 | baidu/ernie-4.5-vl-28b-a3b | 24.3 | Medium | 67.6% |
| 28 | qwen/qwen2.5-vl-72b-instruct | 24.2 | Medium | 74.0% |
| 29 | minimax/minimax-01 | 22.2 | Low | 73.5% |
| 30 | amazon/nova-pro-v1 | 21.0 | Low | 71.6% |
| 31 | openai/gpt-4.1 | 20.9 | Low | 71.6% |
| 32 | google/gemini-3-flash-preview | 20.0 | Low | 70.9% |
Fault Lines

Benchmark images that expose the biggest cracks

These are the photos that cause the best arguments. Open any one to see the image, the human split, the model spread, and a few comments from both species.

Cookie PB: 51.5%
Bagel PB&J: 46.6%
Kitten in Bread: 54.2%

Each one turns a simple lunch question into a philosophical incident.
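If you want to hunt for fault lines in your own data, the filter is one loop: keep the images where the human vote hovers near an even split. The sketch below reads each card's percentage as the share of humans voting "sandwich" (an assumption) and uses an illustrative threshold, not the site's actual cutoff.

```python
# Sketch: flag "fault line" images, i.e. photos where the human vote
# hovers near an even split. The 0.15 band is an illustrative
# threshold, not the site's actual cutoff.
def fault_lines(sandwich_share: dict[str, float],
                band: float = 0.15) -> list[tuple[str, float]]:
    """Return (image_id, share) for contested images, most contested first."""
    contested = [(img, s) for img, s in sandwich_share.items()
                 if abs(s - 0.5) <= band]
    return sorted(contested, key=lambda t: abs(t[1] - 0.5))

# With the card numbers above (read as share of humans voting yes):
print(fault_lines({"cookie_pb": 0.515, "bagel_pbj": 0.466, "kitten_in_bread": 0.542}))
# -> [('cookie_pb', 0.515), ('bagel_pbj', 0.466), ('kitten_in_bread', 0.542)]
```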

What you get

The benchmark, but actually worth poking around in

This is the public-facing layer for the whole experiment: the rankings, the image-by-image splits, the human comments, and the exact places where the models start sounding far too confident about cursed lunch ontology.

If you are here for the joke, it is all here. If you are here for the eval design, the data trail is here too. The fun part is that both audiences are looking at the same sandwich.