Benchmark Results

Finding the Best AI for Product Descriptions

We pitted 8 AI models against each other across 14 real products using human preference voting. In total, 305 head-to-head comparisons decided the winner.

305 Human Votes · 14 Products Tested · 8 Models Compared

ELO Ranking

Average ELO score across all products. Starting ELO is 1400.

Rank	Model	ELO	Type
1	Gemini Flash	1441	single
2	Gemini→Claude	1426	pipeline
3	GPT-5 Nano	1417	single
4	Claude Sonnet 4	1416	single
5	GPT-4o Mini→4o	1415	pipeline
6	Qwen 3.5	1393	single
7	Mistral Small	1363	single
8	GPT-4o	1328	single

Win Rate

Percentage of head-to-head matchups won across all products.

Model	Win Rate	Record
Gemini Flash	88.5%	46W / 6L
Gemini→Claude	71.9%	41W / 16L
GPT-5 Nano	65.5%	36W / 19L
Claude Sonnet 4	63.5%	33W / 19L
GPT-4o Mini→4o	66.7%	30W / 15L
Qwen 3.5	45.5%	30W / 36L
Mistral Small	23.9%	16W / 51L
GPT-4o	1.4%	1W / 71L
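The percentages above can be reproduced from the win/loss records. The following is a minimal sketch, assuming the win rate is simply wins divided by wins plus losses, rounded to one decimal place; the function name `win_rate` is hypothetical and not part of the benchmark's own tooling.

```python
def win_rate(wins: int, losses: int) -> float:
    """Return the win rate in percent, rounded to one decimal place.

    Assumption: win rate = wins / (wins + losses), ignoring any
    matchups that produced neither a win nor a loss.
    """
    total = wins + losses
    if total == 0:
        return 0.0  # no decided matchups, so no meaningful rate
    return round(100 * wins / total, 1)


# Records taken from the Win Rate table.
print(win_rate(46, 6))   # Gemini Flash
print(win_rate(30, 15))  # GPT-4o Mini→4o
print(win_rate(1, 71))   # GPT-4o
```

Under this assumption the three calls give 88.5, 66.7 and 1.4, which matches the reported figures.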

Votes Won

Total times each model was preferred by human voters.

[Bar chart: votes won per model, listed in ranking order from Gemini Flash to GPT-4o. The bar values are not preserved in this text version.]

Performance by Product

ELO score per model per product. In the original heatmap, colors show relative rank within each product row: green = top performer, red = bottom.

Product	Gemini Flash	Gemini→Claude	GPT-5 Nano	Claude Sonnet 4	GPT-4o Mini→4o	Qwen 3.5	Mistral Small	GPT-4o
Rollschuhe	1498	1405	1453	1404	1426	1328	1357	1329
Cars Carrara Bahn	1497	1385	1383	1416	1374	1401	1416	1328
Radon Trekkingrad Solution Sport 3	1484	1403	1381	1430	1414	1380	1357	1351
Schreibtisch	1431	1474	1429	1377	1427	1370	1326	1366
Kinderwagen	1460	1431	1428	1416	1416	1382	1343	1324
Petroleum Lampe	1457	1416	1417	1404	1442	1394	1341	1329
Keramik-Wanduhr	1430	1432	1455	1356	1368	1445	1383	1331
Zangen Set	1419	1440	1443	1384	1455	1408	1357	1294
Waschmaschine	1413	1428	1432	1384	1431	1416	1382	1314
Roter Sessel	1443	1399	1385	1444	1416	1415	1356	1342
Monopoly Bremen	1399	1417	1448	1447	1412	1383	1395	1299
Bosch Bohrmaschine	1413	1442	1456	1458	1386	1385	1322	1338
Bowleset mit Tassen	1404	1482	1348	1458	1415	1383	1366	1344
Pegasus Fahrrad	1429	1409	1384	1446	1433	1413	1385	1301
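The overall ranking is described as the average ELO across all products, so it can be checked against the table above. The sketch below assumes a plain arithmetic mean rounded to the nearest integer; the variable and function names are illustrative only.

```python
from statistics import mean

# Per-product ELO scores copied from the table, in row order
# (Rollschuhe through Pegasus Fahrrad).
gemini_flash_scores = [1498, 1497, 1484, 1431, 1460, 1457, 1430,
                       1419, 1413, 1443, 1399, 1413, 1404, 1429]
gpt_4o_scores = [1329, 1328, 1351, 1366, 1324, 1329, 1331,
                 1294, 1314, 1342, 1299, 1338, 1344, 1301]


def average_elo(scores: list[int]) -> int:
    """Average the per-product scores and round to a whole ELO point."""
    return round(mean(scores))


print(average_elo(gemini_flash_scores))  # top-ranked model
print(average_elo(gpt_4o_scores))        # bottom-ranked model
```

With these inputs the averages come out to 1441 and 1328, in line with the ELO Ranking section.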

Methodology

ELO Rating

All models start at ELO 1400. Each head-to-head vote adjusts both scores using the standard ELO formula (K=32). Higher ELO = stronger overall preference.
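As an illustration of the update rule, here is a minimal sketch of the standard ELO formula with K=32. It assumes the usual logistic expected score with a 400-point scale and that every vote has a clear winner; the function names are hypothetical, and the benchmark's actual implementation may differ in details such as rounding or vote ordering.

```python
K = 32            # update factor stated in the methodology
START_ELO = 1400  # starting rating stated in the methodology


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner: float, loser: float, k: float = K) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after a single vote."""
    # The winner gains exactly what the loser gives up, so the sum of
    # both ratings is unchanged by the update.
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# First vote between two fresh models: each expected score is 0.5,
# so the ratings move by K / 2 = 16 points.
print(update_elo(START_ELO, START_ELO))  # (1416.0, 1384.0)
```

Because every update moves the same number of points from the loser to the winner, the ratings within one pool always sum to the same total. This is consistent with the per-product table, where each row of eight scores sums to 8 × 1400 = 11200.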

Blind Voting

Voters see two descriptions side-by-side without knowing which model generated each. They pick the one they prefer as a product description.

Coverage

Each model generated descriptions for all 14 products. Every possible pair of descriptions was presented as a matchup, covering all head-to-head combinations.
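To make the coverage claim concrete, the sketch below counts how many matchups full pairwise coverage implies. It assumes one matchup per unordered pair of models per product; that reading of "every possible pair" is an assumption, not something the methodology spells out.

```python
from itertools import combinations

# Model names as they appear in the ranking.
MODELS = [
    "Gemini Flash", "Gemini→Claude", "GPT-5 Nano", "Claude Sonnet 4",
    "GPT-4o Mini→4o", "Qwen 3.5", "Mistral Small", "GPT-4o",
]
NUM_PRODUCTS = 14

# Every unordered pair of distinct models forms one matchup per product.
pairs = list(combinations(MODELS, 2))
matchups_per_product = len(pairs)
total_matchups = matchups_per_product * NUM_PRODUCTS

print(matchups_per_product)  # 28 pairs for 8 models
print(total_matchups)        # 392 matchups across 14 products
```

Under this assumption there are 392 possible matchups, more than the 305 votes recorded in the snapshot.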

Data snapshot: 2026-06-03 · 305 votes across 14 products