Benchmark Results
Finding the Best AI for Product Descriptions
We pitted 8 AI setups (6 single models and 2 two-model pipelines) against each other across 14 real products using human preference voting. A total of 305 head-to-head comparisons decided the winner.
305 Human Votes · 14 Products Tested · 8 Models Compared
ELO Ranking
Average ELO score across all products. Starting ELO is 1400.
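| Rank | Model | Setup | Average ELO |
|---|---|---|---|
| 1 | Gemini Flash | single | 1441 |
| 2 | Gemini→Claude | pipeline | 1426 |
| 3 | GPT-5 Nano | single | 1417 |
| 4 | Claude Sonnet 4 | single | 1416 |
| 5 | GPT-4o Mini→4o | pipeline | 1415 |
| 6 | Qwen 3.5 | single | 1393 |
| 7 | Mistral Small | single | 1363 |
| 8 | GPT-4o | single | 1328 |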
Win Rate
Percentage of head-to-head matchups won across all products.
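| Model | Win Rate | Record (W / L) |
|---|---|---|
| Gemini Flash | 88.5% | 46W / 6L |
| Gemini→Claude | 71.9% | 41W / 16L |
| GPT-5 Nano | 65.5% | 36W / 19L |
| Claude Sonnet 4 | 63.5% | 33W / 19L |
| GPT-4o Mini→4o | 66.7% | 30W / 15L |
| Qwen 3.5 | 45.5% | 30W / 36L |
| Mistral Small | 23.9% | 16W / 51L |
| GPT-4o | 1.4% | 1W / 71L |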
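The percentages above are consistent with wins divided by decided matchups (wins plus losses). A minimal sketch of that arithmetic, assuming ties or skipped matchups are simply excluded; the function name `win_rate` is illustrative and not taken from the benchmark's code:

```python
def win_rate(wins: int, losses: int) -> float:
    """Return the share of decided matchups won, as a percentage."""
    total = wins + losses
    if total == 0:
        raise ValueError("no decided matchups")
    return 100.0 * wins / total


# Win/loss records copied from the table above.
records = {
    "Gemini Flash": (46, 6),
    "GPT-4o Mini→4o": (30, 15),
    "GPT-4o": (1, 71),
}

for model, (wins, losses) in records.items():
    print(f"{model}: {win_rate(wins, losses):.1f}%")
```

Because the rate depends on how many matchups a model actually played, GPT-4o Mini→4o (45 decided matchups) can show a higher win rate than Claude Sonnet 4 (52 decided matchups) while ranking below it on ELO.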
Votes Won
Total times each model was preferred by human voters.
| Model | Votes Won |
|---|---|
| Gemini Flash | 59 |
| Gemini→Claude | 48 |
| GPT-5 Nano | 53 |
| Claude Sonnet 4 | 50 |
| GPT-4o Mini→4o | 41 |
| Qwen 3.5 | 30 |
| Mistral Small | 16 |
| GPT-4o | 5 |
Performance by Product
ELO score per model per product. Colors show relative rank within each product row: green = top performer, red = bottom.
| Product | Gemini Flash | Gemini→Claude | GPT-5 Nano | Claude Sonnet 4 | GPT-4o Mini→4o | Qwen 3.5 | Mistral Small | GPT-4o |
|---|---|---|---|---|---|---|---|---|
| Rollschuhe | 1498 | 1405 | 1453 | 1404 | 1426 | 1328 | 1357 | 1329 |
| Cars Carrara Bahn | 1497 | 1385 | 1383 | 1416 | 1374 | 1401 | 1416 | 1328 |
| Radon Trekkingrad Solution Sport 3 | 1484 | 1403 | 1381 | 1430 | 1414 | 1380 | 1357 | 1351 |
| Schreibtisch | 1431 | 1474 | 1429 | 1377 | 1427 | 1370 | 1326 | 1366 |
| Kinderwagen | 1460 | 1431 | 1428 | 1416 | 1416 | 1382 | 1343 | 1324 |
| Petroleum Lampe | 1457 | 1416 | 1417 | 1404 | 1442 | 1394 | 1341 | 1329 |
| Keramik-Wanduhr | 1430 | 1432 | 1455 | 1356 | 1368 | 1445 | 1383 | 1331 |
| Zangen Set | 1419 | 1440 | 1443 | 1384 | 1455 | 1408 | 1357 | 1294 |
| Waschmaschine | 1413 | 1428 | 1432 | 1384 | 1431 | 1416 | 1382 | 1314 |
| Roter Sessel | 1443 | 1399 | 1385 | 1444 | 1416 | 1415 | 1356 | 1342 |
| Monopoly Bremen | 1399 | 1417 | 1448 | 1447 | 1412 | 1383 | 1395 | 1299 |
| Bosch Bohrmaschine | 1413 | 1442 | 1456 | 1458 | 1386 | 1385 | 1322 | 1338 |
| Bowleset mit Tassen | 1404 | 1482 | 1348 | 1458 | 1415 | 1383 | 1366 | 1344 |
| Pegasus Fahrrad | 1429 | 1409 | 1384 | 1446 | 1433 | 1413 | 1385 | 1301 |
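The overall ranking is described as the average ELO across all products, so it should be reproducible from the columns of this table. A small sketch that checks two columns, assuming a plain unweighted mean rounded to the nearest integer (the rounding rule is an assumption):

```python
from statistics import mean

# Per-product ELO scores copied from the table above, in row order.
per_product_elo = {
    "Gemini Flash": [1498, 1497, 1484, 1431, 1460, 1457, 1430,
                     1419, 1413, 1443, 1399, 1413, 1404, 1429],
    "GPT-4o": [1329, 1328, 1351, 1366, 1324, 1329, 1331,
               1294, 1314, 1342, 1299, 1338, 1344, 1301],
}

# Unweighted mean over the 14 products, rounded to the nearest integer.
average_elo = {model: round(mean(scores))
               for model, scores in per_product_elo.items()}

for model, score in average_elo.items():
    print(model, score)
```

Both results match the published ranking (1441 for Gemini Flash and 1328 for GPT-4o).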
Methodology
ELO Rating
All models start at ELO 1400. Each head-to-head vote adjusts both scores using the standard ELO formula (K=32). Higher ELO = stronger overall preference.
Blind Voting
Voters see two descriptions side-by-side without knowing which model generated each. They pick the one they prefer as a product description.
Coverage
Each model generated descriptions for all 14 products. Every possible pair of descriptions was presented as a matchup, covering all head-to-head combinations.
Data snapshot: 2026-06-03 · 305 votes across 14 products
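For readers unfamiliar with the rating update mentioned under ELO Rating, here is a minimal sketch of one vote being applied. It uses K=32 and the 1400 starting score from the methodology. The 400-point logistic scale is the conventional Elo value and is assumed here, since the page does not state it. The function names are illustrative and not taken from the benchmark's code.

```python
K = 32            # update factor stated in the methodology
START_ELO = 1400  # starting score stated in the methodology
SCALE = 400       # conventional Elo scale; an assumption, not stated on the page


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / SCALE))


def update_elo(winner: float, loser: float, k: float = K) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one head-to-head vote."""
    gain = k * (1.0 - expected_score(winner, loser))
    # The winner gains exactly what the loser gives up, so the total is conserved.
    return winner + gain, loser - gain


# First vote between two fresh models: both start at 1400.
new_winner, new_loser = update_elo(START_ELO, START_ELO)
print(new_winner, new_loser)  # 1416.0 1384.0
```

An upset moves the scores more than an expected result: a win by the lower-rated description yields a gain above 16 points, while a win by the higher-rated one yields less.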