Benchmark Results

Finding the Best AI for Product Descriptions

We pitted 8 AI models (6 single models and 2 two-model pipelines) against each other across 14 real products using blind human preference voting. A total of 305 head-to-head votes decided the ranking.

🗳️ 305 Human Votes · 📦 14 Products Tested · 🤖 8 Models Compared

ELO Ranking

Average ELO score across all products. Starting ELO is 1400.

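| Rank | Model | ELO | Type |
|---|---|---|---|
| 1 | Gemini Flash | 1441 | single |
| 2 | Gemini→Claude | 1426 | pipeline |
| 3 | GPT-5 Nano | 1417 | single |
| 4 | Claude Sonnet 4 | 1416 | single |
| 5 | GPT-4o Mini→4o | 1415 | pipeline |
| 6 | Qwen 3.5 | 1393 | single |
| 7 | Mistral Small | 1363 | single |
| 8 | GPT-4o | 1328 | single |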

Win Rate

Percentage of head-to-head matchups won across all products.

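| Model | Win rate | Record (W / L) |
|---|---|---|
| Gemini Flash | 88.5% | 46W / 6L |
| Gemini→Claude | 71.9% | 41W / 16L |
| GPT-5 Nano | 65.5% | 36W / 19L |
| Claude Sonnet 4 | 63.5% | 33W / 19L |
| GPT-4o Mini→4o | 66.7% | 30W / 15L |
| Qwen 3.5 | 45.5% | 30W / 36L |
| Mistral Small | 23.9% | 16W / 51L |
| GPT-4o | 1.4% | 1W / 71L |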
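The percentages above are consistent with wins divided by decided matchups, W / (W + L). The page does not state the formula, so treat the following as a minimal sketch under that assumption. The `win_rate` helper and the `records` dictionary are illustrative names, filled with the W/L figures listed above.

```python
def win_rate(wins: int, losses: int) -> float:
    """Win rate in percent, rounded to one decimal (assumes W / (W + L))."""
    return round(100.0 * wins / (wins + losses), 1)


# (wins, losses) per model, copied from the win-rate listing.
records = {
    "Gemini Flash": (46, 6),
    "Gemini→Claude": (41, 16),
    "GPT-5 Nano": (36, 19),
    "Claude Sonnet 4": (33, 19),
    "GPT-4o Mini→4o": (30, 15),
    "Qwen 3.5": (30, 36),
    "Mistral Small": (16, 51),
    "GPT-4o": (1, 71),
}

for model, (wins, losses) in records.items():
    print(f"{model}: {win_rate(wins, losses)}%")

# Every decided matchup has one winner and one loser,
# so total wins and total losses should match.
total_wins = sum(w for w, _ in records.values())
total_losses = sum(l for _, l in records.values())
print(total_wins, total_losses)
```

Because the number of decided matchups differs per model (45 for GPT-4o Mini→4o versus 72 for GPT-4o), win rate and ELO rank do not have to agree exactly.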

Votes Won

Total times each model was preferred by human voters.

| Model | Votes won |
|---|---|
| Gemini Flash | 59 |
| Gemini→Claude | 48 |
| GPT-5 Nano | 53 |
| Claude Sonnet 4 | 50 |
| GPT-4o Mini→4o | 41 |
| Qwen 3.5 | 30 |
| Mistral Small | 16 |
| GPT-4o | 5 |

Performance by Product

ELO score per model per product. Colors show relative rank within each product row: green marks the top performer, red the bottom.

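| Product | Gemini Flash | Gemini→Claude | GPT-5 Nano | Claude Sonnet 4 | GPT-4o Mini→4o | Qwen 3.5 | Mistral Small | GPT-4o |
|---|---|---|---|---|---|---|---|---|
| Rollschuhe (roller skates) | 1498 | 1405 | 1453 | 1404 | 1426 | 1328 | 1357 | 1329 |
| Cars Carrara Bahn (Cars racing track) | 1497 | 1385 | 1383 | 1416 | 1374 | 1401 | 1416 | 1328 |
| Radon Trekkingrad Solution Sport 3 (trekking bike) | 1484 | 1403 | 1381 | 1430 | 1414 | 1380 | 1357 | 1351 |
| Schreibtisch (desk) | 1431 | 1474 | 1429 | 1377 | 1427 | 1370 | 1326 | 1366 |
| Kinderwagen (stroller) | 1460 | 1431 | 1428 | 1416 | 1416 | 1382 | 1343 | 1324 |
| Petroleum Lampe (kerosene lamp) | 1457 | 1416 | 1417 | 1404 | 1442 | 1394 | 1341 | 1329 |
| Keramik-Wanduhr (ceramic wall clock) | 1430 | 1432 | 1455 | 1356 | 1368 | 1445 | 1383 | 1331 |
| Zangen Set (pliers set) | 1419 | 1440 | 1443 | 1384 | 1455 | 1408 | 1357 | 1294 |
| Waschmaschine (washing machine) | 1413 | 1428 | 1432 | 1384 | 1431 | 1416 | 1382 | 1314 |
| Roter Sessel (red armchair) | 1443 | 1399 | 1385 | 1444 | 1416 | 1415 | 1356 | 1342 |
| Monopoly Bremen | 1399 | 1417 | 1448 | 1447 | 1412 | 1383 | 1395 | 1299 |
| Bosch Bohrmaschine (Bosch drill) | 1413 | 1442 | 1456 | 1458 | 1386 | 1385 | 1322 | 1338 |
| Bowleset mit Tassen (punch bowl set with cups) | 1404 | 1482 | 1348 | 1458 | 1415 | 1383 | 1366 | 1344 |
| Pegasus Fahrrad (Pegasus bicycle) | 1429 | 1409 | 1384 | 1446 | 1433 | 1413 | 1385 | 1301 |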
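The overall ranking is described as the average ELO across all products, and the per-product scores above can be used to check that. The sketch below is only an illustration: the variable names are made up, and it covers just the top- and bottom-ranked models, with values copied from their columns in product order.

```python
from statistics import mean

# Per-product ELO scores, in the order the 14 products are listed.
gemini_flash = [1498, 1497, 1484, 1431, 1460, 1457, 1430,
                1419, 1413, 1443, 1399, 1413, 1404, 1429]
gpt_4o = [1329, 1328, 1351, 1366, 1324, 1329, 1331,
          1294, 1314, 1342, 1299, 1338, 1344, 1301]

# Average across products, rounded to the nearest whole point.
gemini_flash_avg = round(mean(gemini_flash))
gpt_4o_avg = round(mean(gpt_4o))

print(gemini_flash_avg)  # 1441
print(gpt_4o_avg)        # 1328
```

Both averages match the scores reported in the overall ranking (1441 and 1328).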

Methodology

ELO Rating
All models start at ELO 1400. Each head-to-head vote adjusts both scores using the standard ELO formula (K=32). Higher ELO = stronger overall preference.
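As an illustration of the update rule, here is a minimal Python sketch of the standard Elo formula with K=32. The function names are hypothetical and the page does not show its actual implementation, so details such as rounding or the order in which votes are processed are not covered here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after a single vote."""
    # The winner gains K * (1 - expected win probability);
    # the loser gives up the same number of points.
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# First vote between two models that both start at 1400.
new_winner, new_loser = update_elo(1400.0, 1400.0)
print(new_winner, new_loser)  # 1416.0 1384.0
```

With equal starting ratings the expected score is 0.5, so the first vote moves each model by 16 points. That would explain why 1416 and 1384 appear so often in the per-product scores.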
Blind Voting
Voters see two descriptions side-by-side without knowing which model generated each. They pick the one they prefer as a product description.
Coverage
Each model generated descriptions for all 14 products. Every possible pair of descriptions was presented as a matchup, covering all head-to-head combinations.
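To make the coverage concrete, the sketch below enumerates every unordered pair of the 8 models with Python's `itertools.combinations`. It assumes one matchup per model pair per product. The page does not say how votes map onto matchups, for example whether a matchup can receive zero or several votes.

```python
from itertools import combinations

models = [
    "Gemini Flash", "Gemini→Claude", "GPT-5 Nano", "Claude Sonnet 4",
    "GPT-4o Mini→4o", "Qwen 3.5", "Mistral Small", "GPT-4o",
]
num_products = 14

# Every unordered pair of models forms one head-to-head matchup per product.
pairs = list(combinations(models, 2))
matchups_per_product = len(pairs)
total_matchups = matchups_per_product * num_products

print(matchups_per_product)  # 28
print(total_matchups)        # 392
```

Under that assumption there are 28 matchups per product and 392 in total, compared with the 305 votes in this snapshot.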

Data snapshot: 2026-06-03 · 305 votes across 14 products