These leaderboards contain the test-set scores of various models; each of the three tables below is a separate leaderboard. To submit your own model's generations, see Test Set Evaluation.
Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
---|---|---|---|---|---|---|---|---|
GPT-4o (OpenAI, 2024) | 128,000 | 0.441 | 0.081 | 0.474 | 0.273 | 0.417 | 0.488 | 0.214 |
Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.474 | 0.086 | 0.472 | 0.266 | 0.397 | 0.488 | 0.210 |
Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.515 | 0.110 | 0.444 | 0.244 | 0.369 | 0.495 | 0.200 |
GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.460 | 0.101 | 0.482 | 0.290 | 0.409 | 0.493 | 0.199 |
Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.448 | 0.088 | 0.455 | 0.250 | 0.378 | 0.470 | 0.196 |
Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.470 | 0.081 | 0.302 | 0.158 | 0.254 | 0.466 | 0.186 |
Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.450 | 0.072 | 0.429 | 0.240 | 0.359 | 0.478 | 0.170 |
Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.466 | 0.068 | 0.463 | 0.264 | 0.387 | 0.478 | 0.157 |
GPT-4 (OpenAI, 2023) | 8,192 | 0.355 | 0.066 | 0.313 | 0.177 | 0.267 | 0.419 | 0.149 |
GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.398 | 0.058 | 0.401 | 0.227 | 0.342 | 0.455 | 0.145 |
LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.440 | 0.058 | 0.285 | 0.149 | 0.238 | 0.441 | 0.120 |
Claude 2.1 (Anthropic, 2023) | 200,000 | 0.341 | 0.041 | 0.412 | 0.208 | 0.344 | 0.426 | 0.110 |
Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.427 | 0.055 | 0.260 | 0.123 | 0.212 | 0.449 | 0.102 |
Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.187 | 0.023 | 0.194 | 0.088 | 0.158 | 0.367 | 0.048 |
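The Loose and Strict columns appear to be answer-matching accuracies at two levels of tolerance. The snippet below is a hypothetical reconstruction of that style of grading, assuming strict means exact string equality after normalization and loose means the normalized reference merely appears inside the prediction; the leaderboard's actual matching rules may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop English articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def strict_match(prediction: str, reference: str) -> bool:
    # Strict: the normalized prediction must equal the normalized reference.
    return normalize(prediction) == normalize(reference)

def loose_match(prediction: str, reference: str) -> bool:
    # Loose (an assumption, not the leaderboard's published rule): the
    # normalized reference just has to occur inside the prediction.
    return normalize(reference) in normalize(prediction)

print(strict_match("The Treaty of 1848.", "treaty of 1848"))               # True
print(loose_match("It ended with the Treaty of 1848.", "treaty of 1848"))  # True
```

Under definitions like these, every strict match is also a loose match, which is consistent with the tables: each model's Loose score is higher than its Strict score.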
Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
---|---|---|---|---|---|---|---|---|
GPT-4o (OpenAI, 2024) | 128,000 | 0.580 | 0.162 | 0.494 | 0.310 | 0.443 | 0.530 | 0.365 |
Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.559 | 0.126 | 0.503 | 0.289 | 0.429 | 0.530 | 0.272 |
GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.470 | 0.109 | 0.356 | 0.207 | 0.314 | 0.487 | 0.262 |
Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.370 | 0.094 | 0.099 | 0.049 | 0.088 | 0.520 | 0.244 |
Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.468 | 0.093 | 0.282 | 0.143 | 0.243 | 0.473 | 0.221 |
Claude 2.1 (Anthropic, 2023) | 200,000 | 0.471 | 0.086 | 0.295 | 0.157 | 0.253 | 0.485 | 0.218 |
Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.404 | 0.070 | 0.279 | 0.138 | 0.245 | 0.430 | 0.186 |
GPT-4 (OpenAI, 2023) | 8,192 | 0.315 | 0.057 | 0.208 | 0.106 | 0.183 | 0.427 | 0.164 |
Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.396 | 0.055 | 0.173 | 0.078 | 0.147 | 0.449 | 0.148 |
LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.390 | 0.064 | 0.157 | 0.075 | 0.131 | 0.443 | 0.108 |
GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.155 | 0.032 | 0.114 | 0.051 | 0.099 | 0.338 | 0.076 |
Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.165 | 0.025 | 0.103 | 0.044 | 0.092 | 0.308 | 0.059 |
Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.148 | 0.008 | 0.165 | 0.079 | 0.135 | 0.346 | 0.022 |
Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.024 | 0.001 | 0.010 | 0.004 | 0.009 | 0.170 | 0.011 |
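ROUGE-1, ROUGE-2, and ROUGE-L measure unigram overlap, bigram overlap, and longest-common-subsequence overlap between a generation and its reference. A minimal sketch using Google's `rouge-score` package; enabling stemming and reporting the F-measure are assumptions about this leaderboard's exact configuration.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# use_stemmer=True is an assumption; the leaderboard's setting may differ.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The treaty was signed in 1848, ending the war."
prediction = "The war ended when the treaty was signed in 1848."

# score() returns a dict of Score tuples with precision, recall, and fmeasure.
for name, s in scorer.score(reference, prediction).items():
    print(f"{name}: F1={s.fmeasure:.3f}")
```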
Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
---|---|---|---|---|---|---|---|---|
Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.715 | 0.265 | 0.463 | 0.279 | 0.394 | 0.535 | 0.541 |
GPT-4o (OpenAI, 2024) | 128,000 | 0.666 | 0.229 | 0.606 | 0.397 | 0.534 | 0.586 | 0.472 |
Claude 2.1 (Anthropic, 2023) | 200,000 | 0.653 | 0.215 | 0.423 | 0.262 | 0.354 | 0.508 | 0.470 |
GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.628 | 0.192 | 0.614 | 0.395 | 0.523 | 0.581 | 0.413 |
Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.659 | 0.180 | 0.585 | 0.353 | 0.495 | 0.582 | 0.410 |
GPT-4 (OpenAI, 2023) | 8,192 | 0.546 | 0.144 | 0.500 | 0.301 | 0.413 | 0.530 | 0.304 |
Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.576 | 0.135 | 0.409 | 0.231 | 0.343 | 0.509 | 0.283 |
Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.573 | 0.113 | 0.500 | 0.285 | 0.404 | 0.521 | 0.271 |
GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.517 | 0.102 | 0.455 | 0.252 | 0.358 | 0.497 | 0.243 |
Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.541 | 0.116 | 0.526 | 0.301 | 0.420 | 0.521 | 0.222 |
Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.540 | 0.088 | 0.330 | 0.172 | 0.264 | 0.475 | 0.202 |
Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.516 | 0.090 | 0.359 | 0.184 | 0.297 | 0.484 | 0.198 |
LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.514 | 0.077 | 0.376 | 0.206 | 0.304 | 0.472 | 0.162 |
Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.188 | 0.012 | 0.207 | 0.108 | 0.168 | 0.379 | 0.036 |
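BLEURT is a learned metric: a regression model fine-tuned to predict human quality judgments, so it can credit paraphrases that n-gram overlap misses. A minimal sketch with the google-research `bleurt` package, assuming the public BLEURT-20 checkpoint (whether this leaderboard uses that checkpoint is not stated):

```python
# pip install git+https://github.com/google-research/bleurt.git
from bleurt import score

# Path to an unzipped checkpoint downloaded from the BLEURT repo;
# BLEURT-20 is the recommended public one, used here as an assumption.
scorer = score.BleurtScorer("BLEURT-20")

scores = scorer.score(
    references=["The treaty was signed in 1848."],
    candidates=["The treaty was signed in 1848, ending the war."],
)
print(scores)  # one float per (reference, candidate) pair
```

The GPT Judge column appears to report the share of answers an LLM judge accepts. The leaderboard's judge model and prompt are not shown here, so everything below, from the prompt wording to the choice of `gpt-4o`, is a hypothetical reconstruction of that kind of evaluation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge prompt; the leaderboard's actual prompt is not shown.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Gold answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Does the model answer convey the same information as the gold answer? "
    "Reply with exactly one word: yes or no."
)

def gpt_judge(question: str, reference: str, prediction: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model is an assumption
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```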