FanOutQA Leaderboards

These leaderboards contain the test-set scores of various models. To submit your own model's generations to the leaderboards, see Test Set Evaluation.

Closed Book

In the closed-book setting, models must answer fan-out questions using only their parametric knowledge; no retrieval tools or evidence text are provided.

| Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
|---|---|---|---|---|---|---|---|---|
| GPT-4o (OpenAI, 2024) | 128,000 | 0.441 | 0.081 | 0.474 | 0.273 | 0.417 | 0.488 | 0.214 |
| Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.474 | 0.086 | 0.472 | 0.266 | 0.397 | 0.488 | 0.210 |
| Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.515 | 0.110 | 0.444 | 0.244 | 0.369 | 0.495 | 0.200 |
| GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.460 | 0.101 | 0.482 | 0.290 | 0.409 | 0.493 | 0.199 |
| Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.448 | 0.088 | 0.455 | 0.250 | 0.378 | 0.470 | 0.196 |
| Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.470 | 0.081 | 0.302 | 0.158 | 0.254 | 0.466 | 0.186 |
| Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.450 | 0.072 | 0.429 | 0.240 | 0.359 | 0.478 | 0.170 |
| Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.466 | 0.068 | 0.463 | 0.264 | 0.387 | 0.478 | 0.157 |
| GPT-4 (OpenAI, 2023) | 8,192 | 0.355 | 0.066 | 0.313 | 0.177 | 0.267 | 0.419 | 0.149 |
| GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.398 | 0.058 | 0.401 | 0.227 | 0.342 | 0.455 | 0.145 |
| LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.440 | 0.058 | 0.285 | 0.149 | 0.238 | 0.441 | 0.120 |
| Claude 2.1 (Anthropic, 2023) | 200,000 | 0.341 | 0.041 | 0.412 | 0.208 | 0.344 | 0.426 | 0.110 |
| Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.427 | 0.055 | 0.260 | 0.123 | 0.212 | 0.449 | 0.102 |
| Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.187 | 0.023 | 0.194 | 0.088 | 0.158 | 0.367 | 0.048 |
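
The Loose and Strict columns above are string-match accuracies: Loose gives partial credit for each reference answer string found in a generation, while Strict requires all of them to appear. The sketch below illustrates the idea with a simple normalization (lowercasing, stripping punctuation and English articles); it is an approximation for intuition only, not the official fanoutqa scorer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def loose_strict(reference_answers: list[str], generation: str) -> tuple[float, bool]:
    """Loose: fraction of reference strings present in the generation.
    Strict: True only when every reference string is present."""
    gen = normalize(generation)
    hits = [normalize(ref) in gen for ref in reference_answers]
    return sum(hits) / len(hits), all(hits)

# A fan-out answer with three parts; this generation covers two of them.
refs = ["Gustave Eiffel", "1889", "330 metres"]
gen = "It was designed by Gustave Eiffel and opened in 1889."
print(loose_strict(refs, gen))  # (0.666..., False)
```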

Open Book

In the open-book setting, models are given access to retrieval tools and must retrieve the relevant Wikipedia articles to answer fan-out questions.
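
As an illustration of this setting, the sketch below drives a function-calling model through a retrieve-then-answer loop. The `search`/`get_article` tools and the scripted model are hypothetical stand-ins for the benchmark harness's actual retrieval interface.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the benchmark's Wikipedia retrieval tools;
# the real harness exposes retrieval through the model's tool-calling API.
def search(query: str) -> list[str]:
    """Return candidate Wikipedia article titles for a query (stubbed)."""
    return ["Eiffel Tower"]

def get_article(title: str) -> str:
    """Return the text of a Wikipedia article (stubbed)."""
    return "The Eiffel Tower opened in 1889 and is 330 metres tall."

@dataclass
class ToolCall:
    name: str
    arguments: dict

def run_open_book(model_step, question: str, max_steps: int = 10) -> str:
    """Drive a function-calling model: at each step it either requests a
    retrieval tool call or returns its final answer as a string."""
    transcript = [("user", question)]
    for _ in range(max_steps):
        action = model_step(transcript)  # ToolCall or final answer str
        if isinstance(action, str):
            return action
        result = (search(**action.arguments) if action.name == "search"
                  else get_article(**action.arguments))
        transcript.append(("tool", str(result)))
    return ""  # step budget exhausted without a final answer

# Scripted stand-in for a model, for illustration only: search, read, answer.
def scripted_model(transcript):
    tool_results = sum(1 for role, _ in transcript if role == "tool")
    if tool_results == 0:
        return ToolCall("search", {"query": "Eiffel Tower"})
    if tool_results == 1:
        return ToolCall("get_article", {"title": "Eiffel Tower"})
    return "It opened in 1889 and is 330 metres tall."

print(run_open_book(scripted_model, "When did the Eiffel Tower open, and how tall is it?"))
```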

| Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
|---|---|---|---|---|---|---|---|---|
| GPT-4o (OpenAI, 2024) | 128,000 | 0.580 | 0.162 | 0.494 | 0.310 | 0.443 | 0.530 | 0.365 |
| Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.559 | 0.126 | 0.503 | 0.289 | 0.429 | 0.530 | 0.272 |
| GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.470 | 0.109 | 0.356 | 0.207 | 0.314 | 0.487 | 0.262 |
| Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.370 | 0.094 | 0.099 | 0.049 | 0.088 | 0.520 | 0.244 |
| Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.468 | 0.093 | 0.282 | 0.143 | 0.243 | 0.473 | 0.221 |
| Claude 2.1 (Anthropic, 2023) | 200,000 | 0.471 | 0.086 | 0.295 | 0.157 | 0.253 | 0.485 | 0.218 |
| Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.404 | 0.070 | 0.279 | 0.138 | 0.245 | 0.430 | 0.186 |
| GPT-4 (OpenAI, 2023) | 8,192 | 0.315 | 0.057 | 0.208 | 0.106 | 0.183 | 0.427 | 0.164 |
| Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.396 | 0.055 | 0.173 | 0.078 | 0.147 | 0.449 | 0.148 |
| LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.390 | 0.064 | 0.157 | 0.075 | 0.131 | 0.443 | 0.108 |
| GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.155 | 0.032 | 0.114 | 0.051 | 0.099 | 0.338 | 0.076 |
| Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.165 | 0.025 | 0.103 | 0.044 | 0.092 | 0.308 | 0.059 |
| Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.148 | 0.008 | 0.165 | 0.079 | 0.135 | 0.346 | 0.022 |
| Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.024 | 0.001 | 0.010 | 0.004 | 0.009 | 0.170 | 0.011 |

Evidence Provided

In the evidence-provided setting, models are given the text of the articles needed to answer each fan-out question.
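
A natural way to run this setting is to place the provided article texts in the prompt ahead of the question, truncating to fit the model's context window. The sketch below shows one such prompt builder; the function name and prompt wording are illustrative assumptions, not the benchmark's actual template.

```python
def build_evidence_prompt(question: str, articles: dict[str, str],
                          max_chars: int = 200_000) -> str:
    """Concatenate the provided article texts ahead of the question,
    naively truncating to fit a context budget."""
    evidence = "\n\n".join(f"# {title}\n{text}" for title, text in articles.items())
    return (
        "Answer the question using only the provided Wikipedia articles.\n\n"
        f"{evidence[:max_chars]}\n\nQuestion: {question}"
    )

prompt = build_evidence_prompt(
    "When did the Eiffel Tower open, and how tall is it?",
    {"Eiffel Tower": "The Eiffel Tower opened in 1889 and is 330 metres tall."},
)
print(prompt)
```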

| Model | Context Size (tokens) | Loose | Strict | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEURT | GPT Judge |
|---|---|---|---|---|---|---|---|---|
| Claude 3 Opus (Anthropic, 2024) | 200,000 | 0.715 | 0.265 | 0.463 | 0.279 | 0.394 | 0.535 | 0.541 |
| GPT-4o (OpenAI, 2024) | 128,000 | 0.666 | 0.229 | 0.606 | 0.397 | 0.534 | 0.586 | 0.472 |
| Claude 2.1 (Anthropic, 2023) | 200,000 | 0.653 | 0.215 | 0.423 | 0.262 | 0.354 | 0.508 | 0.470 |
| GPT-4-turbo (OpenAI, 2023) | 128,000 | 0.628 | 0.192 | 0.614 | 0.395 | 0.523 | 0.581 | 0.413 |
| Llama3.3 70b distilled DS - 32k ss (SambaNova Systems, 2025) | 32,000 | 0.659 | 0.180 | 0.585 | 0.353 | 0.495 | 0.582 | 0.410 |
| GPT-4 (OpenAI, 2023) | 8,192 | 0.546 | 0.144 | 0.500 | 0.301 | 0.413 | 0.530 | 0.304 |
| Mixtral-8x7B Instruct (Jiang et al., 2024) | 32,000 | 0.576 | 0.135 | 0.409 | 0.231 | 0.343 | 0.509 | 0.283 |
| Llama 3 70B Instruct (Meta, 2024) | 8,192 | 0.573 | 0.113 | 0.500 | 0.285 | 0.404 | 0.521 | 0.271 |
| GPT-3.5-turbo (OpenAI, 2023) | 16,384 | 0.517 | 0.102 | 0.455 | 0.252 | 0.358 | 0.497 | 0.243 |
| Mistral-Small-Instruct-2409 (22B) (Mistral AI, 2024) | 32,000 | 0.541 | 0.116 | 0.526 | 0.301 | 0.420 | 0.521 | 0.222 |
| Mistral-7B Instruct (Jiang et al., 2023) | 32,000 | 0.540 | 0.088 | 0.330 | 0.172 | 0.264 | 0.475 | 0.202 |
| Mistral-Large-Instruct-2407 (123B) (Mistral AI, 2024) | 128,000 | 0.516 | 0.090 | 0.359 | 0.184 | 0.297 | 0.484 | 0.198 |
| LLaMA 2 Chat 70B (Touvron et al., 2023) | 4,096 | 0.514 | 0.077 | 0.376 | 0.206 | 0.304 | 0.472 | 0.162 |
| Gemma 1.1 7B IT (Gemma Team, 2024) | 8,192 | 0.188 | 0.012 | 0.207 | 0.108 | 0.168 | 0.379 | 0.036 |