API Reference#

Main Entrypoints#

fanoutqa.load_dev(fp: str | bytes | PathLike | None = None) → list[DevQuestion][source]#

Load all questions from the development set.

Parameters:

fp – The path to load the questions from (defaults to bundled FOQA).

fanoutqa.load_test(fp: str | bytes | PathLike | None = None) → list[TestQuestion][source]#

Load all questions from the test set.

Parameters:

fp – The path to load the questions from (defaults to bundled FOQA).
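
For example, a minimal sketch of loading both splits (assumes the fanoutqa package is installed):

# load the bundled dev and test splits
import fanoutqa

dev_questions = fanoutqa.load_dev()    # list[DevQuestion]
test_questions = fanoutqa.load_test()  # list[TestQuestion]
print(dev_questions[0].question)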

fanoutqa.eval.evaluate(
questions: list[DevQuestion],
answers: list[Answer],
**kwargs,
) → EvaluationScore[source]#

Evaluate all FOQA metrics across the given questions and generated answers.

Parameters:
  • questions – The questions and reference answers, as loaded by the dataset.

  • answers – The generated answers to score. These should be dictionaries of the form {"id": "...", "answer": "..."}.

  • only_score_answered – Whether to score only the questions that have a provided answer (True), or to treat unanswered questions as scoring 0 (False, default). This is useful for evaluating a system on only a subset of the dataset.

  • llm_cache_key – If this is provided, cache the LLM-as-judge generations with this key. We recommend setting this to a human-readable key for each system under test.
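
For example, a hedged sketch of scoring a system's answers over the dev set (your_answer_fn is a placeholder for your own model call):

# score generated answers against the dev set
import fanoutqa
from fanoutqa.eval import evaluate

questions = fanoutqa.load_dev()
answers = [{"id": q.id, "answer": your_answer_fn(q)} for q in questions]  # your_answer_fn is hypothetical
scores = evaluate(questions, answers, llm_cache_key="my-system")
print(scores.acc.loose, scores.rouge.rouge1.fscore, scores.gpt)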

Wikipedia Retrieval#

fanoutqa.wiki_search(query: str) → list[Evidence][source]#

Return a list of Evidence documents given the search query.

fanoutqa.wiki_content(doc: Evidence) → str[source]#

Get the page content in Markdown, including tables and infoboxes, appropriate for displaying to an LLM.
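
For example, a minimal sketch of searching Wikipedia and reading the top result (the query string is illustrative):

# search Wikipedia and render the top hit as Markdown
import fanoutqa

results = fanoutqa.wiki_search("Python (programming language)")  # list[Evidence]
if results:
    page_markdown = fanoutqa.wiki_content(results[0])
    print(page_markdown[:500])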

Models#

class fanoutqa.models.Evidence(pageid: int, revid: int, title: str, url: str)[source]#

A reference to a Wikipedia article at a given point in time.

pageid: int#

Wikipedia page ID.

revid: int#

Wikipedia revision ID of the page as of the dataset epoch. Often referred to as oldid in the Wikipedia API docs.

title: str#

Title of page.

url: str#

Link to page.

class fanoutqa.models.DevSubquestion(
id: str,
question: str,
decomposition: list[DevSubquestion],
answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str,
depends_on: list[str],
evidence: Evidence | None,
)[source]#

A human-written decomposition of a top-level question.

id: str#

The ID of the question.

question: str#

The question for the system to answer.

decomposition: list[DevSubquestion]#

A human-written decomposition of the question.

answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str#

The human-written reference answer to this subquestion.

depends_on: list[str]#

The IDs of subquestions that this subquestion requires answering first.

evidence: Evidence | None#

The Wikipedia page used by the human annotator to answer this question. If this is None, the subquestion will have a decomposition instead.

class fanoutqa.models.DevQuestion(
id: str,
question: str,
decomposition: list[DevSubquestion],
answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str,
categories: list[str],
)[source]#

A top-level question in the FOQA dataset and its decomposition.

id: str#

The ID of the question.

question: str#

The top-level question for the system to answer.

decomposition: list[DevSubquestion]#

A human-written decomposition of the question.

answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str#

A human-written reference answer to the question.

property necessary_evidence: list[Evidence]#

A list of all the evidence used by human annotators to answer the question.
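
For example, a minimal sketch of walking a dev question's decomposition tree (subquestions without evidence are decomposed further):

# print a question and its nested subquestions, indented by depth
import fanoutqa

def print_decomposition(subquestions, depth=1):
    for sq in subquestions:
        print("  " * depth + sq.question)
        print_decomposition(sq.decomposition, depth + 1)

q = fanoutqa.load_dev()[0]
print(q.question)
print_decomposition(q.decomposition)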

class fanoutqa.models.TestQuestion(id: str, question: str, necessary_evidence: list[Evidence], categories: list[str])[source]#

A top-level question in the FOQA dataset, without its decomposition or answer.

id: str#

The ID of the question.

question: str#

The top-level question for the system to answer.

necessary_evidence: list[Evidence]#

A list of all the evidence used by human annotators to answer the question.

class fanoutqa.eval.models.AccuracyScore(loose: float, strict: float)[source]#
loose: float#

Loose accuracy: The mean, over all questions, of the proportion of reference answer strings found in the generation.

strict: float#

Strict accuracy: The proportion of questions with a loose accuracy of 1.0.
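
As an illustration of the distinction (not the library's implementation), a single question's loose score is the fraction of its reference strings found in the generation, and the question counts toward strict accuracy only when that fraction is 1.0:

# illustrative only: per-question loose vs. strict scoring
reference_strings = ["Paris", "1889"]
generation = "The Eiffel Tower was completed in Paris."
loose = sum(s.lower() in generation.lower() for s in reference_strings) / len(reference_strings)  # 0.5
strict = loose == 1.0  # False: "1889" is missing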

class fanoutqa.eval.models.RougeScorePart(precision: float, recall: float, fscore: float)[source]#
class fanoutqa.eval.models.RougeScore(
rouge1: fanoutqa.eval.models.RougeScorePart,
rouge2: fanoutqa.eval.models.RougeScorePart,
rougeL: fanoutqa.eval.models.RougeScorePart,
)[source]#
class fanoutqa.eval.models.EvaluationSingleScore(question_id: str, acc: float, rouge: fanoutqa.eval.models.RougeScore, bleurt: float, gpt: int)[source]#
class fanoutqa.eval.models.EvaluationScore(
acc: fanoutqa.eval.models.AccuracyScore,
rouge: fanoutqa.eval.models.RougeScore,
bleurt: float,
gpt: float,
raw: list[fanoutqa.eval.models.EvaluationSingleScore],
)[source]#
class fanoutqa.eval.models.Answer[source]#

A dictionary of the form {"id": "...", "answer": "..."}.

Baseline Retriever#

This module contains a baseline implementation of a retriever for use with long Wikipedia articles.

class fanoutqa.retrieval.RetrievalResult(title: str, content: str)[source]#
title: str#

The title of the article this fragment comes from.

content: str#

The content of the fragment.

class fanoutqa.retrieval.Corpus(documents: list[Evidence], doc_len: int = 2048)[source]#

A corpus of wiki docs. Indexes the docs on creation, normalizing the text beforehand with lemmatization.

Splits the documents into chunks no longer than a given length, preferring to split on paragraph and sentence boundaries. Documents will be converted to Markdown.

Uses BM25+ (Lv and Zhai, 2011), a TF-IDF-based approach, to retrieve document fragments.

To retrieve chunks corresponding to a query, iterate over Corpus.best(query).

# example of how to use in the Evidence Provided setting
import fanoutqa

q = fanoutqa.load_dev()[0]  # a DevQuestion with its necessary evidence
prompt = "..."
corpus = fanoutqa.retrieval.Corpus(q.necessary_evidence)
for fragment in corpus.best(q.question):
    # use your own structured prompt format here
    prompt += f"# {fragment.title}\n{fragment.content}\n\n"
Parameters:
  • documents – The list of evidence documents to index.

  • doc_len – The maximum length, in characters, of each chunk.

best(q: str) → Iterable[RetrievalResult][source]#

Yield the best matching fragments to the given query.

fanoutqa.retrieval.chunk_text(text, max_chunk_size=1024, chunk_on=('\n\n', '\n', '. ', ', ', ' '), chunker_i=0)[source]#

Recursively chunks text into a list of str, with each element no longer than max_chunk_size. Prefers splitting on the elements of chunk_on, in order.
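
For example, a minimal sketch of chunking a long document before indexing or prompting (long_markdown is a placeholder, e.g. the output of fanoutqa.wiki_content):

# split a long document into chunks of at most 1024 characters
from fanoutqa.retrieval import chunk_text

chunks = chunk_text(long_markdown, max_chunk_size=1024)
print(len(chunks), max(len(c) for c in chunks))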