API Reference#

Main Entrypoints#

fanoutqa.load_dev(fp: str | bytes | PathLike | None = None) list[DevQuestion][source]#

Load all questions from the development set.

Parameters:

fp – The path to load the questions from (defaults to bundled FOQA).

fanoutqa.load_test(fp: str | bytes | PathLike | None = None) list[TestQuestion][source]#

Load all questions from the test set.

Parameters:

fp – The path to load the questions from (defaults to bundled FOQA).
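For example, both loaders can be called with no arguments to use the bundled data:

import fanoutqa

dev_questions = fanoutqa.load_dev()    # list[DevQuestion]
test_questions = fanoutqa.load_test()  # list[TestQuestion]
print(len(dev_questions), dev_questions[0].question)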

fanoutqa.eval.evaluate(
questions: list[DevQuestion],
answers: list[Answer],
**kwargs,
) EvaluationScore[source]#

Evaluate all FOQA metrics across the given questions and generated answers.

Parameters:
  • questions – The questions and reference answers, as loaded by the dataset.

  • answers – The generated answers to score. These should be dictionaries of the form {"id": "...", "answer": "..."}.

  • only_score_answered – Whether to only score questions that have an answer (True), or to count unanswered questions as scoring 0 (False, the default). Setting this to True is useful for evaluating on only a subset of the dataset.

  • llm_cache_key – If this is provided, cache the LLM-as-judge generations with this key. We recommend setting this to a human-readable key for each system under test.
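For example, a minimal sketch of scoring a system on the dev set (my_system is a placeholder for your own question-answering pipeline, and the evaluation extras are assumed to be installed):

import fanoutqa
from fanoutqa.eval import evaluate

questions = fanoutqa.load_dev()
# one Answer dict per question the system answered
answers = [{"id": q.id, "answer": my_system(q.question)} for q in questions]
score = evaluate(questions, answers, llm_cache_key="my-system")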

Wikipedia Retrieval#

fanoutqa.wiki_search(query: str) list[Evidence][source]#

Return a list of Evidence documents given the search query.

fanoutqa.wiki_content(doc: Evidence) str[source]#

Get the page content in Markdown, including tables and infoboxes, appropriate for displaying to an LLM.
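As a short sketch, the two helpers can be combined to search Wikipedia and render the top hit for a prompt (this fetches live data, so network access is assumed):

import fanoutqa

# wiki_search returns Evidence references; wiki_content fetches a revision's text as Markdown
evidence = fanoutqa.wiki_search("2018 Winter Olympics")
if evidence:
    page_md = fanoutqa.wiki_content(evidence[0])
    print(page_md[:500])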

Models#

class fanoutqa.models.Evidence(pageid: int, revid: int, title: str, url: str)[source]#

A reference to a Wikipedia article at a given point in time.

pageid: int#

Wikipedia page ID.

revid: int#

Wikipedia revision ID of the page as of the dataset epoch. Often referred to as oldid in the Wikipedia API docs.

title: str#

Title of page.

url: str#

Link to page.

class fanoutqa.models.DevSubquestion(
id: str,
question: str,
decomposition: list[DevSubquestion],
answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str,
depends_on: list[str],
evidence: Evidence | None,
)[source]#

A single human-written subquestion within the decomposition of a top-level question.

id: str#

The ID of the question.

question: str#

The question for the system to answer.

decomposition: list[DevSubquestion]#

A human-written decomposition of the question.

answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str#

The human-written reference answer to this subquestion.

depends_on: list[str]#

The IDs of subquestions that this subquestion requires answering first.

evidence: Evidence | None#

The Wikipedia page used by the human annotator to answer this question. If this is None, the question will have a decomposition.
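For example, a decomposition tree can be walked recursively; the sketch below takes any list of subquestions, such as the decomposition field of a DevQuestion or DevSubquestion:

from fanoutqa.models import DevSubquestion

def print_tree(subquestions: list[DevSubquestion], depth: int = 0):
    # print each subquestion with its answer, dependencies, and evidence page
    for sq in subquestions:
        deps = ", ".join(sq.depends_on) or "none"
        page = sq.evidence.title if sq.evidence else "(answered via its own decomposition)"
        print("  " * depth + f"{sq.question} -> {sq.answer!r} [depends on: {deps}; evidence: {page}]")
        print_tree(sq.decomposition, depth + 1)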

class fanoutqa.models.DevQuestion(
id: str,
question: str,
decomposition: list[DevSubquestion],
answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str,
categories: list[str],
)[source]#

A top-level question in the FOQA dataset and its decomposition.

id: str#

The ID of the question.

question: str#

The top-level question for the system to answer.

decomposition: list[DevSubquestion]#

A human-written decomposition of the question.

answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str#

A human-written reference answer to the question.

property necessary_evidence: list[Evidence]#

A list of all the evidence used by human annotators to answer the question.

class fanoutqa.models.TestQuestion(id: str, question: str, necessary_evidence: list[Evidence], categories: list[str])[source]#

A top-level question in the FOQA dataset, without its decomposition or answer.

id: str#

The ID of the question.

question: str#

The top-level question for the system to answer.

necessary_evidence: list[Evidence]#

A list of all the evidence used by human annotators to answer the question.
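Since test questions carry neither reference answers nor decompositions, a typical sketch is to answer each one and collect Answer dicts for scoring (my_system is again a placeholder for your own pipeline):

import fanoutqa

answers = []
for q in fanoutqa.load_test():
    # necessary_evidence lists the pages annotators used; closed-book systems may ignore it
    answers.append({"id": q.id, "answer": my_system(q.question, q.necessary_evidence)})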

class fanoutqa.eval.models.AccuracyScore(loose: float, strict: float)[source]#

loose: float#

Loose accuracy: The mean proportion of reference strings found in the generation.

strict: float#

Strict accuracy: The proportion of questions with a loose accuracy of 1.0.

class fanoutqa.eval.models.RougeScorePart(precision: float, recall: float, fscore: float)[source]#

class fanoutqa.eval.models.RougeScore(
rouge1: fanoutqa.eval.models.RougeScorePart,
rouge2: fanoutqa.eval.models.RougeScorePart,
rougeL: fanoutqa.eval.models.RougeScorePart,
)[source]#

class fanoutqa.eval.models.EvaluationScore(
acc: fanoutqa.eval.models.AccuracyScore,
rouge: fanoutqa.eval.models.RougeScore,
bleurt: float,
gpt: float,
)[source]#

class fanoutqa.eval.models.Answer[source]#

A dictionary of the form {"id": "...", "answer": "..."}.
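As an illustration of the score structure, the EvaluationScore returned by fanoutqa.eval.evaluate can be unpacked like this (assuming score is the result of an earlier evaluate call):

print("loose acc:  ", score.acc.loose)
print("strict acc: ", score.acc.strict)
print("ROUGE-L F1: ", score.rouge.rougeL.fscore)
print("BLEURT:     ", score.bleurt)
print("GPT judge:  ", score.gpt)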

Baseline Retriever#

This module contains a baseline implementation of a retriever for use with long Wikipedia articles.

class fanoutqa.retrieval.RetrievalResult(title: str, content: str)[source]#

title: str#

The title of the article this fragment comes from.

content: str#

The content of the fragment.

class fanoutqa.retrieval.Corpus(documents: list[Evidence], doc_len: int = 2048)[source]#

A corpus of wiki docs. Indexes the docs on creation, normalizing the text beforehand with lemmatization.

Splits the documents into chunks no longer than a given length, preferring splitting on paragraph and sentence boundaries. Documents will be converted to Markdown.

Uses BM25+ (Lv and Zhai, 2011), a TF-IDF-based approach, to retrieve document fragments.

To retrieve chunks corresponding to a query, iterate over Corpus.best(query).

# example of how to use in the Evidence Provided setting
# (q is a DevQuestion loaded via fanoutqa.load_dev())
import fanoutqa.retrieval

prompt = "..."
corpus = fanoutqa.retrieval.Corpus(q.necessary_evidence)
for fragment in corpus.best(q.question):
    # use your own structured prompt format here
    prompt += f"# {fragment.title}\n{fragment.content}\n\n"
Parameters:
  • documents – The list of evidence documents to index.

  • doc_len – The maximum length, in characters, of each chunk.

best(q: str) Iterable[RetrievalResult][source]#

Yield the best-matching fragments for the given query.

fanoutqa.retrieval.chunk_text(text, max_chunk_size=1024, chunk_on=('\n\n', '\n', '. ', ', ', ' '), chunker_i=0)[source]#

Recursively chunks text into a list of str, with each element no longer than max_chunk_size. Prefers splitting on the elements of chunk_on, in order.
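For example, a small illustrative call (the default chunk_on separators shown above apply):

from fanoutqa.retrieval import chunk_text

text = "First paragraph.\n\nSecond paragraph, with two clauses. And a second sentence."
for chunk in chunk_text(text, max_chunk_size=40):
    print(repr(chunk))  # each chunk is at most 40 characters, split preferentially on paragraph/sentence boundaries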