API Reference#
Main Entrypoints#
- fanoutqa.load_dev(fp: str | bytes | PathLike | None = None) → list[DevQuestion] [source]#
Load all questions from the development set.
- Parameters:
fp – The path to load the questions from (defaults to bundled FOQA).
- fanoutqa.load_test(fp: str | bytes | PathLike | None = None) → list[TestQuestion] [source]#
Load all questions from the test set.
- Parameters:
fp – The path to load the questions from (defaults to bundled FOQA).
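Both loaders can be called with no arguments to use the bundled data. A minimal sketch (field access follows the model definitions below):

```python
import fanoutqa

# Load the bundled development and test splits.
dev_questions = fanoutqa.load_dev()    # list[DevQuestion]
test_questions = fanoutqa.load_test()  # list[TestQuestion]

print(f"{len(dev_questions)} dev questions, {len(test_questions)} test questions")
print(dev_questions[0].question)
```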
- fanoutqa.eval.evaluate(questions: list[DevQuestion], answers: list[Answer], **kwargs)[source]#
Evaluate all FOQA metrics across the given questions and generated answers.
- Parameters:
questions – The questions and reference answers, as loaded by the dataset.
answers – The generated answers to score. These should be dictionaries like {"id": "...", "answer": "..."}.
only_score_answered – Whether to score only the questions that have an answer (True), or to treat unanswered questions as scoring 0 (False, the default). This is useful for evaluating only a subset of the dataset.
llm_cache_key – If this is provided, cache the LLM-as-judge generations with this key. We recommend setting this to a human-readable key for each system under test.
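A sketch of an evaluation loop; your_model_answer is a hypothetical stand-in for the system under test, and the LLM-as-judge metric may need additional setup (such as API credentials) that is not shown here:

```python
import fanoutqa
from fanoutqa.eval import evaluate


def your_model_answer(question: str) -> str:
    """Hypothetical stand-in for the system under test."""
    return "..."


dev_questions = fanoutqa.load_dev()

# Answers are plain dicts keyed by question id.
answers = [{"id": q.id, "answer": your_model_answer(q.question)} for q in dev_questions]

scores = evaluate(
    dev_questions,
    answers,
    llm_cache_key="my-system-v1",  # cache the LLM-as-judge generations per system
)
```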
Wikipedia Retrieval#
Models#
- class fanoutqa.models.Evidence(pageid: int, revid: int, title: str, url: str)[source]#
A reference to a Wikipedia article at a given point in time.
- class fanoutqa.models.DevSubquestion(id: str, question: str, decomposition: list[DevSubquestion], answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str, depends_on: list[str], evidence: Evidence | None)[source]#
A human-written decomposition of a top-level question.
- decomposition: list[DevSubquestion]#
A human-written decomposition of the question.
- class fanoutqa.models.DevQuestion(id: str, question: str, decomposition: list[DevSubquestion], answer: dict[str, bool | int | float | str] | list[bool | int | float | str] | bool | int | float | str, categories: list[str])[source]#
A top-level question in the FOQA dataset and its decomposition.
- decomposition: list[DevSubquestion]#
A human-written decomposition of the question.
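Since each DevSubquestion can contain its own decomposition, the question tree can be walked recursively; a small sketch using only the fields documented above:

```python
from fanoutqa.models import DevQuestion, DevSubquestion


def walk(sub: DevSubquestion, depth: int = 1) -> None:
    """Recursively print a subquestion and the subquestions it decomposes into."""
    print("  " * depth + sub.question)
    for child in sub.decomposition:
        walk(child, depth + 1)


def print_tree(q: DevQuestion) -> None:
    """Print a top-level question followed by its full decomposition tree."""
    print(q.question)
    for sub in q.decomposition:
        walk(sub)
```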
- class fanoutqa.models.TestQuestion(id: str, question: str, necessary_evidence: list[Evidence], categories: list[str])[source]#
A top-level question in the FOQA dataset, without its decomposition or answer.
- class fanoutqa.eval.models.RougeScore(rouge1: fanoutqa.eval.models.RougeScorePart, rouge2: fanoutqa.eval.models.RougeScorePart, rougeL: fanoutqa.eval.models.RougeScorePart)[source]#
- class fanoutqa.eval.models.EvaluationSingleScore(question_id: str, acc: float, rouge: fanoutqa.eval.models.RougeScore, bleurt: float, gpt: int)[source]#
- class fanoutqa.eval.models.EvaluationScore(acc: fanoutqa.eval.models.AccuracyScore, rouge: fanoutqa.eval.models.RougeScore, bleurt: float, gpt: float, raw: list[fanoutqa.eval.models.EvaluationSingleScore])[source]#
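The aggregate and per-question scores can be read straight off these dataclasses; a sketch, assuming scores is the EvaluationScore produced by fanoutqa.eval.evaluate():

```python
# `scores` is assumed to be an EvaluationScore, e.g. the result of fanoutqa.eval.evaluate().
print("accuracy:", scores.acc)
print("ROUGE-1:", scores.rouge.rouge1)
print("BLEURT:", scores.bleurt)
print("GPT judge:", scores.gpt)

# Per-question breakdown.
for single in scores.raw:
    print(single.question_id, single.acc, single.gpt)
```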
Baseline Retriever#
This module contains a baseline implementation of a retriever for use with long Wikipedia articles.
- class fanoutqa.retrieval.Corpus(documents: list[Evidence], doc_len: int = 2048)[source]#
A corpus of wiki docs. Indexes the docs on creation, normalizing the text beforehand with lemmatization.
Splits the documents into chunks no longer than a given length, preferring splitting on paragraph and sentence boundaries. Documents will be converted to Markdown.
Uses BM25+ (Lv and Zhai, 2011), a TF-IDF based approach to retrieve document fragments.
To retrieve chunks corresponding to a query, iterate over Corpus.best(query).

```python
# example of how to use in the Evidence Provided setting
prompt = "..."
corpus = fanoutqa.retrieval.Corpus(q.necessary_evidence)
for fragment in corpus.best(q.question):
    # use your own structured prompt format here
    prompt += f"# {fragment.title}\n{fragment.content}\n\n"
```
- Parameters:
documents – The list of evidences to index
doc_len – The maximum length, in characters, of each chunk
- best(q: str) → Iterable[RetrievalResult] [source]#
Yield the best matching fragments to the given query.
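Because best() yields fragments lazily, a fixed number of top fragments can be taken with itertools.islice; a small sketch reusing the title field shown in the example above:

```python
from itertools import islice

import fanoutqa
from fanoutqa.retrieval import Corpus

q = fanoutqa.load_test()[0]
corpus = Corpus(q.necessary_evidence)

# Keep only the first few fragments to stay within a prompt budget.
for result in islice(corpus.best(q.question), 5):
    print(result.title)
```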