Evaluator Module

Tools for assessing detection quality.

IntertextEvaluator

Evaluate detection results against ground truth annotations.

Computes precision, recall, F1, and other metrics for intertextual link detection.

locisimiles.evaluator.IntertextEvaluator

IntertextEvaluator(
    *,
    query_doc: Document,
    source_doc: Document,
    ground_truth_csv: str | DataFrame,
    pipeline: Pipeline,
    top_k: int = 5,
    threshold: float | str = "auto",
    auto_threshold_metric: str = "smr",
)

Evaluator for measuring intertextuality detection performance.

This class computes sentence-level and document-level evaluation metrics by comparing pipeline predictions against ground truth annotations.

Supported metrics
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1: Harmonic mean of precision and recall
  • SMR: Source Match Rate (an error rate; lower is better)
  • Accuracy: (TP + TN) / Total
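
The standard metrics above follow directly from confusion counts. A minimal plain-Python sketch, independent of the library (SMR's exact formula is not spelled out in this reference, so it is omitted here):

```python
def precision(tp, fp):
    # TP / (TP + FP); defined as 0.0 when nothing was predicted positive
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # TP / (TP + FN); defined as 0.0 when there are no gold positives
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def accuracy(tp, tn, fp, fn):
    # (TP + TN) / Total
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```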

The evaluator runs the pipeline once during initialization and caches the results for efficient metric computation across different thresholds.
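
The benefit of this cache-once design can be illustrated in plain Python: model scoring happens a single time, after which any number of thresholds is just a cheap filter over the stored probabilities. Source IDs and scores below are illustrative, not real pipeline output:

```python
# Hypothetical cached scores for one query: (source_id, probability) pairs
# produced by a single (expensive) pipeline run.
cached = [("verg. aen. 4.173", 0.91), ("verg. aen. 1.1", 0.42),
          ("verg. ecl. 1.1", 0.08)]

def positives(predictions, threshold):
    # Re-applying a threshold filters cached probabilities --
    # no model inference is repeated.
    return [sid for sid, prob in predictions if prob >= threshold]

strict = positives(cached, 0.9)   # only high-confidence matches
lenient = positives(cached, 0.3)  # also admits weaker candidates
```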

Attributes
query_doc

The query document being analyzed.

source_doc

The source document containing potential quotation origins.

predictions

Cached pipeline predictions (CandidateJudgeOutput format).

TYPE: CandidateJudgeOutput

threshold

Probability threshold for positive classification.

gold_labels

Ground truth annotations loaded from CSV.

Example
from locisimiles.evaluator import IntertextEvaluator
from locisimiles.pipeline import TwoStagePipeline
from locisimiles.document import Document

# Load documents
query_doc = Document("hieronymus.csv")
source_doc = Document("vergil.csv")

# Initialize pipeline
pipeline = TwoStagePipeline(device="cpu")

# Create evaluator with auto-threshold
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="ground_truth.csv",
    pipeline=pipeline,
    top_k=10,
    threshold="auto",  # Automatically find best threshold
    auto_threshold_metric="smr",
)

# Get evaluation metrics
print(evaluator.evaluate(average="micro"))
print(evaluator.evaluate(average="macro"))

# Evaluate single query
print(evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))

# Find optimal threshold for different metrics
best, all_thresholds = evaluator.find_best_threshold(metric="f1")
print(f"Best F1 at threshold {best['best_threshold']}: {best['best_f1']:.3f}")

evaluate_single_query

evaluate_single_query(query_id: str) -> Dict[str, float]

Compute metrics for one query sentence.

query_ids_with_match

query_ids_with_match() -> List[str]

Return query IDs that have ground truth labels.

evaluate_all_queries

evaluate_all_queries(
    with_match_only: bool = False,
) -> DataFrame

Compute metrics for every query sentence (cached).

evaluate

evaluate(
    *, average: str = "macro", with_match_only: bool = False
) -> Dict[str, float]

Compute aggregated metrics across queries.

  • Precision, Recall, F1, Accuracy: ALWAYS computed on queries with at least one ground truth match (otherwise these metrics are meaningless).
  • FPR, FNR, SMR: Computed on ALL queries by default (measures false alarms on queries that shouldn't have matches). If with_match_only=True, these are also restricted to queries with matches.
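
The difference between the two averaging modes can be sketched in plain Python: micro pools the confusion counts over all queries before computing the metric once, while macro computes the metric per query and takes the unweighted mean. The per-query counts below are illustrative:

```python
# Per-query confusion counts as (TP, FP, FN) tuples. Illustrative numbers.
per_query = [(3, 1, 0), (0, 0, 2), (1, 1, 1)]

def prec(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

# Micro: sum counts across queries first, then compute once.
tp = sum(q[0] for q in per_query)
fp = sum(q[1] for q in per_query)
micro_precision = prec(tp, fp)

# Macro: compute per query, then average the per-query scores.
macro_precision = sum(prec(t, f) for t, f, _ in per_query) / len(per_query)
```

Micro weights every decision equally, so queries with many candidates dominate; macro weights every query equally regardless of size.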

confusion_matrix

confusion_matrix(query_id: str) -> ndarray

Return 2x2 confusion matrix [[TP,FP],[FN,TN]] for one query sentence.
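
The documented layout can be reproduced with plain Python, given the predicted-positive and gold-positive source IDs for one query (the real method returns a NumPy ndarray; the IDs here are illustrative):

```python
def confusion_2x2(predicted, gold, all_sources):
    # predicted / gold: sets of source-sentence IDs judged positive;
    # all_sources: every candidate source ID for this query.
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(all_sources - predicted - gold)
    # Same layout the method documents: [[TP, FP], [FN, TN]]
    return [[tp, fp], [fn, tn]]

m = confusion_2x2({"s1", "s2"}, {"s1", "s3"}, {"s1", "s2", "s3", "s4"})
```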

find_best_threshold

find_best_threshold(
    *,
    metric: str = "f1",
    thresholds: List[float] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Tuple[Dict[str, float], DataFrame]

Find the optimal probability threshold based on the given metric.
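
A threshold search of this kind reduces to sweeping candidate thresholds over the cached probabilities and keeping the best-scoring one. A minimal sketch optimizing F1 over illustrative data (the library's actual grid and tie-breaking may differ):

```python
# Cached (probability, is_true_match) pairs pooled over all queries.
scored = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]

def f1_at(threshold):
    tp = sum(1 for p, g in scored if p >= threshold and g)
    fp = sum(1 for p, g in scored if p >= threshold and not g)
    fn = sum(1 for p, g in scored if p < threshold and g)
    denom = 2 * tp + fp + fn          # F1 = 2*TP / (2*TP + FP + FN)
    return 2 * tp / denom if denom else 0.0

thresholds = [i / 10 for i in range(1, 10)]
best = max(thresholds, key=f1_at)     # first threshold with the highest F1
```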

evaluate_k_values

evaluate_k_values(
    *,
    k_values: List[int] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Dict[int, Dict[str, float]]

Evaluate metrics for different top_k values WITHOUT re-running the pipeline.
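
Re-evaluating at different k is pure post-processing over the cached ranked candidates: the ranking is truncated, nothing is re-scored. A sketch with illustrative candidate lists, using recall@k as the example metric:

```python
# Cached candidates for one query, already sorted by descending probability.
ranked = ["s7", "s2", "s9", "s1", "s4"]
gold = {"s2", "s4"}

def recall_at_k(candidates, gold, k):
    # Truncate the cached ranking at k -- no model inference is repeated.
    kept = set(candidates[:k])
    return len(kept & gold) / len(gold) if gold else 0.0

by_k = {k: recall_at_k(ranked, gold, k) for k in (1, 3, 5)}
```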