Evaluator Module

Tools for assessing detection quality.

IntertextEvaluator

Evaluate detection results against ground truth annotations.

Computes precision, recall, F1, and other metrics for intertextual link detection.

locisimiles.evaluator.IntertextEvaluator

IntertextEvaluator(
    *,
    query_doc: Document,
    source_doc: Document,
    ground_truth_csv: str | DataFrame,
    pipeline: Pipeline,
    top_k: int = 5,
    threshold: float | str = "auto",
    auto_threshold_metric: str = "smr",
)

Evaluator for measuring intertextuality detection performance.

This class computes sentence-level and document-level evaluation metrics by comparing pipeline predictions against ground truth annotations.

Supported metrics
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1: Harmonic mean of precision and recall
  • SMR: Source Match Rate (an error rate; lower is better)
  • Accuracy: (TP + TN) / Total
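
The standard metrics above follow directly from confusion counts. A minimal plain-Python sketch, independent of the library (SMR's exact formula is not spelled out in this reference, so it is omitted here):

```python
def precision(tp, fp):
    # TP / (TP + FP); defined as 0.0 when nothing was predicted positive
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # TP / (TP + FN); defined as 0.0 when there are no gold positives
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def accuracy(tp, tn, fp, fn):
    # (TP + TN) / Total
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```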

The evaluator runs the pipeline once during initialization and caches the results for efficient metric computation across different thresholds.
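
The benefit of this cache-once design can be illustrated in plain Python: model scoring happens a single time, after which any number of thresholds is just a cheap filter over the stored probabilities. Source IDs and scores below are illustrative, not real pipeline output:

```python
# Hypothetical cached scores for one query: (source_id, probability) pairs
# produced by a single (expensive) pipeline run.
cached = [("verg. aen. 4.173", 0.91), ("verg. aen. 1.1", 0.42),
          ("verg. ecl. 1.1", 0.08)]

def positives(predictions, threshold):
    # Re-applying a threshold filters cached probabilities --
    # no model inference is repeated.
    return [sid for sid, prob in predictions if prob >= threshold]

strict = positives(cached, 0.9)   # only high-confidence matches
lenient = positives(cached, 0.3)  # also admits weaker candidates
```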

Attributes
query_doc

The query document being analyzed.

source_doc

The source document containing potential quotation origins.

predictions

Cached pipeline predictions (CandidateJudgeOutput format).

TYPE: CandidateJudgeOutput

threshold

Probability threshold for positive classification.

gold_labels

Ground truth annotations loaded from CSV.

Example
from locisimiles.evaluator import IntertextEvaluator
from locisimiles.pipeline import TwoStagePipeline
from locisimiles.document import Document

# Load documents
query_doc = Document("hieronymus.csv")
source_doc = Document("vergil.csv")

# Initialize pipeline
pipeline = TwoStagePipeline(device="cpu")

# Create evaluator with auto-threshold
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="ground_truth.csv",
    pipeline=pipeline,
    top_k=10,
    threshold="auto",  # Automatically find best threshold
    auto_threshold_metric="smr",
)

# Get evaluation metrics
print(evaluator.evaluate(average="micro"))
print(evaluator.evaluate(average="macro"))

# Evaluate single query
print(evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))

# Find optimal threshold for different metrics
best, all_thresholds = evaluator.find_best_threshold(metric="f1")
print(f"Best F1 at threshold {best['best_threshold']}: {best['best_f1']:.3f}")

evaluate_single_query

evaluate_single_query(query_id: str) -> Dict[str, float]

Compute metrics for one query sentence.

query_ids_with_match

query_ids_with_match() -> List[str]

Return query IDs that have ground truth labels.

evaluate_all_queries

evaluate_all_queries(
    with_match_only: bool = False,
) -> DataFrame

Compute metrics for every query sentence (cached).

evaluate

evaluate(
    *, average: str = "macro", with_match_only: bool = False
) -> Dict[str, float]

Compute aggregated metrics across queries.

  • Precision, Recall, F1, Accuracy: ALWAYS computed on queries with at least one ground truth match (otherwise these metrics are meaningless).
  • FPR, FNR, SMR: Computed on ALL queries by default (measures false alarms on queries that shouldn't have matches). If with_match_only=True, these are also restricted to queries with matches.
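
The difference between the two averaging modes can be sketched in plain Python: micro pools the confusion counts over all queries before computing the metric once, while macro computes the metric per query and takes the unweighted mean. The per-query counts below are illustrative:

```python
# Per-query confusion counts as (TP, FP, FN) tuples. Illustrative numbers.
per_query = [(3, 1, 0), (0, 0, 2), (1, 1, 1)]

def prec(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

# Micro: sum counts across queries first, then compute once.
tp = sum(q[0] for q in per_query)
fp = sum(q[1] for q in per_query)
micro_precision = prec(tp, fp)

# Macro: compute per query, then average the per-query scores.
macro_precision = sum(prec(t, f) for t, f, _ in per_query) / len(per_query)
```

Micro weights every decision equally, so queries with many candidates dominate; macro weights every query equally regardless of size.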

confusion_matrix

confusion_matrix(query_id: str) -> ndarray

Return 2x2 confusion matrix [[TP,FP],[FN,TN]] for one query sentence.
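
The documented layout can be reproduced with plain Python, given the predicted-positive and gold-positive source IDs for one query (the real method returns a NumPy ndarray; the IDs here are illustrative):

```python
def confusion_2x2(predicted, gold, all_sources):
    # predicted / gold: sets of source-sentence IDs judged positive;
    # all_sources: every candidate source ID for this query.
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(all_sources - predicted - gold)
    # Same layout the method documents: [[TP, FP], [FN, TN]]
    return [[tp, fp], [fn, tn]]

m = confusion_2x2({"s1", "s2"}, {"s1", "s3"}, {"s1", "s2", "s3", "s4"})
```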

find_best_threshold

find_best_threshold(
    *,
    metric: str = "f1",
    thresholds: List[float] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Tuple[Dict[str, float], DataFrame]

Find the optimal probability threshold based on the given metric.
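
A threshold search of this kind reduces to sweeping candidate thresholds over the cached probabilities and keeping the best-scoring one. A minimal sketch optimizing F1 over illustrative data (the library's actual grid and tie-breaking may differ):

```python
# Cached (probability, is_true_match) pairs pooled over all queries.
scored = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]

def f1_at(threshold):
    tp = sum(1 for p, g in scored if p >= threshold and g)
    fp = sum(1 for p, g in scored if p >= threshold and not g)
    fn = sum(1 for p, g in scored if p < threshold and g)
    denom = 2 * tp + fp + fn          # F1 = 2*TP / (2*TP + FP + FN)
    return 2 * tp / denom if denom else 0.0

thresholds = [i / 10 for i in range(1, 10)]
best = max(thresholds, key=f1_at)     # first threshold with the highest F1
```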

evaluate_k_values

evaluate_k_values(
    *,
    k_values: List[int] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Dict[int, Dict[str, float]]

Evaluate metrics for different top_k values WITHOUT re-running the pipeline.
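
Re-evaluating at different k is pure post-processing over the cached ranked candidates: the ranking is truncated, nothing is re-scored. A sketch with illustrative candidate lists, using recall@k as the example metric:

```python
# Cached candidates for one query, already sorted by descending probability.
ranked = ["s7", "s2", "s9", "s1", "s4"]
gold = {"s2", "s4"}

def recall_at_k(candidates, gold, k):
    # Truncate the cached ranking at k -- no model inference is repeated.
    kept = set(candidates[:k])
    return len(kept & gold) / len(gold) if gold else 0.0

by_k = {k: recall_at_k(ranked, gold, k) for k in (1, 3, 5)}
```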