Evaluator Module¶
Tools for assessing detection quality.
IntertextEvaluator¶
Evaluate detection results against ground truth annotations.
Computes precision, recall, F1, and other metrics for intertextual link detection.
locisimiles.evaluator.IntertextEvaluator
¶
```python
IntertextEvaluator(
    *,
    query_doc: Document,
    source_doc: Document,
    ground_truth_csv: str | DataFrame,
    pipeline: Pipeline,
    top_k: int = 5,
    threshold: float | str = "auto",
    auto_threshold_metric: str = "smr",
)
```
Evaluator for measuring intertextuality detection performance.
This class computes sentence-level and document-level evaluation metrics by comparing pipeline predictions against ground truth annotations.
Supported metrics
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1: Harmonic mean of precision and recall
- SMR: Source Match Rate (an error rate; lower is better)
- Accuracy: (TP + TN) / Total
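For reference, the formulas above can be written out in plain Python. This is an illustration of the definitions only, not the library's implementation; the SMR formula is library-specific and omitted here:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard detection metrics from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# 8 true links found, 2 false alarms, 4 links missed, 86 correct rejections.
scores = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```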
The evaluator runs the pipeline once during initialization and caches the results for efficient metric computation across different thresholds.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `query_doc` | The query document being analyzed. |
| `source_doc` | The source document containing potential quotation origins. |
| `predictions` | Cached pipeline predictions (`CandidateJudgeOutput` format). |
| `threshold` | Probability threshold for positive classification. |
| `gold_labels` | Ground truth annotations loaded from CSV. |
Example

```python
from locisimiles.evaluator import IntertextEvaluator
from locisimiles.pipeline import TwoStagePipeline
from locisimiles.document import Document

# Load documents
query_doc = Document("hieronymus.csv")
source_doc = Document("vergil.csv")

# Initialize pipeline
pipeline = TwoStagePipeline(device="cpu")

# Create evaluator with auto-threshold
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="ground_truth.csv",
    pipeline=pipeline,
    top_k=10,
    threshold="auto",  # Automatically find the best threshold
    auto_threshold_metric="smr",
)

# Get evaluation metrics
print(evaluator.evaluate(average="micro"))
print(evaluator.evaluate(average="macro"))

# Evaluate a single query
print(evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))

# Find the optimal threshold for a given metric
best, all_thresholds = evaluator.find_best_threshold(metric="f1")
print(f"Best F1 at threshold {best['best_threshold']}: {best['best_f1']:.3f}")
```
evaluate_single_query
¶
Compute metrics for one query sentence.
query_ids_with_match
¶
Return query IDs that have ground truth labels.
evaluate_all_queries
¶
Compute metrics for every query sentence (cached).
evaluate
¶
Compute aggregated metrics across queries.
- Precision, Recall, F1, Accuracy: always computed only on queries with at least one ground truth match (on queries without matches these metrics are undefined).
- FPR, FNR, SMR: computed on all queries by default, which also measures false alarms on queries that should have no matches. With `with_match_only=True`, these too are restricted to queries with matches.
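The difference between the two averaging modes can be sketched in plain Python (the counts below are hypothetical, purely for illustration):

```python
# Hypothetical per-query confusion counts: (TP, FP) for three queries.
per_query = [(3, 1), (0, 2), (5, 0)]

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

# Micro-average: pool the counts across queries, then compute once.
# Queries with many candidates dominate the result.
micro = precision(sum(tp for tp, _ in per_query),
                  sum(fp for _, fp in per_query))

# Macro-average: compute the metric per query, then take the unweighted mean.
# Every query counts equally, however many candidates it has.
macro = sum(precision(tp, fp) for tp, fp in per_query) / len(per_query)
```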
confusion_matrix
¶
Return the 2x2 confusion matrix `[[TP, FP], [FN, TN]]` for one query sentence.
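A minimal sketch of how such a per-query matrix can be derived from per-source probabilities and a gold set. The data shapes and variable names here are assumptions for illustration, not the library's internal representation:

```python
def query_confusion_matrix(scores: dict[str, float], gold: set[str],
                           threshold: float) -> list[list[int]]:
    """Build [[TP, FP], [FN, TN]] for a single query sentence."""
    tp = fp = fn = tn = 0
    for source_id, prob in scores.items():
        predicted = prob >= threshold
        relevant = source_id in gold
        if predicted and relevant:
            tp += 1
        elif predicted:
            fp += 1
        elif relevant:
            fn += 1
        else:
            tn += 1
    return [[tp, fp], [fn, tn]]

matrix = query_confusion_matrix(
    scores={"verg. aen. 1.1": 0.92, "verg. aen. 2.3": 0.40, "verg. ecl. 4.6": 0.75},
    gold={"verg. aen. 1.1"},
    threshold=0.5,
)  # [[1, 1], [0, 1]]
```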
find_best_threshold
¶
```python
find_best_threshold(
    *,
    metric: str = "f1",
    thresholds: List[float] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Tuple[Dict[str, float], DataFrame]
```
Find the optimal probability threshold based on the given metric.
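Because the pipeline predictions are cached, a threshold search is just a sweep over the stored probabilities. A simplified sketch of that idea, maximizing micro-F1 (the data shapes and names are assumptions, not the library's internals):

```python
def sweep_thresholds(scores: dict, gold: dict, thresholds: list) -> dict:
    """Pick the threshold with the highest micro-F1 over cached scores.
    scores: {query_id: {source_id: prob}}; gold: {query_id: set of source_ids}."""
    best = None
    for t in thresholds:
        tp = fp = fn = 0
        for qid, src_probs in scores.items():
            predicted = {s for s, p in src_probs.items() if p >= t}
            tp += len(predicted & gold[qid])
            fp += len(predicted - gold[qid])
            fn += len(gold[qid] - predicted)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if best is None or f1 > best["best_f1"]:
            best = {"best_threshold": t, "best_f1": f1}
    return best

scores = {"q1": {"a": 0.9, "b": 0.3}, "q2": {"a": 0.6, "b": 0.7}}
gold = {"q1": {"a"}, "q2": {"b"}}
best = sweep_thresholds(scores, gold, thresholds=[0.2, 0.5, 0.8])
```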
evaluate_k_values
¶
```python
evaluate_k_values(
    *,
    k_values: List[int] | None = None,
    average: str = "micro",
    with_match_only: bool = False,
) -> Dict[int, Dict[str, float]]
```
Evaluate metrics for different `top_k` values without re-running the pipeline.
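The reason no re-run is needed: the ranked candidate lists are already cached, so restricting predictions to a smaller top-k is just a slice of existing data. A simplified sketch (names and shapes are illustrative assumptions):

```python
def recall_at_k(ranked: dict[str, list[str]], gold: dict[str, set[str]],
                k_values: list[int]) -> dict[int, float]:
    """Micro-recall when predictions are cut to the top-k cached candidates."""
    out = {}
    for k in k_values:
        tp = fn = 0
        for qid, candidates in ranked.items():
            predicted = set(candidates[:k])   # just a slice — no model calls
            tp += len(predicted & gold[qid])
            fn += len(gold[qid] - predicted)
        out[k] = tp / (tp + fn) if tp + fn else 0.0
    return out

ranked = {"q1": ["a", "b", "c"]}          # best candidate first
gold = {"q1": {"a", "c"}}
recall = recall_at_k(ranked, gold, k_values=[1, 3])  # {1: 0.5, 3: 1.0}
```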