Examples¶
This section provides working examples demonstrating LociSimiles usage.
Sample Data¶
The examples use sample Latin texts:
- Hieronymus samples - Query texts from Jerome's writings
- Vergil samples - Source texts from Virgil's works
- Ground truth - Annotated intertextual links for evaluation
Quick Start Example¶
from locisimiles.document import Document
from locisimiles.evaluator import IntertextEvaluator
from locisimiles.pipeline import (
    ClassificationPipelineWithCandidateGeneration,
    pretty_print,
)
# Load example query and source documents
query_doc = Document("./hieronymus_samples.csv", author="Hieronymus")
source_doc = Document("./vergil_samples.csv", author="Vergil")
print("Loaded query and source documents:")
print(f"Query Document: {query_doc}")
print(f"Source Document: {source_doc}")
print("=" * 70)
# Load the pipeline with pre-trained models
pipeline_two_stage = ClassificationPipelineWithCandidateGeneration(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="mps",  # Apple Silicon GPU; use "cpu" or "cuda" on other hardware
)
# Run the pipeline with the query and source documents
results_two_stage = pipeline_two_stage.run(
    query=query_doc,    # Query document
    source=source_doc,  # Source document
    top_k=10,           # Number of top similar candidates to classify
)
print("\nResults of the two-stage pipeline run:")
pretty_print(results_two_stage)
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="./ground_truth.csv",
    pipeline=pipeline_two_stage,
    top_k=10,
    threshold=0.5,
)
print("\nSingle sentence:\n", evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))
print("\nPer-sentence head:\n", evaluator.evaluate_all_queries().head(20))
print("\nMacro scores:\n", evaluator.evaluate(average="macro", with_match_only=True))
print("\nMicro scores:\n", evaluator.evaluate(average="micro", with_match_only=True))
Jupyter Notebook¶
For an interactive walkthrough, see the example notebook.
The notebook covers:
- Loading Documents - Creating Document objects from CSV files
- Two-Stage Pipeline - Using retrieval + classification
- Finding Optimal Threshold - Automatic threshold tuning
- Evaluating Different K Values - Comparing top-k settings
- Classification-Only Pipeline - Exhaustive pairwise comparison
Two-Stage Pipeline¶
The recommended approach combines fast retrieval with accurate classification:
from locisimiles.pipeline import ClassificationPipelineWithCandidateGeneration
from locisimiles.document import Document
# Load documents
query_doc = Document("./hieronymus_samples.csv", author="Hieronymus")
source_doc = Document("./vergil_samples.csv", author="Vergil")
# Initialize pipeline with pre-trained models
pipeline = ClassificationPipelineWithCandidateGeneration(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",  # or "cuda", "mps"
)
# Run the pipeline
results = pipeline.run(
    query=query_doc,
    source=source_doc,
    top_k=10,  # Number of candidates per query
)
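To make the retrieve-then-classify flow concrete, here is a minimal, library-independent sketch. It does not use LociSimiles or its models: retrieval is approximated with bag-of-words cosine similarity, and `toy_classifier` is a hypothetical stand-in for the classification model. Only the control flow (shortlist `top_k` candidates, then score just those) mirrors the description above.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_classifier(query_text, source_text):
    """Hypothetical stage-two scorer: Jaccard overlap of the word sets."""
    q = set(query_text.lower().split())
    s = set(source_text.lower().split())
    return len(q & s) / len(q | s) if q | s else 0.0


def two_stage(queries, sources, top_k):
    results = {}
    for qid, qtext in queries.items():
        # Stage 1: rank every source segment by similarity, keep the top_k.
        ranked = sorted(
            sources.items(),
            key=lambda item: cosine(bow(qtext), bow(item[1])),
            reverse=True,
        )[:top_k]
        # Stage 2: run the (more expensive) scorer only on the shortlist.
        results[qid] = [(sid, toy_classifier(qtext, stext)) for sid, stext in ranked]
    return results


# Toy segments for illustration only.
queries = {"q1": "arma virumque cano"}
sources = {
    "s1": "arma virumque cano troiae qui primus ab oris",
    "s2": "tityre tu patulae recubans sub tegmine fagi",
    "s3": "cano arma",
}
results = two_stage(queries, sources, top_k=2)
for sid, score in results["q1"]:
    print(f"q1 -> {sid}: {score:.3f}")
```

With `top_k=2`, the unrelated segment `s2` never reaches the second stage, which is where the time saving comes from on larger source documents.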
Modular Pipeline (Recommended)¶
Build custom pipelines by composing a generator and a judge:
from locisimiles import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge
from locisimiles.document import Document
# Load documents
query_doc = Document("./hieronymus_samples.csv", author="Hieronymus")
source_doc = Document("./vergil_samples.csv", author="Vergil")
# Compose a custom pipeline
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(
        embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
        device="cpu",
    ),
    judge=ClassificationJudge(
        classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
        device="cpu",
    ),
)
# Run end-to-end
results = pipeline.run(query=query_doc, source=source_doc, top_k=10)
# Or run stages separately
candidates = pipeline.generate_candidates(query=query_doc, source=source_doc, top_k=10)
results = pipeline.judge_candidates(query=query_doc, candidates=candidates)
Custom Combinations¶
Mix and match generators and judges:
from locisimiles import Pipeline
from locisimiles.pipeline.generator import (
    EmbeddingCandidateGenerator,
    ExhaustiveCandidateGenerator,
    RuleBasedCandidateGenerator,
)
from locisimiles.pipeline.judge import (
    ClassificationJudge,
    ThresholdJudge,
    IdentityJudge,
)
# Retrieval + threshold (fast, no classifier needed)
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ThresholdJudge(top_k=5),
)

# Rule-based candidates + classification scoring
pipeline = Pipeline(
    generator=RuleBasedCandidateGenerator(min_shared_words=2),
    judge=ClassificationJudge(device="cpu"),
)

# Exhaustive + identity (all pairs, no filtering)
pipeline = Pipeline(
    generator=ExhaustiveCandidateGenerator(),
    judge=IdentityJudge(),
)
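The generator/judge split can be pictured as two small callables chained together. The sketch below is a hypothetical, pure-Python illustration of that composition pattern and is not the library's implementation: `shared_word_generator`, `identity_judge`, and `run` are invented names, and treating `min_shared_words` as a minimum count of shared whitespace-separated tokens is an assumption made for the example.

```python
def shared_word_generator(queries, sources, min_shared_words):
    """Yield (query_id, source_id, shared_count) for pairs sharing enough words.

    Assumption: "shared words" means identical lower-cased tokens.
    """
    for qid, qtext in queries.items():
        qwords = set(qtext.lower().split())
        for sid, stext in sources.items():
            shared = len(qwords & set(stext.lower().split()))
            if shared >= min_shared_words:
                yield qid, sid, shared


def identity_judge(candidates):
    """Pass every candidate through unchanged, grouped by query id."""
    grouped = {}
    for qid, sid, score in candidates:
        grouped.setdefault(qid, []).append((sid, score))
    return grouped


def run(generator, judge, queries, sources, **generator_kwargs):
    """Compose the two stages: generate candidates, then judge them."""
    return judge(generator(queries, sources, **generator_kwargs))


# Toy segments for illustration only.
queries = {"q1": "omnia vincit amor"}
sources = {
    "s1": "omnia vincit amor et nos cedamus amori",
    "s2": "amor patriae",
}
results = run(shared_word_generator, identity_judge, queries, sources, min_shared_words=2)
print(results)
```

Because each stage only depends on the shape of the data passed between them, either callable can be swapped without touching the other, which is the point of the mix-and-match design.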
Saving Results¶
Pipeline results can be saved to CSV or JSON directly from the pipeline object, or by using the standalone utility functions.
Save from the pipeline¶
# Run the pipeline
results = pipeline.run(query=query_doc, source=source_doc, top_k=10)
# Save to CSV (columns: query_id, source_id, source_text, candidate_score, judgment_score)
pipeline.to_csv("results.csv")
# Save to JSON (object keyed by query_id)
pipeline.to_json("results.json")
Save with explicit results¶
Standalone utility functions¶
If you don't have a pipeline instance, use the standalone functions:
from locisimiles.pipeline import results_to_csv, results_to_json
results_to_csv(results, "results.csv")
results_to_json(results, "results.json")
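Once results are on disk, they can be post-processed with the standard library alone. The sketch below assumes the CSV column names listed above (`query_id`, `source_id`, `source_text`, `candidate_score`, `judgment_score`); the rows are invented placeholders, an in-memory buffer stands in for `results.csv`, and `load_matches` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows using the documented column names; all values are invented.
sample_csv = io.StringIO(
    "query_id,source_id,source_text,candidate_score,judgment_score\n"
    "q1,s1,example source text,0.91,0.88\n"
    "q1,s2,another source text,0.75,0.12\n"
    "q2,s3,third source text,0.66,0.54\n"
)


def load_matches(handle, threshold):
    """Return (query_id, source_id, score) for rows at or above the threshold."""
    matches = []
    for row in csv.DictReader(handle):
        score = float(row["judgment_score"])
        if score >= threshold:
            matches.append((row["query_id"], row["source_id"], score))
    return matches


matches = load_matches(sample_csv, threshold=0.5)
for query_id, source_id, score in matches:
    print(f"{query_id} -> {source_id}: {score:.2f}")
```

To read a real file, replace the buffer with `open("results.csv", newline="", encoding="utf-8")`.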
Evaluation¶
Evaluate your results against ground truth annotations:
from locisimiles.evaluator import IntertextEvaluator
evaluator = IntertextEvaluator(
    query_doc=query_doc,
    source_doc=source_doc,
    ground_truth_csv="./ground_truth.csv",
    pipeline=pipeline,
    top_k=10,
    threshold=0.5,
)
# Evaluate a single query
print(evaluator.evaluate_single_query("hier. adv. iovin. 1.41"))
# Get metrics for all queries
print(evaluator.evaluate(average="macro"))
print(evaluator.evaluate(average="micro"))
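The `average` argument selects between two standard ways of aggregating per-query scores. The sketch below illustrates the general definitions with invented counts. It is not the evaluator's code, and how LociSimiles handles edge cases (for example, queries without any ground-truth link) is not covered here.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(counts):
    """Average the per-query metrics: every query carries the same weight."""
    per_query = [prf(*c) for c in counts]
    n = len(per_query)
    return tuple(sum(m[i] for m in per_query) / n for i in range(3))


def micro_average(counts):
    """Pool the counts over all queries first: every link carries the same weight."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return prf(tp, fp, fn)


# Invented (true positives, false positives, false negatives) for two queries.
counts = [(1, 0, 0), (1, 3, 1)]
macro = macro_average(counts)
micro = micro_average(counts)
print("macro P/R/F1:", [round(x, 3) for x in macro])
print("micro P/R/F1:", [round(x, 3) for x in micro])
```

The two averages diverge when queries differ in size: the second query contributes most of the errors, so it pulls the micro scores down more than the macro scores.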
Finding the Best Threshold¶
Automatically find the optimal probability threshold:
best_result, all_thresholds_df = evaluator.find_best_threshold(
    metric="f1",      # Optimize for F1 (or 'precision', 'recall', 'smr')
    average="micro",  # Use micro-averaging
)
print(f"Best threshold: {best_result['best_threshold']}")
print(f"Best F1 score: {best_result['best_f1']:.4f}")
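Conceptually, threshold tuning is a sweep: try each candidate threshold, compute the metric, and keep the best one. The following sketch shows that idea on invented scores and labels. The threshold grid, the tie-breaking rule, and the helper names `f1_at` and `sweep` are assumptions and need not match what `find_best_threshold` does internally.

```python
def f1_at(threshold, scored_pairs):
    """F1 when pairs with score >= threshold are predicted as matches."""
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def sweep(scored_pairs, thresholds):
    """Return (best_threshold, best_f1); ties go to the lowest threshold."""
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        score = f1_at(t, scored_pairs)
        if score > best_f1:
            best_threshold, best_f1 = t, score
    return best_threshold, best_f1


# Invented (classifier probability, ground-truth label) pairs.
scored_pairs = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]
thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best_threshold, best_f1 = sweep(scored_pairs, thresholds)
print(f"Best threshold: {best_threshold}, F1: {best_f1:.4f}")
```

A threshold tuned this way is fitted to the annotated sample it was computed on, so it is worth checking on held-out annotations before reusing it elsewhere.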
Classification-Only Pipeline¶
For smaller datasets, use exhaustive pairwise comparison:
from locisimiles.pipeline import ClassificationPipeline
pipeline_clf = ClassificationPipeline(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    device="cpu",
)
results = pipeline_clf.run(
    query=query_doc,
    source=source_doc,
    batch_size=32,
)
# Filter high-probability matches
threshold = 0.7
for query_id, judgments in results.items():
    high_prob = [j for j in judgments if j.judgment_score > threshold]
    if high_prob:
        print(f"Query {query_id}:")
        for j in high_prob:
            print(f" {j.segment.id}: P={j.judgment_score:.3f}")
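The "smaller datasets" advice follows from simple counting: exhaustive comparison scores every query segment against every source segment, whereas a two-stage run classifies only `top_k` candidates per query. A back-of-the-envelope sketch, with invented segment counts and hypothetical helper names:

```python
def exhaustive_pairs(n_queries, n_sources):
    """Number of classifier calls when every pair is scored."""
    return n_queries * n_sources


def two_stage_pairs(n_queries, n_sources, top_k):
    """Number of classifier calls when only top_k candidates per query are scored."""
    return n_queries * min(top_k, n_sources)


# Invented corpus sizes for illustration.
n_queries, n_sources, top_k = 200, 5000, 10
n_exhaustive = exhaustive_pairs(n_queries, n_sources)
n_two_stage = two_stage_pairs(n_queries, n_sources, top_k)
print(f"Exhaustive: {n_exhaustive} pairs")
print(f"Two-stage:  {n_two_stage} pairs")
```

The two-stage count leaves out the cost of embedding all segments for retrieval, so the real saving is somewhat smaller than the raw ratio suggests.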
Running the Examples¶
To run the examples locally:
Or open example.ipynb in Jupyter for the interactive version.