Getting Started¶
This guide will help you get started with LociSimiles for finding intertextual links in Latin literature.
Installation¶
Prerequisites¶
- Python 3.10 or higher
- pip package manager
Installing from PyPI¶
Installing from Source¶
Basic Concepts¶
Documents and Segments¶
LociSimiles works with Documents containing TextSegments. Each segment represents a unit of text (e.g., a verse, sentence, or passage).
from locisimiles import Document, TextSegment
# Create segments manually
segments = [
TextSegment(id="1", text="Arma virumque cano"),
TextSegment(id="2", text="Troiae qui primus ab oris"),
]
# Create a document
doc = Document(segments=segments)
Loading from CSV¶
Documents are typically loaded from CSV files:
The CSV should have columns for id and text (column names are configurable).
Pipelines¶
LociSimiles provides ready-to-use pipelines for detecting intertextual links. Each pipeline takes a query document and a source document and returns scored matches.
Two-Stage Pipeline¶
The recommended pipeline for most use cases. It first retrieves the most promising candidates using embedding similarity, then classifies each candidate pair with a fine-tuned transformer model.
from locisimiles import ClassificationPipelineWithCandidateGeneration
from locisimiles import Document
# Load documents
query = Document("query.csv")
source = Document("source.csv")
# Define pipeline
pipeline = ClassificationPipelineWithCandidateGeneration(
classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
device="cpu", # or "cuda", "mps"
)
# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
Classification Pipeline¶
Classifies every possible query–source pair using a fine-tuned sequence-classification model. More thorough but slower — best suited for smaller datasets.
from locisimiles import ClassificationPipeline
from locisimiles import Document
# Load documents
query = Document("query.csv")
source = Document("source.csv")
# Define pipeline
pipeline = ClassificationPipeline(
classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
device="cpu",
)
# Run pipeline
results = pipeline.run(query=query, source=source, batch_size=32)
Retrieval Pipeline¶
A fast, lightweight pipeline that ranks source segments by embedding similarity and applies a top-k or threshold criterion. No classification model needed.
from locisimiles import RetrievalPipeline
from locisimiles import Document
# Load documents
query = Document("query.csv")
source = Document("source.csv")
# Define pipeline
pipeline = RetrievalPipeline(
embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
device="cpu",
)
# Run pipeline
results = pipeline.run(query=query, source=source, top_k=5)
Rule-Based Pipeline¶
A purely lexical pipeline that does not require any neural models. It finds shared words between segments and applies distance, punctuation, and optional POS / similarity filters.
from locisimiles import RuleBasedPipeline
from locisimiles import Document
# Load documents
query = Document("query.csv")
source = Document("source.csv")
# Define pipeline
pipeline = RuleBasedPipeline(min_shared_words=2, max_distance=3)
# Run pipeline
results = pipeline.run(query=query, source=source)
Pipeline Summary¶
| Pipeline | Speed | Models required | Best for |
|---|---|---|---|
ClassificationPipelineWithCandidateGeneration |
Medium | Embedding + classifier | Most use cases |
ClassificationPipeline |
Slow | Classifier | Small datasets, exhaustive comparison |
RetrievalPipeline |
Fast | Embedding | Quick similarity search |
RuleBasedPipeline |
Fast | None | No GPU, lexical matching |
Saving Results¶
Save pipeline output to CSV or JSON:
results = pipeline.run(query=query, source=source, top_k=10)
# Save directly from the pipeline
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")
# Or use standalone utility functions
from locisimiles.pipeline import results_to_csv, results_to_json
results_to_csv(results, "results.csv")
results_to_json(results, "results.json")
Building Custom Pipelines¶
For advanced use cases you can compose your own pipeline from individual
generators and judges using the generic Pipeline class.
A generator selects candidate source segments for each query segment. A judge then scores or classifies each candidate pair.
Available Generators¶
| Generator | Description |
|---|---|
EmbeddingCandidateGenerator |
Semantic similarity via sentence transformers + ChromaDB |
ExhaustiveCandidateGenerator |
All pairs — no filtering |
RuleBasedCandidateGenerator |
Lexical matching + linguistic filters |
Available Judges¶
| Judge | Description |
|---|---|
ClassificationJudge |
Transformer sequence classification (P(positive)) |
ThresholdJudge |
Binary decisions from candidate scores (top-k or threshold) |
IdentityJudge |
Pass-through — judgment_score = 1.0 |
Example¶
from locisimiles import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge
pipeline = Pipeline(
generator=EmbeddingCandidateGenerator(device="cpu"),
judge=ClassificationJudge(device="cpu"),
)
results = pipeline.run(query=query_doc, source=source_doc, top_k=10)
# You can also run each stage separately
candidates = pipeline.generate_candidates(query=query_doc, source=source_doc, top_k=10)
results = pipeline.judge_candidates(query=query_doc, candidates=candidates)
Evaluation¶
Use the IntertextEvaluator to assess detection quality:
from locisimiles import IntertextEvaluator
evaluator = IntertextEvaluator(
predictions=predictions,
ground_truth=ground_truth
)
metrics = evaluator.evaluate()
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1: {metrics['f1']:.3f}")
Next Steps¶
- See the CLI Reference for command-line usage
- Explore the API Reference for detailed documentation
- Check out the examples for complete workflows