# Getting Started
This guide will help you get started with LociSimiles for finding intertextual links in Latin literature.
## Installation

### Prerequisites
- Python 3.10 or higher
- pip package manager
### Installing from PyPI
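Assuming the package is published on PyPI under the name `locisimiles` (matching the import name; verify on PyPI before relying on it):

```shell
pip install locisimiles
```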
### Installing from Source
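A typical editable install from a checkout; `<repository-url>` is a placeholder for the actual LociSimiles repository:

```shell
git clone <repository-url>
cd locisimiles
pip install -e .
```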
## Basic Concepts

### Documents and Segments

LociSimiles works with `Document` objects containing `TextSegment` objects. Each segment represents a unit of text (e.g., a verse, sentence, or passage).
```python
from locisimiles import Document, TextSegment

# Create segments manually
segments = [
    TextSegment(id="1", text="Arma virumque cano"),
    TextSegment(id="2", text="Troiae qui primus ab oris"),
]

# Create a document
doc = Document(segments=segments)
```
### Loading from CSV

Documents are typically loaded from CSV files. The CSV should have columns for `id` and `text` (column names are configurable).
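As an illustration of the expected layout, here is such a CSV parsed with only Python's standard `csv` module (the `Document` loader handles this for you; this sketch just shows the row shape):

```python
import csv
import io

# One segment per row, with an `id` column and a `text` column
raw = """id,text
1,Arma virumque cano
2,Troiae qui primus ab oris
"""

segments = list(csv.DictReader(io.StringIO(raw)))
print(segments[0]["text"])
```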
## Pipelines
LociSimiles provides ready-to-use pipelines for detecting intertextual links. Each pipeline takes a query document and a source document and returns scored matches.
### Two-Stage Pipeline
The recommended pipeline for most use cases. It first retrieves the most promising candidates using embedding similarity, then classifies each candidate pair with a fine-tuned transformer model.
```python
from locisimiles import ClassificationPipelineWithCandidateGeneration
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipelineWithCandidateGeneration(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",  # or "cuda", "mps"
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
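Conceptually, the two-stage design first narrows the search with a cheap similarity score and only then applies the expensive classifier to the survivors. A toy sketch with stand-in scoring functions (not the real models):

```python
import heapq

def embed_score(q, s):
    # Stand-in for embedding similarity: token-overlap ratio
    qt, st = set(q.split()), set(s.split())
    return len(qt & st) / max(len(qt | st), 1)

def classify(q, s):
    # Stand-in for the fine-tuned classifier: returns a pseudo P(intertext)
    return 0.9 if embed_score(q, s) > 0.3 else 0.1

query = "arma virumque cano"
sources = [
    "arma gravi numero violentaque bella parabam",
    "italiam fato profugus",
    "arma virumque cano troiae",
]

# Stage 1: keep only the top-k candidates by cheap similarity
candidates = heapq.nlargest(2, sources, key=lambda s: embed_score(query, s))

# Stage 2: classify just the surviving pairs
results = [(s, classify(query, s)) for s in candidates]
```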
### Classification Pipeline
Classifies every possible query–source pair using a fine-tuned sequence-classification model. More thorough but slower — best suited for smaller datasets.
```python
from locisimiles import ClassificationPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, batch_size=32)
```
### Retrieval Pipeline
A fast, lightweight pipeline that ranks source segments by embedding similarity and applies a top-k or threshold criterion. No classification model needed.
```python
from locisimiles import RetrievalPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=5)
```
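The ranking criterion at the heart of embedding retrieval is cosine similarity between segment vectors. A minimal sketch with toy vectors (a real run would obtain them from the embedding model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings for one query and two source segments
query_vec = [1.0, 0.0, 1.0]
source_vecs = {"s1": [1.0, 0.1, 0.9], "s2": [0.0, 1.0, 0.0]}

scores = {sid: cosine(query_vec, v) for sid, v in source_vecs.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```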
### Latin BERT Retrieval Pipeline (Gong-Style)
A contextual token-level retrieval pipeline using BERT. Computes word-level embeddings and scores segment pairs by their best token-token cosine similarity. Excels at detecting lexical echoes and word reuse.
```python
from locisimiles import Document, LatinBertRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = LatinBertRetrievalPipeline(
    model_name="ashleygong03/bamman-burns-latin-bert",  # or model_path for a local copy
    top_k=10,
    similarity_threshold=0.85,
    max_length=256,
    min_token_length=2,
    use_stopword_filter=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
```
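The "best token-token cosine similarity" scoring can be illustrated with toy contextual token vectors (stand-ins for the BERT embeddings): each query token is compared against each source token, and the pair score is the maximum.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy contextual token embeddings: {token: vector}
query_tokens = {"arma": [1.0, 0.0], "cano": [0.0, 1.0]}
source_tokens = {"arma": [0.9, 0.1], "bella": [0.5, 0.5]}

# Segment-pair score = best token-token cosine similarity
pair_score = max(
    cosine(qv, sv)
    for qv in query_tokens.values()
    for sv in source_tokens.values()
)
```

The high score here is driven by the near-identical "arma" vectors, which is why this style of scoring excels at lexical echoes.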
### Word2Vec Retrieval Pipeline (Burns-Style)
A retrieval-only pipeline using a local Word2Vec model and pair-aware bigram scoring.
```python
from locisimiles import Document, Word2VecRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = Word2VecRetrievalPipeline(
    model_path="./models/latin_w2v_bamman_lemma300_100_1.model",
    top_k=10,
    similarity_threshold=0.85,
    interval=2,
    order_free=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
```
Model path expectations:

- The path must reference a local gensim `.model` file.
- If no path is provided, LociSimiles checks `models/latin_w2v_bamman_lemma300_100_1.model`.
- The model is not downloaded automatically.
- Input text should already be lemmatized.
### Rule-Based Pipeline
A purely lexical pipeline that does not require any neural models. It finds shared words between segments and applies distance, punctuation, and optional POS / similarity filters.
```python
from locisimiles import RuleBasedPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RuleBasedPipeline(min_shared_words=2, max_distance=3)

# Run pipeline
results = pipeline.run(query=query, source=source)
```
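One plausible interpretation of `min_shared_words` and `max_distance`, sketched without the real implementation (the actual filters also involve punctuation and optional POS/similarity checks):

```python
def shared_word_match(query_seg, source_seg, min_shared_words=2, max_distance=3):
    """True if the segments share enough words and at least two of the
    shared words sit within `max_distance` tokens of each other."""
    q_tokens = query_seg.lower().split()
    s_tokens = set(source_seg.lower().split())
    positions = [i for i, t in enumerate(q_tokens) if t in s_tokens]
    if len(positions) < min_shared_words:
        return False
    return any(positions[k + 1] - positions[k] <= max_distance
               for k in range(len(positions) - 1))

hit = shared_word_match("arma virumque cano", "cano arma troiae")
miss = shared_word_match("arma virumque cano", "italiam fato profugus")
```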
## Pipeline Summary

| Pipeline | Speed | Models required | Best for |
|---|---|---|---|
| `ClassificationPipelineWithCandidateGeneration` | Medium | Embedding + classifier | Most use cases |
| `ClassificationPipeline` | Slow | Classifier | Small datasets, exhaustive comparison |
| `RetrievalPipeline` | Fast | Embedding | Quick similarity search |
| `Word2VecRetrievalPipeline` | Fast | Local Word2Vec `.model` | Lightweight Burns-style retrieval |
| `RuleBasedPipeline` | Fast | None | No GPU, lexical matching |
## Saving Results
Save pipeline output to CSV or JSON:
```python
results = pipeline.run(query=query, source=source, top_k=10)

# Save directly from the pipeline
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

# Or use standalone utility functions
from locisimiles.pipeline import results_to_csv, results_to_json

results_to_csv(results, "results.csv")
results_to_json(results, "results.json")
```
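If you need a custom export, the same data can be written with the standard library alone. The result shape below (one row per scored query-source pair) is hypothetical, chosen only to make the sketch concrete:

```python
import csv
import json

# Hypothetical result rows: one per scored query-source pair
results = [
    {"query_id": "1", "source_id": "7", "score": 0.93},
    {"query_id": "2", "source_id": "3", "score": 0.41},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query_id", "source_id", "score"])
    writer.writeheader()
    writer.writerows(results)

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```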
## Building Custom Pipelines

For advanced use cases you can compose your own pipeline from individual generators and judges using the generic `Pipeline` class.

A generator selects candidate source segments for each query segment. A judge then scores or classifies each candidate pair.
### Available Generators

| Generator | Description |
|---|---|
| `EmbeddingCandidateGenerator` | Semantic similarity via sentence transformers + ChromaDB |
| `ExhaustiveCandidateGenerator` | All pairs; no filtering |
| `RuleBasedCandidateGenerator` | Lexical matching + linguistic filters |
| `Word2VecCandidateGenerator` | Pair-aware Word2Vec bigram retrieval |
### Available Judges

| Judge | Description |
|---|---|
| `ClassificationJudge` | Transformer sequence classification (P(positive)) |
| `ThresholdJudge` | Binary decisions from candidate scores (top-k or threshold) |
| `IdentityJudge` | Pass-through (`judgment_score = 1.0`) |
### Example
```python
from locisimiles import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge

# query_doc and source_doc are Documents loaded as in the examples above
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)

results = pipeline.run(query=query_doc, source=source_doc, top_k=10)

# You can also run each stage separately
candidates = pipeline.generate_candidates(query=query_doc, source=source_doc, top_k=10)
results = pipeline.judge_candidates(query=query_doc, candidates=candidates)
```
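The generator/judge split can be sketched with plain functions. The triple shape `(query_id, source_id, score)` is an assumption made for illustration; the real classes wrap models, but the data flow is the same idea: generate candidates, then re-score them.

```python
def exhaustive_generator(query_ids, source_ids):
    # Emit every pair with a neutral candidate score
    return [(q, s, 1.0) for q in query_ids for s in source_ids]

def toy_judge(candidates):
    # Stand-in for a model-based judge: 1.0 when the ids match, else 0.0
    return [(q, s, 1.0 if q == s else 0.0) for q, s, _ in candidates]

candidates = exhaustive_generator(["q1", "q2"], ["q1", "s2"])
judged = toy_judge(candidates)
```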
## Evaluation

Use the `IntertextEvaluator` to assess detection quality:
```python
from locisimiles import IntertextEvaluator

evaluator = IntertextEvaluator(
    predictions=predictions,
    ground_truth=ground_truth,
)

metrics = evaluator.evaluate()
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1: {metrics['f1']:.3f}")
```
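The reported metrics follow the standard definitions. Computed by hand from sets of (query_id, source_id) pairs (the evaluator's exact input format may differ; this just shows the arithmetic):

```python
predictions = {("1", "7"), ("2", "3"), ("2", "9")}
ground_truth = {("1", "7"), ("2", "3"), ("4", "5")}

tp = len(predictions & ground_truth)          # true positives
precision = tp / len(predictions)             # fraction of predictions that are correct
recall = tp / len(ground_truth)               # fraction of true links that were found
f1 = 2 * precision * recall / (precision + recall)
```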
## Next Steps
- See the CLI Reference for command-line usage
- Explore the API Reference for detailed documentation
- Check out the examples for complete workflows
## GUI Quick Flow (Word2Vec)

1. Start the GUI with `locisimiles-gui`.
2. Upload query/source CSV files (`seg_id,text`).
3. In Pipeline Configuration, choose Word2Vec Retrieval (Burns-Style).
4. Set a valid local Word2Vec `.model` path.
5. Configure Bigram Interval and Order-Free Bigrams.
6. Run processing and inspect thresholded matches in the results step.