# Getting Started
This guide will help you get started with LociSimiles for finding intertextual links in Latin literature.
## Installation

### Prerequisites
- Python 3.10 or higher
- pip package manager
### Installing from PyPI
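Assuming the package is published on PyPI under the name `locisimiles` (matching the import name; verify on PyPI before relying on it):

```shell
pip install locisimiles
```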
### Installing from Source
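A typical editable install from a checkout; `<repository-url>` is a placeholder for the actual LociSimiles repository:

```shell
git clone <repository-url>
cd locisimiles
pip install -e .
```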
## Basic Concepts

### Documents and Segments

LociSimiles works with `Document` objects containing `TextSegment` objects. Each segment represents a unit of text (e.g., a verse, sentence, or passage).
```python
from locisimiles import Document, TextSegment

# Create segments manually
segments = [
    TextSegment(id="1", text="Arma virumque cano"),
    TextSegment(id="2", text="Troiae qui primus ab oris"),
]

# Create a document
doc = Document(segments=segments)
```
### Loading from CSV

Documents are typically loaded from CSV files. The CSV should have columns for `id` and `text` (column names are configurable).
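As an illustration of the expected layout, here is such a CSV parsed with only Python's standard `csv` module (the `Document` loader handles this for you; this sketch just shows the row shape):

```python
import csv
import io

# One segment per row, with an `id` column and a `text` column
raw = """id,text
1,Arma virumque cano
2,Troiae qui primus ab oris
"""

segments = list(csv.DictReader(io.StringIO(raw)))
print(segments[0]["text"])
```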
## Pipelines
LociSimiles provides ready-to-use pipelines for detecting intertextual links. Each pipeline takes a query document and a source document and returns scored matches.
### Two-Stage Pipeline
The recommended pipeline for most use cases. It first retrieves the most promising candidates using embedding similarity, then classifies each candidate pair with a fine-tuned transformer model.
```python
from locisimiles import ClassificationPipelineWithCandidateGeneration
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipelineWithCandidateGeneration(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",  # or "cuda", "mps"
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
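Conceptually, the two-stage design first narrows the search with a cheap similarity score and only then applies the expensive classifier to the survivors. A toy sketch with stand-in scoring functions (not the real models):

```python
import heapq

def embed_score(q, s):
    # Stand-in for embedding similarity: token-overlap ratio
    qt, st = set(q.split()), set(s.split())
    return len(qt & st) / max(len(qt | st), 1)

def classify(q, s):
    # Stand-in for the fine-tuned classifier: returns a pseudo P(intertext)
    return 0.9 if embed_score(q, s) > 0.3 else 0.1

query = "arma virumque cano"
sources = [
    "arma gravi numero violentaque bella parabam",
    "italiam fato profugus",
    "arma virumque cano troiae",
]

# Stage 1: keep only the top-k candidates by cheap similarity
candidates = heapq.nlargest(2, sources, key=lambda s: embed_score(query, s))

# Stage 2: classify just the surviving pairs
results = [(s, classify(query, s)) for s in candidates]
```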
### Classification Pipeline
Classifies every possible query–source pair using a fine-tuned sequence-classification model. More thorough but slower — best suited for smaller datasets.
```python
from locisimiles import ClassificationPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, batch_size=32)
```
### Retrieval Pipeline
A fast, lightweight pipeline that ranks source segments by embedding similarity and applies a top-k or threshold criterion. No classification model needed.
```python
from locisimiles import RetrievalPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=5)
```
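The ranking criterion at the heart of embedding retrieval is cosine similarity between segment vectors. A minimal sketch with toy vectors (a real run would obtain them from the embedding model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings for one query and two source segments
query_vec = [1.0, 0.0, 1.0]
source_vecs = {"s1": [1.0, 0.1, 0.9], "s2": [0.0, 1.0, 0.0]}

scores = {sid: cosine(query_vec, v) for sid, v in source_vecs.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```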
### Latin BERT Retrieval Pipeline (Gong-Style)
A contextual token-level retrieval pipeline using BERT. Computes word-level embeddings and scores segment pairs by their best token-token cosine similarity. Excels at detecting lexical echoes and word reuse.
```python
from locisimiles import Document, LatinBertRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = LatinBertRetrievalPipeline(
    model_name="ashleygong03/bamman-burns-latin-bert",  # or model_path for a local copy
    top_k=10,
    similarity_threshold=0.85,
    max_length=256,
    min_token_length=2,
    use_stopword_filter=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
```
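The "best token-token cosine similarity" scoring can be illustrated with toy contextual token vectors (stand-ins for the BERT embeddings): each query token is compared against each source token, and the pair score is the maximum.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy contextual token embeddings: {token: vector}
query_tokens = {"arma": [1.0, 0.0], "cano": [0.0, 1.0]}
source_tokens = {"arma": [0.9, 0.1], "bella": [0.5, 0.5]}

# Segment-pair score = best token-token cosine similarity
pair_score = max(
    cosine(qv, sv)
    for qv in query_tokens.values()
    for sv in source_tokens.values()
)
```

The high score here is driven by the near-identical "arma" vectors, which is why this style of scoring excels at lexical echoes.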
### Word2Vec Retrieval Pipeline (Burns-Style)
A retrieval-only pipeline using a local Word2Vec model and pair-aware bigram scoring.
```python
from locisimiles import Document, Word2VecRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = Word2VecRetrievalPipeline(
    model_path="./models/latin_w2v_bamman_lemma300_100_1.model",
    top_k=10,
    similarity_threshold=0.85,
    interval=2,
    order_free=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
```
Model path expectations:

- The path must reference a local gensim `.model` file.
- If no path is provided, LociSimiles checks `models/latin_w2v_bamman_lemma300_100_1.model`.
- The model is not downloaded automatically.
- Input text should already be lemmatized.
### Rule-Based Pipeline
A purely lexical pipeline that does not require any neural models. It finds shared words between segments and applies distance, punctuation, and optional POS / similarity filters.
```python
from locisimiles import RuleBasedPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RuleBasedPipeline(min_shared_words=2, max_distance=3)

# Run pipeline
results = pipeline.run(query=query, source=source)
```
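One plausible interpretation of `min_shared_words` and `max_distance`, sketched without the real implementation (the actual filters also involve punctuation and optional POS/similarity checks):

```python
def shared_word_match(query_seg, source_seg, min_shared_words=2, max_distance=3):
    """True if the segments share enough words and at least two of the
    shared words sit within `max_distance` tokens of each other."""
    q_tokens = query_seg.lower().split()
    s_tokens = set(source_seg.lower().split())
    positions = [i for i, t in enumerate(q_tokens) if t in s_tokens]
    if len(positions) < min_shared_words:
        return False
    return any(positions[k + 1] - positions[k] <= max_distance
               for k in range(len(positions) - 1))

hit = shared_word_match("arma virumque cano", "cano arma troiae")
miss = shared_word_match("arma virumque cano", "italiam fato profugus")
```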
## Pipeline Summary

| Pipeline | Speed | Models required | Best for |
|---|---|---|---|
| `ClassificationPipelineWithCandidateGeneration` | Medium | Embedding + classifier | Most use cases |
| `ClassificationPipeline` | Slow | Classifier | Small datasets, exhaustive comparison |
| `RetrievalPipeline` | Fast | Embedding | Quick similarity search |
| `Word2VecRetrievalPipeline` | Fast | Local Word2Vec `.model` | Lightweight Burns-style retrieval |
| `RuleBasedPipeline` | Fast | None | No GPU, lexical matching |
## Saving Results
Save pipeline output to CSV or JSON:
```python
results = pipeline.run(query=query, source=source, top_k=10)

# Save directly from the pipeline
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

# Or use standalone utility functions
from locisimiles.pipeline import results_to_csv, results_to_json

results_to_csv(results, "results.csv")
results_to_json(results, "results.json")
```
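If you need a custom export, the same data can be written with the standard library alone. The result shape below (one row per scored query-source pair) is hypothetical, chosen only to make the sketch concrete:

```python
import csv
import json

# Hypothetical result rows: one per scored query-source pair
results = [
    {"query_id": "1", "source_id": "7", "score": 0.93},
    {"query_id": "2", "source_id": "3", "score": 0.41},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query_id", "source_id", "score"])
    writer.writeheader()
    writer.writerows(results)

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```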
## Building Custom Pipelines

For advanced use cases you can compose your own pipeline from individual generators and judges using the generic `Pipeline` class.

A generator selects candidate source segments for each query segment. A judge then scores or classifies each candidate pair.
### Available Generators

| Generator | Description |
|---|---|
| `EmbeddingCandidateGenerator` | Semantic similarity via sentence transformers + ChromaDB |
| `ExhaustiveCandidateGenerator` | All pairs; no filtering |
| `RuleBasedCandidateGenerator` | Lexical matching + linguistic filters |
| `Word2VecCandidateGenerator` | Pair-aware Word2Vec bigram retrieval |
### Available Judges

| Judge | Description |
|---|---|
| `ClassificationJudge` | Transformer sequence classification (P(positive)) |
| `ThresholdJudge` | Binary decisions from candidate scores (top-k or threshold) |
| `IdentityJudge` | Pass-through (`judgment_score = 1.0`) |
### Example
```python
from locisimiles import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge

# query_doc and source_doc are Documents loaded as in the examples above
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)

results = pipeline.run(query=query_doc, source=source_doc, top_k=10)

# You can also run each stage separately
candidates = pipeline.generate_candidates(query=query_doc, source=source_doc, top_k=10)
results = pipeline.judge_candidates(query=query_doc, candidates=candidates)
```
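The generator/judge split can be sketched with plain functions. The triple shape `(query_id, source_id, score)` is an assumption made for illustration; the real classes wrap models, but the data flow is the same idea: generate candidates, then re-score them.

```python
def exhaustive_generator(query_ids, source_ids):
    # Emit every pair with a neutral candidate score
    return [(q, s, 1.0) for q in query_ids for s in source_ids]

def toy_judge(candidates):
    # Stand-in for a model-based judge: 1.0 when the ids match, else 0.0
    return [(q, s, 1.0 if q == s else 0.0) for q, s, _ in candidates]

candidates = exhaustive_generator(["q1", "q2"], ["q1", "s2"])
judged = toy_judge(candidates)
```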
## Evaluation

Use the `IntertextEvaluator` to assess detection quality:
```python
from locisimiles import IntertextEvaluator

evaluator = IntertextEvaluator(
    predictions=predictions,
    ground_truth=ground_truth,
)

metrics = evaluator.evaluate()
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1: {metrics['f1']:.3f}")
```
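The reported metrics follow the standard definitions. Computed by hand from sets of (query_id, source_id) pairs (the evaluator's exact input format may differ; this just shows the arithmetic):

```python
predictions = {("1", "7"), ("2", "3"), ("2", "9")}
ground_truth = {("1", "7"), ("2", "3"), ("4", "5")}

tp = len(predictions & ground_truth)          # true positives
precision = tp / len(predictions)             # fraction of predictions that are correct
recall = tp / len(ground_truth)               # fraction of true links that were found
f1 = 2 * precision * recall / (precision + recall)
```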
## Next Steps
- See the CLI Reference for command-line usage
- Explore the API Reference for detailed documentation
- Check out the examples for complete workflows
## GUI Quick Flow (Word2Vec)

1. Start the GUI with `locisimiles-gui`.
2. Upload query/source CSV files (`seg_id,text`).
3. In Pipeline Configuration, choose Word2Vec Retrieval (Burns-Style).
4. Set a valid local Word2Vec `.model` path.
5. Configure Bigram Interval and Order-Free Bigrams.
6. Run processing and inspect thresholded matches in the results step.