
Getting Started

This guide will help you get started with LociSimiles for finding intertextual links in Latin literature.

Installation

Prerequisites

  • Python 3.10 or higher
  • pip package manager

Installing from PyPI

pip install locisimiles

Installing from Source

git clone https://github.com/julianschelb/locisimiles.git
cd locisimiles
pip install -e ".[dev]"

Basic Concepts

Documents and Segments

LociSimiles works with Documents containing TextSegments. Each segment represents a unit of text (e.g., a verse, sentence, or passage).

from locisimiles import Document, TextSegment

# Create segments manually
segments = [
    TextSegment(id="1", text="Arma virumque cano"),
    TextSegment(id="2", text="Troiae qui primus ab oris"),
]

# Create a document
doc = Document(segments=segments)

Loading from CSV

Documents are typically loaded from CSV files:

doc = Document.from_csv("texts.csv")

The CSV should have columns for id and text (column names are configurable).
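For reference, a file in that shape can be sketched with the standard library alone (the column names `id` and `text` below are the defaults described above; this parses the rows the way a loader conceptually would, without using LociSimiles itself):

```python
import csv
import io

# A minimal CSV in the shape described above: one row per segment,
# with an "id" column and a "text" column.
csv_content = """id,text
1,Arma virumque cano
2,Troiae qui primus ab oris
"""

# Read one (id, text) pair per row.
rows = list(csv.DictReader(io.StringIO(csv_content)))
segments = [(row["id"], row["text"]) for row in rows]
print(segments)  # [('1', 'Arma virumque cano'), ('2', 'Troiae qui primus ab oris')]
```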

Pipelines

LociSimiles provides ready-to-use pipelines for detecting intertextual links. Each pipeline takes a query document and a source document and returns scored matches.

Two-Stage Pipeline

The recommended pipeline for most use cases. It first retrieves the most promising candidates using embedding similarity, then classifies each candidate pair with a fine-tuned transformer model.

from locisimiles import ClassificationPipelineWithCandidateGeneration
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipelineWithCandidateGeneration(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",  # or "cuda", "mps"
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)

Classification Pipeline

Classifies every possible query–source pair using a fine-tuned sequence-classification model. More thorough but slower — best suited for smaller datasets.

from locisimiles import ClassificationPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(
    classification_name="julian-schelb/xlm-roberta-large-class-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, batch_size=32)

Retrieval Pipeline

A fast, lightweight pipeline that ranks source segments by embedding similarity and applies a top-k or threshold criterion. No classification model needed.

from locisimiles import RetrievalPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(
    embedding_model_name="julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device="cpu",
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=5)
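Conceptually, this pipeline embeds every segment and ranks source segments by cosine similarity to each query segment, keeping the top-k. A minimal sketch of that ranking step with hypothetical toy vectors (not the actual embedding model):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "embeddings" for one query and three source segments.
query_vec = [1.0, 0.0, 1.0]
source_vecs = {"s1": [1.0, 0.1, 0.9], "s2": [0.0, 1.0, 0.0], "s3": [0.9, 0.0, 1.1]}

# Rank source segments by similarity to the query and keep the top-k.
top_k = 2
ranked = sorted(source_vecs, key=lambda sid: cosine(query_vec, source_vecs[sid]), reverse=True)
print(ranked[:top_k])  # ['s1', 's3']
```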

Latin BERT Retrieval Pipeline (Gong-Style)

A contextual token-level retrieval pipeline using BERT. Computes word-level embeddings and scores segment pairs by their best token-token cosine similarity. Excels at detecting lexical echoes and word reuse.

from locisimiles import Document, LatinBertRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = LatinBertRetrievalPipeline(
    model_name="ashleygong03/bamman-burns-latin-bert",  # or model_path for local
    top_k=10,
    similarity_threshold=0.85,
    max_length=256,
    min_token_length=2,
    use_stopword_filter=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
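The scoring idea (each segment pair scored by its single best token-token cosine similarity) can be sketched with hypothetical per-token vectors; the real pipeline gets these from contextual BERT embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_token_score(query_tokens, source_tokens):
    # Score a segment pair by its single best token-token similarity,
    # as in the Gong-style scoring described above.
    return max(cosine(q, s) for q in query_tokens for s in source_tokens)

# Hypothetical per-token "contextual embeddings" for two short segments.
query_tokens = [[1.0, 0.0], [0.7, 0.7]]
source_tokens = [[0.0, 1.0], [1.0, 0.1]]

score = best_token_score(query_tokens, source_tokens)
print(round(score, 3))  # 0.995
```

A pair passes the `similarity_threshold` when this best token-token score exceeds it.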

Word2Vec Retrieval Pipeline (Burns-Style)

A retrieval-only pipeline using a local Word2Vec model and pair-aware bigram scoring.

from locisimiles import Document, Word2VecRetrievalPipeline

query = Document("query.csv")
source = Document("source.csv")

pipeline = Word2VecRetrievalPipeline(
    model_path="./models/latin_w2v_bamman_lemma300_100_1.model",
    top_k=10,
    similarity_threshold=0.85,
    interval=2,
    order_free=True,
)

results = pipeline.run(query=query, source=source, top_k=10)
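The `interval` and `order_free` parameters control how bigrams are formed. A hypothetical sketch of order-free skip-bigram extraction within a window (an illustration of the idea, not the library's exact algorithm):

```python
def extract_bigrams(tokens, interval=2, order_free=True):
    # Pair each token with every token up to `interval` positions ahead;
    # with order_free=True the pair is stored in sorted order, so that
    # reversed word order still yields the same bigram.
    bigrams = set()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + interval, len(tokens))):
            pair = (left, tokens[j])
            if order_free:
                pair = tuple(sorted(pair))
            bigrams.add(pair)
    return bigrams

a = extract_bigrams("arma virumque cano".split())
b = extract_bigrams("cano virumque arma".split())
print(a == b)  # True: order-free bigrams make the two orderings equivalent
```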

Model path expectations:

  • The path must reference a local gensim .model file.
  • If no path is provided, LociSimiles checks models/latin_w2v_bamman_lemma300_100_1.model.
  • The model is not downloaded automatically.
  • Input text should already be lemmatized.

Rule-Based Pipeline

A purely lexical pipeline that does not require any neural models. It finds shared words between segments and applies distance, punctuation, and optional POS / similarity filters.

from locisimiles import RuleBasedPipeline
from locisimiles import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RuleBasedPipeline(min_shared_words=2, max_distance=3)

# Run pipeline
results = pipeline.run(query=query, source=source)
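The `min_shared_words` criterion can be sketched in plain Python (a conceptual illustration only; the actual pipeline additionally applies the distance, punctuation, and optional POS / similarity filters mentioned above):

```python
def shared_words(query_text, source_text, min_shared_words=2):
    # Lowercase bag-of-words overlap between two segments; a pair
    # counts as a candidate only if the overlap is large enough.
    q = set(query_text.lower().split())
    s = set(source_text.lower().split())
    common = q & s
    return common if len(common) >= min_shared_words else set()

match = shared_words("arma virumque cano", "virumque cano Troiae")
print(sorted(match))  # ['cano', 'virumque']
```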

Pipeline Summary

| Pipeline                                      | Speed  | Models required        | Best for                              |
|-----------------------------------------------|--------|------------------------|---------------------------------------|
| ClassificationPipelineWithCandidateGeneration | Medium | Embedding + classifier | Most use cases                        |
| ClassificationPipeline                        | Slow   | Classifier             | Small datasets, exhaustive comparison |
| RetrievalPipeline                             | Fast   | Embedding              | Quick similarity search               |
| Word2VecRetrievalPipeline                     | Fast   | Local Word2Vec .model  | Lightweight Burns-style retrieval     |
| RuleBasedPipeline                             | Fast   | None                   | No GPU, lexical matching              |

Saving Results

Save pipeline output to CSV or JSON:

results = pipeline.run(query=query, source=source, top_k=10)

# Save directly from the pipeline
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

# Or use standalone utility functions
from locisimiles.pipeline import results_to_csv, results_to_json
results_to_csv(results, "results.csv")
results_to_json(results, "results.json")

Building Custom Pipelines

For advanced use cases you can compose your own pipeline from individual generators and judges using the generic Pipeline class.

A generator selects candidate source segments for each query segment. A judge then scores or classifies each candidate pair.

Available Generators

| Generator                   | Description                                              |
|-----------------------------|----------------------------------------------------------|
| EmbeddingCandidateGenerator | Semantic similarity via sentence transformers + ChromaDB |
| ExhaustiveCandidateGenerator | All pairs — no filtering                                |
| RuleBasedCandidateGenerator | Lexical matching + linguistic filters                    |
| Word2VecCandidateGenerator  | Pair-aware Word2Vec bigram retrieval                     |

Available Judges

| Judge               | Description                                             |
|---------------------|---------------------------------------------------------|
| ClassificationJudge | Transformer sequence classification (P(positive))       |
| ThresholdJudge      | Binary decisions from candidate scores (top-k or threshold) |
| IdentityJudge       | Pass-through — judgment_score = 1.0                     |

Example

from locisimiles import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge

pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)

results = pipeline.run(query=query_doc, source=source_doc, top_k=10)

# You can also run each stage separately
candidates = pipeline.generate_candidates(query=query_doc, source=source_doc, top_k=10)
results = pipeline.judge_candidates(query=query_doc, candidates=candidates)

Evaluation

Use the IntertextEvaluator to assess detection quality:

from locisimiles import IntertextEvaluator

evaluator = IntertextEvaluator(
    predictions=predictions,
    ground_truth=ground_truth
)

metrics = evaluator.evaluate()
print(f"Precision: {metrics['precision']:.3f}")
print(f"Recall: {metrics['recall']:.3f}")
print(f"F1: {metrics['f1']:.3f}")
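The reported metrics follow the standard definitions. A self-contained sketch over hypothetical predicted and gold link sets (an illustration of the arithmetic, not the evaluator's internals):

```python
def precision_recall_f1(predicted, gold):
    # Treat predictions and ground truth as sets of (query_id, source_id) links.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = {("q1", "s1"), ("q1", "s2"), ("q2", "s3")}
gold = {("q1", "s1"), ("q2", "s3"), ("q3", "s4")}
p, r, f = precision_recall_f1(predicted, gold)
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1: {f:.3f}")
```

Here two of three predicted links are correct and two of three gold links are found, so precision, recall, and F1 all come out to 0.667.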

Next Steps

GUI Quick Flow (Word2Vec)

  1. Start GUI with locisimiles-gui.
  2. Upload query/source CSV files (seg_id, text).
  3. In Pipeline Configuration, choose Word2Vec Retrieval (Burns-Style).
  4. Set a valid local Word2Vec .model path.
  5. Configure Bigram Interval and Order-Free Bigrams.
  6. Run processing and inspect thresholded matches in the results step.