Pipelines

Ready-to-use pipelines for detecting intertextual parallels in Latin literature.

Each pipeline loads its own models and exposes a single run() method that accepts two Document objects and returns scored results.

TwoStagePipeline

Two-stage pipeline: embedding retrieval + classification.

Combines a fast embedding-based retrieval step with a more expensive classification step to efficiently identify intertextual parallels in large corpora.

Pipeline steps:

  1. Retrieval - Encode all segments with a sentence-transformer model and retrieve the top_k most similar source segments for each query segment using cosine similarity.
  2. Classification - Feed each query-candidate pair into a fine-tuned sequence-classification model. The positive-class probability is used as the judgment score.
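The retrieve-then-classify flow above can be sketched in plain Python. Everything here is illustrative: `cosine`, `two_stage`, and the `classify` callback are stand-ins for the sentence-transformer and the fine-tuned classifier, not the library's actual API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def two_stage(query_vecs, source_vecs, classify, top_k=2):
    """Retrieve the top_k sources per query, then score only those pairs."""
    results = {}
    for qi, q in enumerate(query_vecs):
        # Stage 1 (cheap): rank every source segment by cosine similarity.
        ranked = sorted(range(len(source_vecs)),
                        key=lambda si: cosine(q, source_vecs[si]),
                        reverse=True)[:top_k]
        # Stage 2 (expensive): run the classifier on the survivors only.
        results[qi] = {si: classify(qi, si) for si in ranked}
    return results
```

The point of the split is cost: the classifier only ever sees `top_k` candidates per query instead of the full Cartesian product of segments.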
PARAMETER DESCRIPTION
classification_name

HuggingFace model identifier for the sequence-classification model.

TYPE: str DEFAULT: 'julian-schelb/xlm-roberta-large-class-lat-intertext-v1'

embedding_model_name

HuggingFace model identifier for the sentence-transformer.

TYPE: str DEFAULT: 'julian-schelb/multilingual-e5-large-emb-lat-intertext-v1'

device

Torch device string ("cpu", "cuda", …).

TYPE: str | int | None DEFAULT: None

pos_class_idx

Index of the positive class in the classifier output.

TYPE: int DEFAULT: 1

Example
from locisimiles.pipeline import TwoStagePipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = TwoStagePipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)

device property

device: str

Device used by the classification judge.

ClassificationPipeline

Classification pipeline for exhaustive pairwise comparison.

For each query segment, every source segment is considered as a candidate. Each query-source pair is then fed to a fine-tuned sequence-classification model that outputs the probability that the pair is an intertextual match.

Pipeline steps:

  1. Candidate generation - Create all possible query-source pairs.
  2. Classification - Score each pair with a HuggingFace sequence-classification model. The positive-class probability is used as the judgment score.
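The exhaustive pairing can be sketched as a Cartesian product. `score_pair` is a hypothetical stand-in for the classifier; note that the cost grows as |query| × |source|, which is what the two-stage pipeline avoids.

```python
from itertools import product

def exhaustive_scores(queries, sources, score_pair):
    """Score every (query, source) combination with the given judge."""
    return {(qi, si): score_pair(q, s)
            for (qi, q), (si, s) in product(enumerate(queries),
                                            enumerate(sources))}
```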
PARAMETER DESCRIPTION
classification_name

HuggingFace model identifier for the sequence-classification model.

TYPE: str DEFAULT: 'julian-schelb/xlm-roberta-large-class-lat-intertext-v1'

device

Torch device string ("cpu", "cuda", ...).

TYPE: str | int | None DEFAULT: None

pos_class_idx

Index of the positive class in the classifier output.

TYPE: int DEFAULT: 1

Example
from locisimiles.pipeline import ClassificationPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source)

device property

device: str

Device used by the classification judge.

RetrievalPipeline

locisimiles.pipeline.retrieval.RetrievalPipeline

RetrievalPipeline(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
    top_k: int = 10,
    similarity_threshold: Optional[float] = None,
)

Retrieval pipeline based on embedding similarity.

Uses a sentence-transformer model to encode query and source segments into dense vectors and ranks candidates by cosine similarity.

Pipeline steps:

  1. Encoding - Encode all query and source segments with a sentence-transformer model.
  2. Retrieval - For each query, retrieve all source segments ranked by cosine similarity.
  3. Thresholding - Mark the top_k most similar candidates as positive, or all candidates above similarity_threshold if provided.
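The thresholding step's two labelling modes can be sketched as follows. The input shape here, a similarity-sorted list of (source_id, score) pairs, is an assumption for illustration, not the library's actual result structure.

```python
def mark_positive(ranked, top_k=10, similarity_threshold=None):
    """Label candidates positive by threshold if given, else by rank."""
    if similarity_threshold is not None:
        # Threshold mode: keep every candidate at or above the cutoff.
        return [sid for sid, score in ranked if score >= similarity_threshold]
    # Top-k mode: keep the k highest-ranked candidates.
    return [sid for sid, _ in ranked[:top_k]]
```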
PARAMETER DESCRIPTION
embedding_model_name

HuggingFace model identifier for the sentence-transformer.

TYPE: str DEFAULT: 'julian-schelb/multilingual-e5-large-emb-lat-intertext-v1'

device

Torch device string ("cpu", "cuda", …).

TYPE: str | int | None DEFAULT: None

top_k

Number of top candidates to mark as positive.

TYPE: int DEFAULT: 10

similarity_threshold

If provided, candidates above this similarity threshold are marked positive instead of the top_k most similar.

TYPE: Optional[float] DEFAULT: None

Example
from locisimiles.pipeline import RetrievalPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)

device property

device: str

Device used by the embedding generator.

RuleBasedPipeline

locisimiles.pipeline.rule_based.RuleBasedPipeline

RuleBasedPipeline(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)

Rule-based pipeline for lexical intertextuality detection.

Identifies potential quotations, allusions, and textual reuse between Latin documents using a multi-stage rule-based approach.

Pipeline steps:

  1. Text preprocessing - Normalise prefix assimilations (e.g. adt- → att-, inm- → imm-), quotation marks, and whitespace connectors. Apply genre-specific phrasing rules for prose or poetry.
  2. Text matching - Tokenise both documents and find shared non-stopword tokens between every query-source segment pair.
  3. Distance criterion - Discard matches where shared words are too far apart (controlled by max_distance).
  4. Scissa filter - Compare punctuation patterns around shared words to strengthen evidence of deliberate textual reuse.
  5. HTRG filter (optional) - Part-of-speech analysis using a HuggingFace token-classification model. Requires torch.
  6. Similarity filter (optional) - Word-embedding similarity check using spaCy vectors. Requires spacy.
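Steps 2 and 3 can be sketched with plain token sets. The helper names and the exact distance definition below are illustrative assumptions, not the library's internals.

```python
def shared_nonstop(query_tokens, source_tokens, stopwords, min_shared_words=2):
    """Shared non-stopword tokens, or an empty set if below the minimum."""
    shared = (set(query_tokens) & set(source_tokens)) - stopwords
    return shared if len(shared) >= min_shared_words else set()

def passes_distance(tokens, shared, max_distance=3):
    """True if consecutive shared-word positions are close enough."""
    positions = [i for i, t in enumerate(tokens) if t in shared]
    return all(b - a <= max_distance
               for a, b in zip(positions, positions[1:]))
```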
PARAMETER DESCRIPTION
min_shared_words

Minimum number of shared non-stopwords required.

TYPE: int DEFAULT: 2

min_complura

Minimum adjacent tokens for complura detection.

TYPE: int DEFAULT: 4

max_distance

Maximum distance between shared words.

TYPE: int DEFAULT: 3

similarity_threshold

Threshold for semantic similarity filter.

TYPE: float DEFAULT: 0.3

stopwords

Set of stopwords to exclude. Uses defaults if None.

TYPE: Optional[Set[str]] DEFAULT: None

use_htrg

Whether to apply HTRG (POS-based) filter. Requires torch.

TYPE: bool DEFAULT: False

use_similarity

Whether to apply similarity filter. Requires spacy.

TYPE: bool DEFAULT: False

pos_model

HuggingFace model name for POS tagging.

TYPE: str DEFAULT: 'enelpol/evalatin2022-pos-open'

spacy_model

spaCy model name for embeddings.

TYPE: str DEFAULT: 'la_core_web_lg'

device

Device for neural models ("cuda", "cpu", or None for auto).

TYPE: Optional[str] DEFAULT: None

Example
from locisimiles.pipeline import RuleBasedPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RuleBasedPipeline()

# Run pipeline
results = pipeline.run(query=query, source=source)

min_shared_words property

min_shared_words: int

Minimum number of shared non-stopwords required.

min_complura property

min_complura: int

Minimum adjacent tokens for complura detection.

max_distance property

max_distance: int

Maximum distance between shared words.

similarity_threshold property

similarity_threshold: float

Threshold for semantic similarity filter.

stopwords property

stopwords: Set[str]

Set of stopwords to exclude.

use_htrg property

use_htrg: bool

Whether HTRG (POS-based) filter is enabled.

use_similarity property

use_similarity: bool

Whether similarity filter is enabled.

device property

device: str

Device for neural models.

load_stopwords

load_stopwords(filepath: Union[str, Path]) -> None

Load stopwords from a file (one word per line).

PARAMETER DESCRIPTION
filepath

Path to stopwords file.

TYPE: Union[str, Path]
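The expected file format (one word per line) can be illustrated with a minimal reader. `read_stopwords` is a stand-in written to show the format, not the library's implementation of load_stopwords.

```python
from pathlib import Path

def read_stopwords(filepath):
    """Parse a one-word-per-line stopword file into a set."""
    text = Path(filepath).read_text(encoding="utf-8")
    # Skip blank lines and strip surrounding whitespace.
    return {line.strip() for line in text.splitlines() if line.strip()}
```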