Pipelines¶
Ready-to-use pipelines for detecting intertextual parallels in Latin literature.
Each pipeline loads its own models and exposes a single run() method that
accepts two Document objects and returns scored results.
TwoStagePipeline¶
Two-stage pipeline: embedding retrieval + classification.
Combines a fast embedding-based retrieval step with a more expensive classification step to efficiently identify intertextual parallels in large corpora.
Pipeline steps:
- Retrieval - Encode all segments with a sentence-transformer model and retrieve the top_k most similar source segments for each query segment using cosine similarity.
- Classification - Feed each query-candidate pair into a fine-tuned sequence-classification model. The positive-class probability is used as the judgment score.
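The retrieval stage can be sketched in plain Python. This is a minimal illustration with toy pre-computed vectors standing in for sentence-transformer embeddings, not the library's actual implementation; the function names here are invented for the sketch.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec: Sequence[float],
                   source_vecs: List[Sequence[float]],
                   top_k: int) -> List[int]:
    """Stage 1 (retrieval): rank source segment indices by cosine similarity."""
    ranked = sorted(range(len(source_vecs)),
                    key=lambda i: cosine(query_vec, source_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings standing in for sentence-transformer output
query_vec = [0.9, 0.1, 0.0]
source_vecs = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 1.0]]
candidates = retrieve_top_k(query_vec, source_vecs, top_k=2)
# Stage 2 (classification) would then score each (query, candidate) pair
print(candidates)
```

Only the retrieved candidates reach the classifier, which is what makes the two-stage design cheaper than exhaustive pairwise comparison.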
| PARAMETER | DESCRIPTION |
|---|---|
| `classification_name` | HuggingFace model identifier for the sequence-classification model. |
| `embedding_model_name` | HuggingFace model identifier for the sentence-transformer. |
| `device` | Torch device string. |
| `pos_class_idx` | Index of the positive class in the classifier output. |
Example
```python
from locisimiles.pipeline import ClassificationPipelineWithCandidategeneration
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipelineWithCandidategeneration(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
ClassificationPipeline¶
Classification pipeline for exhaustive pairwise comparison.
For each query segment every source segment is considered as a candidate. Each query-source pair is then fed to a fine-tuned sequence-classification model that outputs the probability of the pair being an intertextual match.
Pipeline steps:
- Candidate generation - Create all possible query-source pairs.
- Classification - Score each pair with a HuggingFace sequence-classification model. The positive-class probability is used as the judgment score.
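Candidate generation here is simply the Cartesian product of the two segment lists. A minimal sketch, using invented toy segments in place of real `Document` contents:

```python
from itertools import product

# Hypothetical segment texts; in practice these come from Document objects
query_segments = ["arma virumque cano", "litora multum ille"]
source_segments = ["arma gravi numero", "bella per emathios", "carmina qui quondam"]

# Candidate generation: every query segment is paired with every source segment
pairs = list(product(query_segments, source_segments))
print(len(pairs))  # 2 queries x 3 sources = 6 pairs, each later scored by the classifier
```

Because the pair count grows as the product of the two document lengths, this pipeline is best suited to smaller corpora; the two-stage pipeline above exists to avoid exactly this blow-up.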
| PARAMETER | DESCRIPTION |
|---|---|
| `classification_name` | HuggingFace model identifier for the sequence-classification model. |
| `device` | Torch device string. |
| `pos_class_idx` | Index of the positive class in the classifier output. |
Example
```python
from locisimiles.pipeline import ClassificationPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source)
```
RetrievalPipeline¶
locisimiles.pipeline.retrieval.RetrievalPipeline¶
```python
RetrievalPipeline(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
    top_k: int = 10,
    similarity_threshold: Optional[float] = None,
)
```
Retrieval pipeline based on embedding similarity.
Uses a sentence-transformer model to encode query and source segments into dense vectors and ranks candidates by cosine similarity.
Pipeline steps:
- Encoding - Encode all query and source segments with a sentence-transformer model.
- Retrieval - For each query, retrieve all source segments ranked by cosine similarity.
- Thresholding - Mark the top_k most similar candidates as positive, or all candidates above similarity_threshold if provided.
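The thresholding step can be illustrated as follows. This is a sketch assuming scored candidates are `(source_id, similarity)` tuples; the pipeline's actual result types may differ.

```python
def select_candidates(scored, top_k=10, similarity_threshold=None):
    """Mark candidates positive by rank (top_k) or, if a threshold is given,
    by absolute cosine similarity."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if similarity_threshold is not None:
        return [(sid, sim) for sid, sim in ranked if sim >= similarity_threshold]
    return ranked[:top_k]

scores = [("s1", 0.91), ("s2", 0.42), ("s3", 0.77)]
by_rank = select_candidates(scores, top_k=2)                       # two best candidates
by_threshold = select_candidates(scores, similarity_threshold=0.5) # all above 0.5
print(by_rank, by_threshold)
```

Rank-based selection guarantees a fixed number of candidates per query; threshold-based selection lets that number vary with how similar the texts actually are.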
| PARAMETER | DESCRIPTION |
|---|---|
| `embedding_model_name` | HuggingFace model identifier for the sentence-transformer. |
| `device` | Torch device string. |
| `top_k` | Number of top candidates to mark as positive. |
| `similarity_threshold` | If provided, uses this threshold instead of top-k. |
Example
```python
from locisimiles.pipeline import RetrievalPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
RuleBasedPipeline¶
locisimiles.pipeline.rule_based.RuleBasedPipeline¶
```python
RuleBasedPipeline(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)
```
Rule-based pipeline for lexical intertextuality detection.
Identifies potential quotations, allusions, and textual reuse between Latin documents using a multi-stage rule-based approach.
Pipeline steps:
- Text preprocessing - Normalise prefix assimilations (e.g. adt- → att-, inm- → imm-), quotation marks, and whitespace connectors. Apply genre-specific phrasing rules for prose or poetry.
- Text matching - Tokenise both documents and find shared non-stopword tokens between every query-source segment pair.
- Distance criterion - Discard matches where shared words are too far apart (controlled by max_distance).
- Scissa filter - Compare punctuation patterns around shared words to strengthen evidence of deliberate textual reuse.
- HTRG filter (optional) - Part-of-speech analysis using a HuggingFace token-classification model. Requires torch.
- Similarity filter (optional) - Word-embedding similarity check using spaCy vectors. Requires spacy.
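The matching and distance steps can be sketched in a few lines. This is a simplified illustration, not the library's actual code: it treats the distance criterion as the position span of all shared words within the query segment, and the function name is invented for the sketch.

```python
def shared_word_matches(query_tokens, source_tokens, stopwords,
                        min_shared_words=2, max_distance=3):
    """Return shared non-stopword tokens if there are enough of them and
    their positions in the query segment are close together."""
    # Text matching: shared non-stopword tokens between the two segments
    shared = [t for t in set(query_tokens) & set(source_tokens) if t not in stopwords]
    if len(shared) < min_shared_words:
        return []
    # Distance criterion: shared words in the query must be near each other
    positions = sorted(query_tokens.index(t) for t in shared)
    if positions[-1] - positions[0] > max_distance:
        return []
    return sorted(shared)

q = "arma virumque cano troiae".split()
s = "arma cano gravi numero".split()
print(shared_word_matches(q, s, stopwords={"et", "in"}))  # ['arma', 'cano']
```

The scissa, HTRG, and similarity filters would then prune this candidate set further using punctuation, POS, and embedding evidence respectively.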
| PARAMETER | DESCRIPTION |
|---|---|
| `min_shared_words` | Minimum number of shared non-stopwords required. |
| `min_complura` | Minimum adjacent tokens for complura detection. |
| `max_distance` | Maximum distance between shared words. |
| `similarity_threshold` | Threshold for the semantic similarity filter. |
| `stopwords` | Set of stopwords to exclude. Uses the built-in defaults if not provided. |
| `use_htrg` | Whether to apply the HTRG (POS-based) filter. Requires torch. |
| `use_similarity` | Whether to apply the similarity filter. Requires spacy. |
| `pos_model` | HuggingFace model name for POS tagging. |
| `spacy_model` | spaCy model name for embeddings. |
| `device` | Device for neural models. |