Pipelines¶
Ready-to-use pipelines for detecting intertextual parallels in Latin literature.
Each pipeline loads its own models and exposes a single run() method that
accepts two Document objects and returns scored results.
TwoStagePipeline¶
Two-stage pipeline: embedding retrieval + classification.
Combines a fast embedding-based retrieval step with a more expensive classification step to efficiently identify intertextual parallels in large corpora.
Pipeline steps:
- Retrieval - Encode all segments with a sentence-transformer model and retrieve the top_k most similar source segments for each query segment using cosine similarity.
- Classification - Feed each query-candidate pair into a fine-tuned sequence-classification model. The positive-class probability is used as the judgment score.
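The retrieval stage can be sketched in plain Python. This is a minimal illustration with toy pre-computed vectors standing in for sentence-transformer embeddings, not the library's actual implementation; the function names here are invented for the sketch.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec: Sequence[float],
                   source_vecs: List[Sequence[float]],
                   top_k: int) -> List[int]:
    """Stage 1 (retrieval): rank source segment indices by cosine similarity."""
    ranked = sorted(range(len(source_vecs)),
                    key=lambda i: cosine(query_vec, source_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings standing in for sentence-transformer output
query_vec = [0.9, 0.1, 0.0]
source_vecs = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 1.0]]
candidates = retrieve_top_k(query_vec, source_vecs, top_k=2)
# Stage 2 (classification) would then score each (query, candidate) pair
print(candidates)
```

Only the retrieved candidates reach the classifier, which is what makes the two-stage design cheaper than exhaustive pairwise comparison.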
| PARAMETER | DESCRIPTION |
|---|---|
| `classification_name` | HuggingFace model identifier for the sequence-classification model. |
| `embedding_model_name` | HuggingFace model identifier for the sentence-transformer. |
| `device` | Torch device string. |
| `pos_class_idx` | Index of the positive class in the classifier output. |
Example
```python
from locisimiles.pipeline import ClassificationPipelineWithCandidategeneration
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipelineWithCandidategeneration(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
ClassificationPipeline¶
Classification pipeline for exhaustive pairwise comparison.
For each query segment every source segment is considered as a candidate. Each query-source pair is then fed to a fine-tuned sequence-classification model that outputs the probability of the pair being an intertextual match.
Pipeline steps:
- Candidate generation - Create all possible query-source pairs.
- Classification - Score each pair with a HuggingFace sequence-classification model. The positive-class probability is used as the judgment score.
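Candidate generation here is simply the Cartesian product of the two segment lists. A minimal sketch, using invented toy segments in place of real `Document` contents:

```python
from itertools import product

# Hypothetical segment texts; in practice these come from Document objects
query_segments = ["arma virumque cano", "litora multum ille"]
source_segments = ["arma gravi numero", "bella per emathios", "carmina qui quondam"]

# Candidate generation: every query segment is paired with every source segment
pairs = list(product(query_segments, source_segments))
print(len(pairs))  # 2 queries x 3 sources = 6 pairs, each later scored by the classifier
```

Because the pair count grows as the product of the two document lengths, this pipeline is best suited to smaller corpora; the two-stage pipeline above exists to avoid exactly this blow-up.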
| PARAMETER | DESCRIPTION |
|---|---|
| `classification_name` | HuggingFace model identifier for the sequence-classification model. |
| `device` | Torch device string. |
| `pos_class_idx` | Index of the positive class in the classifier output. |
Example
```python
from locisimiles.pipeline import ClassificationPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = ClassificationPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source)
```
RetrievalPipeline¶
locisimiles.pipeline.retrieval.RetrievalPipeline¶
```python
RetrievalPipeline(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
    top_k: int = 10,
    similarity_threshold: Optional[float] = None,
)
```
Retrieval pipeline based on embedding similarity.
Uses a sentence-transformer model to encode query and source segments into dense vectors and ranks candidates by cosine similarity.
Pipeline steps:
- Encoding - Encode all query and source segments with a sentence-transformer model.
- Retrieval - For each query, retrieve all source segments ranked by cosine similarity.
- Thresholding - Mark the top_k most similar candidates as positive, or all candidates above similarity_threshold if provided.
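The thresholding step can be illustrated as follows. This is a sketch assuming scored candidates are `(source_id, similarity)` tuples; the pipeline's actual result types may differ.

```python
def select_candidates(scored, top_k=10, similarity_threshold=None):
    """Mark candidates positive by rank (top_k) or, if a threshold is given,
    by absolute cosine similarity."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if similarity_threshold is not None:
        return [(sid, sim) for sid, sim in ranked if sim >= similarity_threshold]
    return ranked[:top_k]

scores = [("s1", 0.91), ("s2", 0.42), ("s3", 0.77)]
by_rank = select_candidates(scores, top_k=2)                       # two best candidates
by_threshold = select_candidates(scores, similarity_threshold=0.5) # all above 0.5
print(by_rank, by_threshold)
```

Rank-based selection guarantees a fixed number of candidates per query; threshold-based selection lets that number vary with how similar the texts actually are.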
| PARAMETER | DESCRIPTION |
|---|---|
| `embedding_model_name` | HuggingFace model identifier for the sentence-transformer. |
| `device` | Torch device string. |
| `top_k` | Number of top candidates to mark as positive. |
| `similarity_threshold` | If provided, uses this threshold instead of top-k. |
Example
```python
from locisimiles.pipeline import RetrievalPipeline
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Define pipeline
pipeline = RetrievalPipeline(device="cpu")

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
```
RuleBasedPipeline¶
locisimiles.pipeline.rule_based.RuleBasedPipeline¶
```python
RuleBasedPipeline(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)
```
Rule-based pipeline for lexical intertextuality detection.
Identifies potential quotations, allusions, and textual reuse between Latin documents using a multi-stage rule-based approach.
Pipeline steps:
- Text preprocessing - Normalise prefix assimilations (e.g. adt- → att-, inm- → imm-), quotation marks, and whitespace connectors. Apply genre-specific phrasing rules for prose or poetry.
- Text matching - Tokenise both documents and find shared non-stopword tokens between every query-source segment pair.
- Distance criterion - Discard matches where shared words are too far apart (controlled by max_distance).
- Scissa filter - Compare punctuation patterns around shared words to strengthen evidence of deliberate textual reuse.
- HTRG filter (optional) - Part-of-speech analysis using a HuggingFace token-classification model. Requires torch.
- Similarity filter (optional) - Word-embedding similarity check using spaCy vectors. Requires spacy.
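The matching and distance steps can be sketched in a few lines. This is a simplified illustration, not the library's actual code: it treats the distance criterion as the position span of all shared words within the query segment, and the function name is invented for the sketch.

```python
def shared_word_matches(query_tokens, source_tokens, stopwords,
                        min_shared_words=2, max_distance=3):
    """Return shared non-stopword tokens if there are enough of them and
    their positions in the query segment are close together."""
    # Text matching: shared non-stopword tokens between the two segments
    shared = [t for t in set(query_tokens) & set(source_tokens) if t not in stopwords]
    if len(shared) < min_shared_words:
        return []
    # Distance criterion: shared words in the query must be near each other
    positions = sorted(query_tokens.index(t) for t in shared)
    if positions[-1] - positions[0] > max_distance:
        return []
    return sorted(shared)

q = "arma virumque cano troiae".split()
s = "arma cano gravi numero".split()
print(shared_word_matches(q, s, stopwords={"et", "in"}))  # ['arma', 'cano']
```

The scissa, HTRG, and similarity filters would then prune this candidate set further using punctuation, POS, and embedding evidence respectively.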
| PARAMETER | DESCRIPTION |
|---|---|
| `min_shared_words` | Minimum number of shared non-stopwords required. |
| `min_complura` | Minimum adjacent tokens for complura detection. |
| `max_distance` | Maximum distance between shared words. |
| `similarity_threshold` | Threshold for the semantic similarity filter. |
| `stopwords` | Set of stopwords to exclude. Uses the built-in defaults if not provided. |
| `use_htrg` | Whether to apply the HTRG (POS-based) filter. Requires torch. |
| `use_similarity` | Whether to apply the similarity filter. Requires spacy. |
| `pos_model` | HuggingFace model name for POS tagging. |
| `spacy_model` | spaCy model name for embeddings. |
| `device` | Device for neural models. |