Generators¶
Candidate generators narrow the search space by selecting source segments that are most likely to be relevant for each query segment.
All generators inherit from CandidateGeneratorBase and implement a
generate() method returning CandidateGeneratorOutput.
CandidateGeneratorBase¶
locisimiles.pipeline.generator._base.CandidateGeneratorBase
¶
Abstract base class for candidate generators.
A candidate generator narrows the search space by producing a ranked
list of source segments for each query segment. The output is a
CandidateGeneratorOutput — a dictionary mapping query-segment IDs
to lists of Candidate objects, each containing a source segment
and a relevance score.
Subclasses must implement generate().
Available implementations:
- `EmbeddingCandidateGenerator` — semantic similarity via sentence transformers + ChromaDB.
- `ExhaustiveCandidateGenerator` — returns all query–source pairs without filtering.
- `RuleBasedCandidateGenerator` — lexical matching with linguistic filters for Latin texts.
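The output shape described above can be sketched in plain Python. The `Candidate` stand-in below only mirrors the structure (a source segment plus a relevance score); its field names are illustrative, not necessarily those of the actual locisimiles class.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Illustrative stand-in: a source segment paired with a score."""
    segment_id: str
    text: str
    score: float


# A CandidateGeneratorOutput maps each query-segment ID to a ranked
# list of Candidate objects.
output: dict[str, list[Candidate]] = {
    "query-1": [
        Candidate("source-7", "arma virumque cano", 0.93),
        Candidate("source-2", "italiam fato profugus", 0.71),
    ],
    "query-2": [],  # a query segment may have no candidates
}

for query_id, cands in output.items():
    print(query_id, [c.segment_id for c in cands])
```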
generate
abstractmethod
¶
```python
generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput
```
Generate candidate segments from source for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |
| `**kwargs` | `Any` | Generator-specific parameters. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects. |
EmbeddingCandidateGenerator¶
Generate candidates using semantic embedding similarity with sentence transformers and ChromaDB.
locisimiles.pipeline.generator.embedding.EmbeddingCandidateGenerator
¶
```python
EmbeddingCandidateGenerator(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
)
```
Generate candidates using semantic embedding similarity.
Encodes query and source segments with a sentence-transformer model, builds an ephemeral ChromaDB index on the source embeddings, and retrieves the most similar source segments for each query segment.
The number of candidates per query is controlled by the top_k
parameter passed to generate().
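The ranking idea behind `top_k` can be shown without the neural stack. The pure-Python sketch below scores toy embedding vectors by cosine similarity and keeps the `top_k` best; the real class delegates the encoding to a sentence-transformer and the retrieval to ChromaDB, so none of these helper names come from the library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query_emb: list[float],
    source_embs: dict[str, list[float]],
    top_k: int,
) -> list[tuple[str, float]]:
    """Rank source segments by similarity and keep the top_k best."""
    scored = [(sid, cosine(query_emb, emb)) for sid, emb in source_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


sources = {
    "s1": [1.0, 0.0],
    "s2": [0.7, 0.7],
    "s3": [0.0, 1.0],
}
print(top_k_candidates([1.0, 0.1], sources, top_k=2))
```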
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embedding_model_name` | `str` | HuggingFace model identifier for the sentence-transformer. Defaults to the pre-trained Latin intertextuality model. |
| `device` | `str \| int \| None` | Torch device string (e.g. `"cpu"`) or device index on which to run the model. |
Example

```python
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate candidates
generator = EmbeddingCandidateGenerator(device="cpu")
candidates = generator.generate(query=query, source=source, top_k=10)

# candidates is a dict: {query_id: [Candidate, ...]}
for query_id, cands in candidates.items():
    print(f"{query_id}: {len(cands)} candidates")
```
build_source_index
¶
```python
build_source_index(
    source_segments: Sequence[TextSegment],
    source_embeddings: ndarray,
    collection_name: str = "source_segments",
    batch_size: int = 5000,
) -> Collection
```
Create an ephemeral Chroma collection from segments and embeddings.
generate
¶
```python
generate(
    *,
    query: Document,
    source: Document,
    top_k: int = 100,
    query_prompt_name: str = "query",
    source_prompt_name: str = "match",
    **kwargs: Any,
) -> CandidateGeneratorOutput
```
Generate candidates by embedding similarity.
Encodes all segments, indexes the source embeddings, and returns
the top_k most similar source segments for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |
| `top_k` | `int` | Number of most-similar source segments to return per query segment. |
| `query_prompt_name` | `str` | Prompt name passed to the sentence-transformer for query encoding. |
| `source_prompt_name` | `str` | Prompt name passed to the sentence-transformer for source encoding. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to ranked lists of `Candidate` objects, sorted by descending cosine similarity. |
ExhaustiveCandidateGenerator¶
Return all source segments as candidates (no filtering).
locisimiles.pipeline.generator.exhaustive.ExhaustiveCandidateGenerator
¶
Treat every source segment as a candidate for every query segment.
No scoring or ranking is performed. Each Candidate.score is set
to 1.0 since all pairs are treated equally.
This generator is typically paired with a judge
(e.g. ClassificationJudge) that performs the actual scoring.
Best suited for smaller datasets where comparing all pairs is
feasible.
Example

```python
from locisimiles.pipeline.generator import ExhaustiveCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate all possible pairs
generator = ExhaustiveCandidateGenerator()
candidates = generator.generate(query=query, source=source)

# Total pairs = len(query) × len(source)
total = sum(len(c) for c in candidates.values())
print(f"{total} candidate pairs")
```
generate
¶
```python
generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput
```
Return all source segments as candidates for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects, each with score `1.0`. |
RuleBasedCandidateGenerator¶
Generate candidates using lexical matching and linguistic filters.
locisimiles.pipeline.generator.rule_based.RuleBasedCandidateGenerator
¶
```python
RuleBasedCandidateGenerator(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)
```
Generate candidates using lexical matching and linguistic filters.
This generator implements a multi-stage rule-based approach to detect potential intertextuality between Latin texts. It combines orthographic normalization, shared-word matching, distance criteria, punctuation agreement (scissa), and optional POS / embedding-based filters.
No neural models are required by default. The optional HTRG
(Part-of-Speech) filter needs torch and transformers, and the
similarity filter needs spacy with a Latin model.
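As a rough illustration of the shared-word stage described above, the sketch below normalizes orthography (folding v/j to u/i, an assumed convention for Latin, not necessarily the library's exact rule), removes stopwords, and checks the overlap against a `min_shared_words` threshold. It is not the library's actual implementation, which adds distance, scissa, and optional POS / embedding filters on top.

```python
def normalize(token: str) -> str:
    """Crude orthographic normalization: lowercase and fold v/j to u/i."""
    return token.lower().replace("v", "u").replace("j", "i")


def shared_nonstop_words(
    query_seg: str, source_seg: str, stopwords: set[str]
) -> set[str]:
    """Non-stopwords that appear (after normalization) in both segments."""
    q = {normalize(t) for t in query_seg.split()} - stopwords
    s = {normalize(t) for t in source_seg.split()} - stopwords
    return q & s


stop = {"et", "in", "est"}
query_seg = "Arma uirumque cano Troiae"
source_seg = "arma virumque et litora"

shared = shared_nonstop_words(query_seg, source_seg, stop)
min_shared_words = 2
print(shared, len(shared) >= min_shared_words)
```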
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `min_shared_words` | `int` | Minimum number of shared non-stopwords required for a segment pair to be considered a match. |
| `min_complura` | `int` | Minimum number of adjacent tokens for complura detection. |
| `max_distance` | `int` | Maximum allowed distance between shared words within a segment. |
| `similarity_threshold` | `float` | Cosine similarity threshold for the optional embedding-based filter. |
| `stopwords` | `Optional[Set[str]]` | Set of stopwords to exclude from matching. Uses a built-in Latin stopword list if `None`. |
| `use_htrg` | `bool` | Enable the HTRG (POS-based) filter. Requires `torch` and `transformers`. |
| `use_similarity` | `bool` | Enable the word-embedding similarity filter. Requires `spacy` with a Latin model. |
| `pos_model` | `str` | HuggingFace model name for POS tagging. |
| `spacy_model` | `str` | spaCy model name for word embeddings. |
| `device` | `Optional[str]` | Device for the optional neural models. |
Example

```python
from locisimiles.pipeline.generator import RuleBasedCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Create generator
generator = RuleBasedCandidateGenerator(min_shared_words=3)

# Generate candidates (genre hints improve preprocessing)
candidates = generator.generate(
    query=query,
    source=source,
    query_genre="prose",
    source_genre="poetry",
)

# Optionally load custom stopwords
generator.load_stopwords("my_stopwords.txt")
```
generate
¶
```python
generate(
    *,
    query: Document,
    source: Document,
    top_k: Optional[int] = None,
    query_genre: str = "prose",
    source_genre: str = "poetry",
    threshold: float = 0.5,
    **kwargs: Any,
) -> CandidateGeneratorOutput
```
Run the rule-based matching pipeline on query and source documents.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document (the text being analyzed for intertextuality). |
| `source` | `Document` | Source document (the potential origin of quotations). |
| `top_k` | `Optional[int]` | Maximum number of matches per query segment; `None` returns all matches. |
| `query_genre` | `str` | Genre of the query document (`"prose"` or `"poetry"`). |
| `source_genre` | `str` | Genre of the source document (`"prose"` or `"poetry"`). |
| `threshold` | `float` | Not used; included for API compatibility. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects. |
load_stopwords
¶
Load stopwords from a file (one word per line).
| PARAMETER | DESCRIPTION |
|---|---|
| `filepath` | Path to the stopwords file. |