Generators

Candidate generators narrow the search space by selecting source segments that are most likely to be relevant for each query segment.

All generators inherit from CandidateGeneratorBase and implement a generate() method returning CandidateGeneratorOutput.

CandidateGeneratorBase

locisimiles.pipeline.generator._base.CandidateGeneratorBase

Abstract base class for candidate generators.

A candidate generator narrows the search space by producing a ranked list of source segments for each query segment. The output is a CandidateGeneratorOutput — a dictionary mapping query-segment IDs to lists of Candidate objects, each containing a source segment and a relevance score.

Subclasses must implement generate().

Available implementations:

  • EmbeddingCandidateGenerator — semantic similarity via sentence transformers + ChromaDB.
  • ExhaustiveCandidateGenerator — returns all query–source pairs without filtering.
  • RuleBasedCandidateGenerator — lexical matching with linguistic filters for Latin texts.

generate abstractmethod

generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput

Generate candidate segments from source for each query segment.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

source

Source document.

TYPE: Document

**kwargs

Generator-specific parameters.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
CandidateGeneratorOutput

Mapping of query segment IDs → lists of Candidate objects.
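The shape of this output can be illustrated with plain Python. This is a sketch, not the library's actual classes: the `Candidate` stand-in below mirrors the documented contents (a source segment plus a relevance score), but the attribute names are assumptions.

```python
from dataclasses import dataclass

# Illustrative stand-in for the library's Candidate type (field names assumed).
@dataclass
class Candidate:
    segment: str   # source segment text (the library wraps this in a TextSegment)
    score: float   # relevance score

# A CandidateGeneratorOutput maps query-segment IDs to ranked candidate lists.
output = {
    "query_001": [Candidate("arma virumque cano", 0.92),
                  Candidate("italiam fato profugus", 0.71)],
    "query_002": [Candidate("litora multum ille", 0.55)],
}

for query_id, cands in output.items():
    best = max(cands, key=lambda c: c.score)
    print(f"{query_id}: {len(cands)} candidates, best score {best.score:.2f}")
```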

EmbeddingCandidateGenerator

Generate candidates using semantic embedding similarity with sentence transformers and ChromaDB.

locisimiles.pipeline.generator.embedding.EmbeddingCandidateGenerator

EmbeddingCandidateGenerator(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
)

Generate candidates using semantic embedding similarity.

Encodes query and source segments with a sentence-transformer model, builds an ephemeral ChromaDB index on the source embeddings, and retrieves the most similar source segments for each query segment.

The number of candidates per query is controlled by the top_k parameter passed to generate().

PARAMETER DESCRIPTION
embedding_model_name

HuggingFace model identifier for the sentence-transformer. Defaults to the pre-trained Latin intertextuality model.

TYPE: str DEFAULT: 'julian-schelb/multilingual-e5-large-emb-lat-intertext-v1'

device

Torch device string ("cpu", "cuda", "mps"), a CUDA device index, or None for automatic selection.

TYPE: str | int | None DEFAULT: None

Example
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate candidates
generator = EmbeddingCandidateGenerator(device="cpu")
candidates = generator.generate(query=query, source=source, top_k=10)

# candidates is a dict: {query_id: [Candidate, ...]}
for query_id, cands in candidates.items():
    print(f"{query_id}: {len(cands)} candidates")

build_source_index

build_source_index(
    source_segments: Sequence[TextSegment],
    source_embeddings: ndarray,
    collection_name: str = "source_segments",
    batch_size: int = 5000,
) -> Collection

Create an ephemeral Chroma collection from segments and embeddings.

generate

generate(
    *,
    query: Document,
    source: Document,
    top_k: int = 100,
    query_prompt_name: str = "query",
    source_prompt_name: str = "match",
    **kwargs: Any,
) -> CandidateGeneratorOutput

Generate candidates by embedding similarity.

Encodes all segments, indexes the source embeddings, and returns the top_k most similar source segments for each query segment.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

source

Source document.

TYPE: Document

top_k

Number of most-similar source segments to return per query segment.

TYPE: int DEFAULT: 100

query_prompt_name

Prompt name passed to the sentence-transformer for query encoding.

TYPE: str DEFAULT: 'query'

source_prompt_name

Prompt name passed to the sentence-transformer for source encoding.

TYPE: str DEFAULT: 'match'

RETURNS DESCRIPTION
CandidateGeneratorOutput

Mapping of query segment IDs → ranked lists of Candidate objects, sorted by descending cosine similarity.
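Conceptually, the retrieval step is a top-k cosine-similarity search over the source embeddings. A minimal NumPy sketch of that logic follows; the real generator delegates indexing and lookup to ChromaDB, so this is an illustration of the idea, not the implementation.

```python
import numpy as np

def top_k_by_cosine(query_embs: np.ndarray, source_embs: np.ndarray, k: int):
    """For each query row, return indices and scores of the k most similar source rows."""
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    s = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    sims = q @ s.T                               # shape (n_query, n_source)
    order = np.argsort(-sims, axis=1)[:, :k]     # top-k indices, descending similarity
    return order, np.take_along_axis(sims, order, axis=1)

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))    # 3 query segments, 8-dim embeddings
sources = rng.normal(size=(10, 8))   # 10 source segments
idx, scores = top_k_by_cosine(queries, sources, k=4)
print(idx.shape, scores.shape)  # (3, 4) (3, 4)
```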

ExhaustiveCandidateGenerator

Return all source segments as candidates (no filtering).

locisimiles.pipeline.generator.exhaustive.ExhaustiveCandidateGenerator

Treat every source segment as a candidate for every query segment.

No scoring or ranking is performed. Each Candidate.score is set to 1.0 since all pairs are treated equally.

This generator is typically paired with a judge (e.g. ClassificationJudge) that performs the actual scoring. Best suited for smaller datasets where comparing all pairs is feasible.

Example
from locisimiles.pipeline.generator import ExhaustiveCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate all possible pairs
generator = ExhaustiveCandidateGenerator()
candidates = generator.generate(query=query, source=source)

# Total pairs = len(query) × len(source)
total = sum(len(c) for c in candidates.values())
print(f"{total} candidate pairs")

generate

generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput

Return all source segments as candidates for each query segment.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

source

Source document.

TYPE: Document

RETURNS DESCRIPTION
CandidateGeneratorOutput

Mapping of query segment IDs → lists of Candidate objects, each with score=1.0.
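The exhaustive strategy is simple enough to sketch in a few lines of plain Python. The tuples below stand in for Candidate objects; the point is the cross-product structure and the uniform score of 1.0.

```python
def exhaustive_candidates(query_ids, source_segments):
    """Pair every query segment with every source segment at score 1.0."""
    return {qid: [(seg, 1.0) for seg in source_segments] for qid in query_ids}

out = exhaustive_candidates(["q1", "q2"], ["s1", "s2", "s3"])
total = sum(len(v) for v in out.values())
print(total)  # 6 = 2 queries x 3 sources
```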

RuleBasedCandidateGenerator

Generate candidates using lexical matching and linguistic filters.

locisimiles.pipeline.generator.rule_based.RuleBasedCandidateGenerator

RuleBasedCandidateGenerator(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)

Generate candidates using lexical matching and linguistic filters.

This generator implements a multi-stage rule-based approach to detect potential intertextuality between Latin texts. It combines orthographic normalization, shared-word matching, distance criteria, punctuation agreement (scissa), and optional POS / embedding-based filters.

No neural models are required by default. The optional HTRG (Part-of-Speech) filter needs torch and transformers, and the similarity filter needs spacy with a Latin model.
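The central shared-word stage can be illustrated in isolation. This is a simplified sketch of the min_shared_words criterion only; the real generator additionally normalizes orthography and applies the distance, scissa, and optional POS/embedding filters. The stopword subset is illustrative, not the library's built-in list.

```python
LATIN_STOPWORDS = {"et", "in", "est", "ad", "non", "cum", "ut"}  # illustrative subset

def shared_content_words(query_tokens, source_tokens, stopwords=LATIN_STOPWORDS):
    """Return the non-stopword tokens that two segments have in common."""
    q = {t.lower() for t in query_tokens} - stopwords
    s = {t.lower() for t in source_tokens} - stopwords
    return q & s

def is_match(query_tokens, source_tokens, min_shared_words=2):
    """Flag a pair when it shares at least min_shared_words content words."""
    return len(shared_content_words(query_tokens, source_tokens)) >= min_shared_words

# Shares "arma" and "cano" after stopword removal -> a match at the default threshold.
print(is_match("arma virumque cano et".split(), "cano arma et in".split()))  # True
```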

PARAMETER DESCRIPTION
min_shared_words

Minimum number of shared non-stopwords required for a segment pair to be considered a match.

TYPE: int DEFAULT: 2

min_complura

Minimum number of adjacent tokens required for complura detection.

TYPE: int DEFAULT: 4

max_distance

Maximum allowed distance between shared words within a segment.

TYPE: int DEFAULT: 3

similarity_threshold

Cosine similarity threshold for the optional embedding-based filter.

TYPE: float DEFAULT: 0.3

stopwords

Set of stopwords to exclude from matching. Uses a built-in Latin stopword list if None.

TYPE: Optional[Set[str]] DEFAULT: None

use_htrg

Enable the HTRG (POS-based) filter. Requires torch.

TYPE: bool DEFAULT: False

use_similarity

Enable the word-embedding similarity filter. Requires spacy.

TYPE: bool DEFAULT: False

pos_model

HuggingFace model name for POS tagging.

TYPE: str DEFAULT: 'enelpol/evalatin2022-pos-open'

spacy_model

spaCy model name for word embeddings.

TYPE: str DEFAULT: 'la_core_web_lg'

device

Device for neural models ("cuda", "cpu", or None for auto-detection).

TYPE: Optional[str] DEFAULT: None

Example
from locisimiles.pipeline.generator import RuleBasedCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Create generator
generator = RuleBasedCandidateGenerator(min_shared_words=3)

# Optionally load custom stopwords before generating
generator.load_stopwords("my_stopwords.txt")

# Generate candidates (genre hints improve preprocessing)
candidates = generator.generate(
    query=query,
    source=source,
    query_genre="prose",
    source_genre="poetry",
)

generate

generate(
    *,
    query: Document,
    source: Document,
    top_k: Optional[int] = None,
    query_genre: str = "prose",
    source_genre: str = "poetry",
    threshold: float = 0.5,
    **kwargs: Any,
) -> CandidateGeneratorOutput

Run the rule-based matching pipeline on query and source documents.

PARAMETER DESCRIPTION
query

Query document (text being analyzed for intertextuality).

TYPE: Document

source

Source document (potential origin of quotations).

TYPE: Document

top_k

Maximum matches per query (None = no limit).

TYPE: Optional[int] DEFAULT: None

query_genre

Genre of query ("prose" or "poetry").

TYPE: str DEFAULT: 'prose'

source_genre

Genre of source ("prose" or "poetry").

TYPE: str DEFAULT: 'poetry'

threshold

Not used (included for API compatibility).

TYPE: float DEFAULT: 0.5

RETURNS DESCRIPTION
CandidateGeneratorOutput

Mapping of query segment IDs to lists of Candidate objects.

load_stopwords

load_stopwords(filepath: Union[str, Path]) -> None

Load stopwords from a file (one word per line).

PARAMETER DESCRIPTION
filepath

Path to stopwords file.

TYPE: Union[str, Path]
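The expected file format is plain text with one word per line. A minimal sketch of the parsing side, using a temporary file to stand in for a real stopword file (the helper name read_stopwords is illustrative, not part of the library):

```python
import pathlib
import tempfile

def read_stopwords(filepath):
    """Read a stopword file: one word per line, blank lines ignored."""
    text = pathlib.Path(filepath).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}

# Demo with a temporary file standing in for "my_stopwords.txt".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("et\nin\nest\n")
    path = fh.name

stopwords = read_stopwords(path)
print(sorted(stopwords))  # ['est', 'et', 'in']
```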