Generators¶
Candidate generators narrow the search space by selecting source segments that are most likely to be relevant for each query segment.
All generators inherit from CandidateGeneratorBase and implement a
generate() method returning CandidateGeneratorOutput.
CandidateGeneratorBase¶
locisimiles.pipeline.generator._base.CandidateGeneratorBase
¶
Abstract base class for candidate generators.
A candidate generator narrows the search space by producing a ranked
list of source segments for each query segment. The output is a
CandidateGeneratorOutput — a dictionary mapping query-segment IDs
to lists of Candidate objects, each containing a source segment
and a relevance score.
Subclasses must implement generate().
Available implementations:
- `EmbeddingCandidateGenerator` — semantic similarity via sentence transformers + ChromaDB.
- `ExhaustiveCandidateGenerator` — returns all query–source pairs without filtering.
- `RuleBasedCandidateGenerator` — lexical matching with linguistic filters for Latin texts.
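The output shape described above can be sketched in plain Python. The `Candidate` stand-in below only mirrors the structure (a source segment plus a relevance score); its field names are illustrative, not necessarily those of the actual locisimiles class.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Illustrative stand-in: a source segment paired with a score."""
    segment_id: str
    text: str
    score: float


# A CandidateGeneratorOutput maps each query-segment ID to a ranked
# list of Candidate objects.
output: dict[str, list[Candidate]] = {
    "query-1": [
        Candidate("source-7", "arma virumque cano", 0.93),
        Candidate("source-2", "italiam fato profugus", 0.71),
    ],
    "query-2": [],  # a query segment may have no candidates
}

for query_id, cands in output.items():
    print(query_id, [c.segment_id for c in cands])
```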
generate
abstractmethod
¶
```python
generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput
```
Generate candidate segments from source for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |
| `**kwargs` | `Any` | Generator-specific parameters. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects. |
EmbeddingCandidateGenerator¶
Generate candidates using semantic embedding similarity with sentence transformers and ChromaDB.
locisimiles.pipeline.generator.embedding.EmbeddingCandidateGenerator
¶
```python
EmbeddingCandidateGenerator(
    *,
    embedding_model_name: str = "julian-schelb/multilingual-e5-large-emb-lat-intertext-v1",
    device: str | int | None = None,
)
```
Generate candidates using semantic embedding similarity.
Encodes query and source segments with a sentence-transformer model, builds an ephemeral ChromaDB index on the source embeddings, and retrieves the most similar source segments for each query segment.
The number of candidates per query is controlled by the top_k
parameter passed to generate().
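The ranking idea behind `top_k` can be shown without the neural stack. The pure-Python sketch below scores toy embedding vectors by cosine similarity and keeps the `top_k` best; the real class delegates the encoding to a sentence-transformer and the retrieval to ChromaDB, so none of these helper names come from the library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query_emb: list[float],
    source_embs: dict[str, list[float]],
    top_k: int,
) -> list[tuple[str, float]]:
    """Rank source segments by similarity and keep the top_k best."""
    scored = [(sid, cosine(query_emb, emb)) for sid, emb in source_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


sources = {
    "s1": [1.0, 0.0],
    "s2": [0.7, 0.7],
    "s3": [0.0, 1.0],
}
print(top_k_candidates([1.0, 0.1], sources, top_k=2))
```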
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embedding_model_name` | `str` | HuggingFace model identifier for the sentence-transformer. Defaults to the pre-trained Latin intertextuality model. |
| `device` | `str \| int \| None` | Torch device string (e.g. `"cpu"`) or device index on which to run the model. |
Example

```python
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate candidates
generator = EmbeddingCandidateGenerator(device="cpu")
candidates = generator.generate(query=query, source=source, top_k=10)

# candidates is a dict: {query_id: [Candidate, ...]}
for query_id, cands in candidates.items():
    print(f"{query_id}: {len(cands)} candidates")
```
build_source_index
¶
```python
build_source_index(
    source_segments: Sequence[TextSegment],
    source_embeddings: ndarray,
    collection_name: str = "source_segments",
    batch_size: int = 5000,
) -> Collection
```
Create an ephemeral Chroma collection from segments and embeddings.
generate
¶
```python
generate(
    *,
    query: Document,
    source: Document,
    top_k: int = 100,
    query_prompt_name: str = "query",
    source_prompt_name: str = "match",
    **kwargs: Any,
) -> CandidateGeneratorOutput
```
Generate candidates by embedding similarity.
Encodes all segments, indexes the source embeddings, and returns
the top_k most similar source segments for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |
| `top_k` | `int` | Number of most-similar source segments to return per query segment. |
| `query_prompt_name` | `str` | Prompt name passed to the sentence-transformer for query encoding. |
| `source_prompt_name` | `str` | Prompt name passed to the sentence-transformer for source encoding. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to ranked lists of `Candidate` objects, sorted by descending cosine similarity. |
ExhaustiveCandidateGenerator¶
Return all source segments as candidates (no filtering).
locisimiles.pipeline.generator.exhaustive.ExhaustiveCandidateGenerator
¶
Treat every source segment as a candidate for every query segment.
No scoring or ranking is performed. Each Candidate.score is set
to 1.0 since all pairs are treated equally.
This generator is typically paired with a judge
(e.g. ClassificationJudge) that performs the actual scoring.
Best suited for smaller datasets where comparing all pairs is
feasible.
Example

```python
from locisimiles.pipeline.generator import ExhaustiveCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Generate all possible pairs
generator = ExhaustiveCandidateGenerator()
candidates = generator.generate(query=query, source=source)

# Total pairs = len(query) × len(source)
total = sum(len(c) for c in candidates.values())
print(f"{total} candidate pairs")
```
generate
¶
```python
generate(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput
```
Return all source segments as candidates for each query segment.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document. |
| `source` | `Document` | Source document. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects, each with score `1.0`. |
RuleBasedCandidateGenerator¶
Generate candidates using lexical matching and linguistic filters.
locisimiles.pipeline.generator.rule_based.RuleBasedCandidateGenerator
¶
```python
RuleBasedCandidateGenerator(
    *,
    min_shared_words: int = 2,
    min_complura: int = 4,
    max_distance: int = 3,
    similarity_threshold: float = 0.3,
    stopwords: Optional[Set[str]] = None,
    use_htrg: bool = False,
    use_similarity: bool = False,
    pos_model: str = "enelpol/evalatin2022-pos-open",
    spacy_model: str = "la_core_web_lg",
    device: Optional[str] = None,
)
```
Generate candidates using lexical matching and linguistic filters.
This generator implements a multi-stage rule-based approach to detect potential intertextuality between Latin texts. It combines orthographic normalization, shared-word matching, distance criteria, punctuation agreement (scissa), and optional POS / embedding-based filters.
No neural models are required by default. The optional HTRG
(Part-of-Speech) filter needs torch and transformers, and the
similarity filter needs spacy with a Latin model.
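As a rough illustration of the shared-word stage described above, the sketch below normalizes orthography (folding v/j to u/i, an assumed convention for Latin, not necessarily the library's exact rule), removes stopwords, and checks the overlap against a `min_shared_words` threshold. It is not the library's actual implementation, which adds distance, scissa, and optional POS / embedding filters on top.

```python
def normalize(token: str) -> str:
    """Crude orthographic normalization: lowercase and fold v/j to u/i."""
    return token.lower().replace("v", "u").replace("j", "i")


def shared_nonstop_words(
    query_seg: str, source_seg: str, stopwords: set[str]
) -> set[str]:
    """Non-stopwords that appear (after normalization) in both segments."""
    q = {normalize(t) for t in query_seg.split()} - stopwords
    s = {normalize(t) for t in source_seg.split()} - stopwords
    return q & s


stop = {"et", "in", "est"}
query_seg = "Arma uirumque cano Troiae"
source_seg = "arma virumque et litora"

shared = shared_nonstop_words(query_seg, source_seg, stop)
min_shared_words = 2
print(shared, len(shared) >= min_shared_words)
```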
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `min_shared_words` | `int` | Minimum number of shared non-stopwords required for a segment pair to be considered a match. |
| `min_complura` | `int` | Minimum number of adjacent tokens for complura detection. |
| `max_distance` | `int` | Maximum allowed distance between shared words within a segment. |
| `similarity_threshold` | `float` | Cosine similarity threshold for the optional embedding-based filter. |
| `stopwords` | `Optional[Set[str]]` | Set of stopwords to exclude from matching. Uses a built-in Latin stopword list if `None`. |
| `use_htrg` | `bool` | Enable the HTRG (POS-based) filter. Requires `torch` and `transformers`. |
| `use_similarity` | `bool` | Enable the word-embedding similarity filter. Requires `spacy` with a Latin model. |
| `pos_model` | `str` | HuggingFace model name for POS tagging. |
| `spacy_model` | `str` | spaCy model name for word embeddings. |
| `device` | `Optional[str]` | Device for the optional neural models. |
Example

```python
from locisimiles.pipeline.generator import RuleBasedCandidateGenerator
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Create generator
generator = RuleBasedCandidateGenerator(min_shared_words=3)

# Generate candidates (genre hints improve preprocessing)
candidates = generator.generate(
    query=query,
    source=source,
    query_genre="prose",
    source_genre="poetry",
)

# Optionally load custom stopwords
generator.load_stopwords("my_stopwords.txt")
```
generate
¶
```python
generate(
    *,
    query: Document,
    source: Document,
    top_k: Optional[int] = None,
    query_genre: str = "prose",
    source_genre: str = "poetry",
    threshold: float = 0.5,
    **kwargs: Any,
) -> CandidateGeneratorOutput
```
Run the rule-based matching pipeline on query and source documents.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `query` | `Document` | Query document (the text being analyzed for intertextuality). |
| `source` | `Document` | Source document (the potential origin of quotations). |
| `top_k` | `Optional[int]` | Maximum number of matches per query segment; `None` returns all matches. |
| `query_genre` | `str` | Genre of the query document (`"prose"` or `"poetry"`). |
| `source_genre` | `str` | Genre of the source document (`"prose"` or `"poetry"`). |
| `threshold` | `float` | Not used; included for API compatibility. |

| RETURNS | DESCRIPTION |
|---|---|
| `CandidateGeneratorOutput` | Mapping of query-segment IDs to lists of `Candidate` objects. |
load_stopwords
¶
Load stopwords from a file (one word per line).
| PARAMETER | DESCRIPTION |
|---|---|
| `filepath` | Path to the stopwords file. |