Custom Pipelines¶

Build your own pipeline by combining any generator with any judge.

Pipeline¶

locisimiles.pipeline.pipeline.Pipeline ¶

Pipeline(
    generator: CandidateGeneratorBase,
    judge: CandidateJudgeBase,
)

Compose a candidate generator and a judge into a full pipeline.

This is the recommended way to build custom pipelines. Any CandidateGeneratorBase can be paired with any CandidateJudgeBase.

PARAMETER	DESCRIPTION
`generator`	Candidate-generation component. TYPE: `CandidateGeneratorBase`
`judge`	Scoring / classification component. TYPE: `CandidateJudgeBase`

Example

from locisimiles.pipeline import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Build a custom pipeline
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)

# Save the results of the last run() call
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

generate_candidates ¶

generate_candidates(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput

Run only the candidate-generation stage.

PARAMETER	DESCRIPTION
`query`	Query document. TYPE: `Document`
`source`	Source document. TYPE: `Document`
`**kwargs`	Forwarded to the generator's `generate()` method. TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`CandidateGeneratorOutput`	`CandidateGeneratorOutput` mapping query IDs → `Candidate` lists.

judge_candidates ¶

judge_candidates(
    *,
    query: Document,
    candidates: CandidateGeneratorOutput,
    **kwargs: Any,
) -> CandidateJudgeOutput

Run only the judgment stage on pre-generated candidates.

PARAMETER	DESCRIPTION
`query`	Query document. TYPE: `Document`
`candidates`	Output from a candidate generator. TYPE: `CandidateGeneratorOutput`
`**kwargs`	Forwarded to the judge's `judge()` method. TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`CandidateJudgeOutput`	`CandidateJudgeOutput` mapping query IDs → `CandidateJudge` lists.

run ¶

run(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateJudgeOutput

Run both stages: generate candidates, then judge them.

All kwargs are forwarded to both the generator and the judge; each component ignores keys it does not recognise.

PARAMETER	DESCRIPTION
`query`	Query document. TYPE: `Document`
`source`	Source document. TYPE: `Document`
`**kwargs`	Forwarded to both stages. TYPE: `Any` DEFAULT: `{}`

RETURNS	DESCRIPTION
`CandidateJudgeOutput`	`CandidateJudgeOutput` with judgment scores.
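The forwarding rule for `run()` can be illustrated with a minimal, self-contained sketch. `ToyGenerator`, `ToyJudge`, and the free function `run` below are hypothetical stand-ins, not the library's classes, and absorbing unknown keys through `**_ignored` is an assumption about how a component might ignore them. The sketch only shows one set of keyword arguments being handed to both stages while each stage uses the keys it knows.

```python
from typing import Any, Dict, List, Tuple


class ToyGenerator:
    """Hypothetical stand-in for a candidate generator."""

    def generate(
        self, *, query: List[str], source: List[str], top_k: int = 2, **_ignored: Any
    ) -> Dict[str, List[str]]:
        # Only `top_k` is recognised; every other keyword lands in `_ignored`.
        return {q: source[:top_k] for q in query}


class ToyJudge:
    """Hypothetical stand-in for a judge."""

    def judge(
        self,
        *,
        query: List[str],
        candidates: Dict[str, List[str]],
        threshold: float = 0.5,
        **_ignored: Any,
    ) -> Dict[str, List[Tuple[str, float]]]:
        # Only `threshold` is recognised; `top_k` is silently absorbed.
        judged = {}
        for q in query:
            scored = [(c, 1.0 if c == q else 0.0) for c in candidates[q]]
            judged[q] = [(c, s) for c, s in scored if s >= threshold]
        return judged


def run(generator: ToyGenerator, judge: ToyJudge, *, query, source, **kwargs: Any):
    # The same kwargs go to both stages, mirroring the documented behaviour.
    candidates = generator.generate(query=query, source=source, **kwargs)
    return judge.judge(query=query, candidates=candidates, **kwargs)


results = run(
    ToyGenerator(),
    ToyJudge(),
    query=["a"],
    source=["a", "b", "c"],
    top_k=2,        # used by the generator, ignored by the judge
    threshold=0.5,  # used by the judge, ignored by the generator
)
```

One consequence of this design is that a misspelled keyword is ignored by both stages rather than raising an error, so it is worth double-checking key names.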

to_csv ¶

to_csv(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
) -> None

Save pipeline results to a CSV file.

If results is None, the results from the last run() call are used.

Columns: query_id, source_id, source_text, candidate_score, judgment_score.

PARAMETER	DESCRIPTION
`path`	Destination file path (e.g. `"results.csv"`). TYPE: `Union[str, Path]`
`results`	Explicit results to save. Defaults to the last `run()` output. TYPE: `CandidateJudgeOutput \| None` DEFAULT: `None`

RAISES	DESCRIPTION
`ValueError`	If no results are available.

Example

results = pipeline.run(query=query_doc, source=source_doc)
pipeline.to_csv("results.csv")

to_json ¶

to_json(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
    *,
    indent: int = 2,
) -> None

Save pipeline results to a JSON file.

If results is None, the results from the last run() call are used.

Produces a JSON object keyed by query segment ID, where each value is a list of match objects with source_id, source_text, candidate_score, and judgment_score.

PARAMETER	DESCRIPTION
`path`	Destination file path (e.g. `"results.json"`). TYPE: `Union[str, Path]`
`results`	Explicit results to save. Defaults to the last `run()` output. TYPE: `CandidateJudgeOutput \| None` DEFAULT: `None`
`indent`	JSON indentation level (default `2`). TYPE: `int` DEFAULT: `2`

RAISES	DESCRIPTION
`ValueError`	If no results are available.

Example

results = pipeline.run(query=query_doc, source=source_doc)
pipeline.to_json("results.json")

Type Definitions¶

Data classes and type aliases used across all pipelines.

locisimiles.pipeline._types ¶

Shared type definitions and utilities for pipeline modules.

This module defines the common data structures used across all pipeline implementations for representing detection results.

Pipeline Architecture¶

Every pipeline follows a two-phase pattern:

1. Candidate Generation: narrows the search space, producing a CandidateGeneratorOutput (mapping of query IDs → Candidate lists).
2. Judgment: scores or classifies candidate pairs, producing a CandidateJudgeOutput (mapping of query IDs → CandidateJudge lists).
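The two phases above can be sketched without the library. Everything in the block is a local stand-in: `Segment`, `Candidate`, and `CandidateJudge` only mirror the documented field names, and the shared-word-ratio generator and threshold judge are illustrative choices, not the behaviour of any shipped component.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a text segment (ID plus raw text)."""
    id: str
    text: str


@dataclass
class Candidate:
    segment: Segment
    score: float


@dataclass
class CandidateJudge:
    segment: Segment
    candidate_score: Optional[float]
    judgment_score: float


def shared_word_ratio(a: str, b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def generate(query: List[Segment], source: List[Segment], top_k: int) -> Dict[str, List[Candidate]]:
    """Phase 1: keep the top_k highest-scoring source segments per query."""
    output = {}
    for q in query:
        scored = [Candidate(s, shared_word_ratio(q.text, s.text)) for s in source]
        scored.sort(key=lambda c: c.score, reverse=True)
        output[q.id] = scored[:top_k]
    return output


def judge(candidates: Dict[str, List[Candidate]], threshold: float) -> Dict[str, List[CandidateJudge]]:
    """Phase 2: turn each candidate score into a binary 1.0/0.0 decision."""
    return {
        qid: [
            CandidateJudge(c.segment, c.score, 1.0 if c.score >= threshold else 0.0)
            for c in cands
        ]
        for qid, cands in candidates.items()
    }


query = [Segment("q1", "arma virumque cano")]
source = [
    Segment("s1", "arma virumque cano troiae"),
    Segment("s2", "gallia est omnis divisa"),
    Segment("s3", "cano arma"),
]
candidates = generate(query, source, top_k=2)  # s1 (0.75) and s3 (2/3) survive
results = judge(candidates, threshold=0.7)     # only s1 is judged positive
```

Keeping the phases separate means the cheap first pass bounds how many pairs the second, usually more expensive, pass has to score.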

Candidate `dataclass` ¶

Candidate(segment: TextSegment, score: float)

A single candidate match produced by a candidate-generation stage.

ATTRIBUTE	DESCRIPTION
`segment`	The matching source segment. TYPE: `TextSegment`
`score`	Relevance score (e.g. cosine similarity, shared-word ratio). TYPE: `float`

Example

from locisimiles.pipeline import Candidate
from locisimiles.document import TextSegment

candidate = Candidate(
    segment=TextSegment("Arma virumque cano", seg_id="verg. aen. 1.1"),
    score=0.85,
)
print(candidate.segment.id)  # "verg. aen. 1.1"
print(candidate.score)       # 0.85

CandidateJudge `dataclass` ¶

CandidateJudge(
    segment: TextSegment,
    candidate_score: Optional[float],
    judgment_score: float,
)

A single scored candidate after the judgment (classification / filtering) stage.

ATTRIBUTE	DESCRIPTION
`segment`	The matching source segment. TYPE: `TextSegment`
`candidate_score`	Score from candidate generation (`None` when the generator is exhaustive, i.e. all pairs are candidates). TYPE: `Optional[float]`
`judgment_score`	Final judgment value — e.g. a classification probability, a binary 1.0/0.0 decision, or a rule-based score. TYPE: `float`

Example

from locisimiles.pipeline import CandidateJudge
from locisimiles.document import TextSegment

result = CandidateJudge(
    segment=TextSegment("Arma virumque cano", seg_id="verg. aen. 1.1"),
    candidate_score=0.85,
    judgment_score=0.95,
)
print(result.judgment_score)  # 0.95

CandidateGeneratorOutput `module-attribute` ¶

CandidateGeneratorOutput = Dict[str, List[Candidate]]

Mapping from query segment IDs → ranked lists of Candidate objects.

This is the output type of every candidate-generation stage.

CandidateJudgeOutput `module-attribute` ¶

CandidateJudgeOutput = Dict[str, List[CandidateJudge]]

Mapping from query segment IDs → lists of CandidateJudge objects.

This is the standard output type of every pipeline's run() method and is consumed by the evaluator.
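Because the output is a plain dictionary of lists, downstream code can walk it directly. The sketch below is an illustration only: `Match` is a hypothetical flattened stand-in (the real `CandidateJudge` holds a `segment` rather than a bare ID), and `positives` and its `threshold` parameter are not part of the library. The scores are taken from the `pretty_print` example on this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Match:
    """Hypothetical stand-in carrying the documented score fields."""
    source_id: str
    candidate_score: Optional[float]
    judgment_score: float


def positives(
    results: Dict[str, List[Match]], threshold: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Flatten the mapping into (query_id, source_id, judgment_score) rows
    whose judgment score is at or above `threshold`, best first."""
    rows = []
    for query_id, matches in results.items():
        for match in matches:
            if match.judgment_score >= threshold:
                rows.append((query_id, match.source_id, match.judgment_score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


results = {
    "hier. adv. iovin. 1.41": [
        Match("verg. aen. 1.1", 0.823, 0.951),
        Match("verg. aen. 2.45", 0.654, 0.234),
    ],
}
hits = positives(results)
```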

CandidateJudgeInput `module-attribute` ¶

CandidateJudgeInput = CandidateGeneratorOutput

Alias: the judge receives exactly what the generator produced.

pretty_print ¶

pretty_print(results: CandidateJudgeOutput) -> None

Print pipeline results in a human-readable format.

Displays each query segment and its candidate matches with candidate scores and judgment scores.

PARAMETER	DESCRIPTION
`results`	Pipeline output in `CandidateJudgeOutput` format. TYPE: `CandidateJudgeOutput`

Example

from locisimiles.pipeline import pretty_print

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
pretty_print(results)

# Output:
# ▶ Query segment 'hier. adv. iovin. 1.41':
#   verg. aen. 1.1              candidate=+0.823  judgment=0.951
#   verg. aen. 2.45             candidate=+0.654  judgment=0.234

results_to_csv ¶

results_to_csv(
    results: CandidateJudgeOutput, path: Union[str, Path]
) -> None

Save pipeline results to a CSV file.

Writes one row per query-source match with the following columns:

query_id - identifier of the query segment.
source_id - identifier of the matching source segment.
source_text - raw text of the source segment.
candidate_score - score from the candidate-generation stage (empty when not available).
judgment_score - final judgment / classification score.

PARAMETER	DESCRIPTION
`results`	Pipeline output in `CandidateJudgeOutput` format. TYPE: `CandidateJudgeOutput`
`path`	Destination file path (e.g. `"results.csv"`). TYPE: `Union[str, Path]`

Example

from locisimiles.pipeline import results_to_csv

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
results_to_csv(results, "results.csv")
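The column layout described above can be reproduced with the standard `csv` module. This is a sketch of the documented layout under stated assumptions, not the library's implementation: `write_rows` and the tuple-based `Row` type are hypothetical, and each match is represented as a plain tuple instead of a `CandidateJudge` object.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO, Tuple

# Stand-in for one match: (source_id, source_text, candidate_score, judgment_score)
Row = Tuple[str, str, Optional[float], float]

COLUMNS = ["query_id", "source_id", "source_text", "candidate_score", "judgment_score"]


def write_rows(results: Dict[str, List[Row]], handle: TextIO) -> None:
    """Write one CSV row per query-source match."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    for query_id, matches in results.items():
        for source_id, source_text, candidate_score, judgment_score in matches:
            writer.writerow([
                query_id,
                source_id,
                source_text,
                # Leave the cell empty when no candidate score is available.
                "" if candidate_score is None else candidate_score,
                judgment_score,
            ])


buffer = io.StringIO()
write_rows({"q1": [("s1", "Arma virumque cano", None, 0.95)]}, buffer)
csv_text = buffer.getvalue()
```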

results_to_json ¶

results_to_json(
    results: CandidateJudgeOutput,
    path: Union[str, Path],
    *,
    indent: int = 2,
) -> None

Save pipeline results to a JSON file.

Produces a JSON object keyed by query segment ID. Each value is a list of match objects with source_id, source_text, candidate_score, and judgment_score.

PARAMETER	DESCRIPTION
`results`	Pipeline output in `CandidateJudgeOutput` format. TYPE: `CandidateJudgeOutput`
`path`	Destination file path (e.g. `"results.json"`). TYPE: `Union[str, Path]`
`indent`	JSON indentation level (default `2`). TYPE: `int` DEFAULT: `2`

Example

from locisimiles.pipeline import results_to_json

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
results_to_json(results, "results.json")
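The JSON shape described above can be sketched with the standard `json` module. `to_json_text` and the tuple-based `Row` type are hypothetical, and serialising a missing candidate score as `null` is an assumption; the library's own output may differ in such details.

```python
import json
from typing import Dict, List, Optional, Tuple

# Stand-in for one match: (source_id, source_text, candidate_score, judgment_score)
Row = Tuple[str, str, Optional[float], float]


def to_json_text(results: Dict[str, List[Row]], indent: int = 2) -> str:
    """Build an object keyed by query segment ID, each value a list of match objects."""
    payload = {
        query_id: [
            {
                "source_id": source_id,
                "source_text": source_text,
                "candidate_score": candidate_score,  # None becomes JSON null
                "judgment_score": judgment_score,
            }
            for source_id, source_text, candidate_score, judgment_score in matches
        ]
        for query_id, matches in results.items()
    }
    # ensure_ascii=False keeps non-ASCII text readable in the output.
    return json.dumps(payload, indent=indent, ensure_ascii=False)


json_text = to_json_text({"q1": [("s1", "Arma virumque cano", None, 0.95)]})
decoded = json.loads(json_text)
```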
