Custom Pipelines

Build your own pipeline by combining any generator with any judge.

Pipeline

locisimiles.pipeline.pipeline.Pipeline

Pipeline(
    generator: CandidateGeneratorBase,
    judge: CandidateJudgeBase,
)

Compose a candidate generator and a judge into a full pipeline.

This is the recommended way to build custom pipelines. Any CandidateGeneratorBase can be paired with any CandidateJudgeBase.

PARAMETER DESCRIPTION
generator

Candidate-generation component.

TYPE: CandidateGeneratorBase

judge

Scoring / classification component.

TYPE: CandidateJudgeBase

Example
from locisimiles.pipeline import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge
from locisimiles.document import Document

# Load documents
query = Document("query.csv")
source = Document("source.csv")

# Build a custom pipeline
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)

# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)

# Save results
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")

generate_candidates

generate_candidates(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput

Run only the candidate-generation stage.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

source

Source document.

TYPE: Document

**kwargs

Forwarded to the generator's generate() method.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
CandidateGeneratorOutput

CandidateGeneratorOutput mapping query IDs → Candidate lists.

judge_candidates

judge_candidates(
    *,
    query: Document,
    candidates: CandidateGeneratorOutput,
    **kwargs: Any,
) -> CandidateJudgeOutput

Run only the judgment stage on pre-generated candidates.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

candidates

Output from a candidate generator.

TYPE: CandidateGeneratorOutput

**kwargs

Forwarded to the judge's judge() method.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
CandidateJudgeOutput

CandidateJudgeOutput mapping query IDs → CandidateJudge lists.

run

run(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateJudgeOutput

Run both stages: generate candidates then judge them.

All kwargs are forwarded to both the generator and the judge; each component ignores keys it does not recognise.

PARAMETER DESCRIPTION
query

Query document.

TYPE: Document

source

Source document.

TYPE: Document

**kwargs

Forwarded to both stages.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
CandidateJudgeOutput

CandidateJudgeOutput with judgment scores.
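
The forwarding rule described for run() can be illustrated with a minimal sketch. ToyGenerator, ToyJudge and run_both are hypothetical stand-ins, not the library's classes, and the threshold keyword is invented for illustration. The sketch assumes that each component names the keywords it uses and absorbs everything else through **kwargs, which is one common way to get the "ignore unknown keys" behaviour; the library may do this differently.

```python
from typing import Any, Dict, List


class ToyGenerator:
    """Hypothetical stand-in for a candidate generator."""

    def generate(self, *, query: List[str], source: List[str],
                 top_k: int = 5, **kwargs: Any) -> Dict[str, List[str]]:
        # Uses top_k; any other keyword (e.g. threshold) lands in kwargs and is ignored.
        return {q: source[:top_k] for q in query}


class ToyJudge:
    """Hypothetical stand-in for a judge."""

    def judge(self, *, query: List[str], candidates: Dict[str, List[str]],
              threshold: float = 0.5, **kwargs: Any) -> Dict[str, List[str]]:
        # Uses threshold; top_k lands in kwargs and is ignored.
        # Toy rule: keep a candidate when the share of query words it contains
        # reaches the threshold. The query argument itself is unused here
        # because the candidate mapping is already keyed by query text.
        kept: Dict[str, List[str]] = {}
        for q, cands in candidates.items():
            q_words = set(q.split())
            kept[q] = [
                c for c in cands
                if len(q_words & set(c.split())) / max(len(q_words), 1) >= threshold
            ]
        return kept


def run_both(generator: ToyGenerator, judge: ToyJudge, *,
             query: List[str], source: List[str], **kwargs: Any) -> Dict[str, List[str]]:
    # The same kwargs go to both stages, unchanged.
    candidates = generator.generate(query=query, source=source, **kwargs)
    return judge.judge(query=query, candidates=candidates, **kwargs)


result = run_both(
    ToyGenerator(),
    ToyJudge(),
    query=["arma virumque cano"],
    source=["arma virumque", "cano troiae", "italiam fato"],
    top_k=2,        # consumed by the generator only
    threshold=0.5,  # consumed by the judge only
)
print(result)
```

A practical consequence of this design is that a misspelled keyword would be silently ignored by both stages rather than raising an error, so keyword names are worth double-checking.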

to_csv

to_csv(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
) -> None

Save pipeline results to a CSV file.

If results is None, the results from the last run() call are used.

Columns: query_id, source_id, source_text, candidate_score, judgment_score.

PARAMETER DESCRIPTION
path

Destination file path (e.g. "results.csv").

TYPE: Union[str, Path]

results

Explicit results to save. Defaults to the last run() output.

TYPE: CandidateJudgeOutput | None DEFAULT: None

RAISES DESCRIPTION
ValueError

If no results are available.

Example
results = pipeline.run(query=query_doc, source=source_doc)
pipeline.to_csv("results.csv")
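
The "fall back to the last run() output, otherwise raise ValueError" rule can be sketched as follows. ToyPipeline, its _last_results attribute and the _resolve helper are hypothetical names chosen for illustration; this is one plausible way to implement the documented behaviour, not the library's actual code.

```python
from typing import Dict, List, Optional

Results = Dict[str, List[float]]  # simplified stand-in for CandidateJudgeOutput


class ToyPipeline:
    """Hypothetical stand-in showing the 'fall back to the last run' rule."""

    def __init__(self) -> None:
        self._last_results: Optional[Results] = None

    def run(self) -> Results:
        # A real pipeline would generate and judge candidates here.
        self._last_results = {"q1": [0.95]}
        return self._last_results

    def _resolve(self, results: Optional[Results]) -> Results:
        # Explicit results win; otherwise use the cached output of run().
        if results is not None:
            return results
        if self._last_results is None:
            raise ValueError("No results available: call run() first or pass results.")
        return self._last_results


pipeline = ToyPipeline()

# Before any run() there is nothing to save, so resolution fails.
try:
    pipeline._resolve(None)
    error = None
except ValueError as exc:
    error = str(exc)

pipeline.run()
cached = pipeline._resolve(None)             # falls back to the last run
explicit = pipeline._resolve({"q2": [0.1]})  # explicit argument takes precedence
print(error, cached, explicit)
```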

to_json

to_json(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
    *,
    indent: int = 2,
) -> None

Save pipeline results to a JSON file.

If results is None, the results from the last run() call are used.

Produces a JSON object keyed by query segment ID, where each value is a list of match objects with source_id, source_text, candidate_score, and judgment_score.

PARAMETER DESCRIPTION
path

Destination file path (e.g. "results.json").

TYPE: Union[str, Path]

results

Explicit results to save. Defaults to the last run() output.

TYPE: CandidateJudgeOutput | None DEFAULT: None

indent

JSON indentation level (default 2).

TYPE: int DEFAULT: 2

RAISES DESCRIPTION
ValueError

If no results are available.

Example
results = pipeline.run(query=query_doc, source=source_doc)
pipeline.to_json("results.json")

Type Definitions

Data classes and type aliases used across all pipelines.

locisimiles.pipeline._types

Shared type definitions and utilities for pipeline modules.

This module defines the common data structures used across all pipeline implementations for representing detection results.

Pipeline Architecture

Every pipeline follows a two-phase pattern:

  1. Candidate Generation — narrows the search space, producing a CandidateGeneratorOutput (mapping of query IDs → Candidate lists).
  2. Judgment — scores or classifies candidate pairs, producing a CandidateJudgeOutput (mapping of query IDs → CandidateJudge lists).
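
The two phases above can be sketched end to end with toy components. Everything in this block is a self-contained illustration: ToySegment, ToyCandidate and ToyJudged are hypothetical stand-ins that only mirror the field names of TextSegment, Candidate and CandidateJudge, the generator scores by shared-word ratio, and the judge makes a binary 1.0/0.0 decision. The segment IDs and texts are made-up sample data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToySegment:  # stand-in for TextSegment
    id: str
    text: str


@dataclass
class ToyCandidate:  # stand-in for Candidate
    segment: ToySegment
    score: float


@dataclass
class ToyJudged:  # stand-in for CandidateJudge
    segment: ToySegment
    candidate_score: Optional[float]
    judgment_score: float


def shared_word_ratio(a: str, b: str) -> float:
    """Fraction of the words in `a` that also occur in `b`."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0


def generate(query: List[ToySegment], source: List[ToySegment],
             top_k: int = 2) -> Dict[str, List[ToyCandidate]]:
    """Phase 1: narrow the search space to the top_k source segments per query."""
    out: Dict[str, List[ToyCandidate]] = {}
    for q in query:
        scored = [ToyCandidate(s, shared_word_ratio(q.text, s.text)) for s in source]
        scored.sort(key=lambda c: c.score, reverse=True)
        out[q.id] = scored[:top_k]
    return out


def judge(candidates: Dict[str, List[ToyCandidate]],
          threshold: float = 0.5) -> Dict[str, List[ToyJudged]]:
    """Phase 2: turn each candidate score into a binary 1.0/0.0 decision."""
    return {
        qid: [ToyJudged(c.segment, c.score, 1.0 if c.score >= threshold else 0.0)
              for c in cands]
        for qid, cands in candidates.items()
    }


query = [ToySegment("q1", "arma virumque cano")]
source = [
    ToySegment("s1", "arma virumque cano troiae"),
    ToySegment("s2", "italiam fato profugus"),
    ToySegment("s3", "cano carmina"),
]
results = judge(generate(query, source))
for match in results["q1"]:
    print(match.segment.id, match.candidate_score, match.judgment_score)
```

Keeping the intermediate mapping between the two functions is what makes it possible to inspect or filter candidates before judging them, which is the use case for calling generate_candidates() and judge_candidates() separately.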

Candidate dataclass

Candidate(segment: TextSegment, score: float)

A single candidate match produced by a candidate-generation stage.

ATTRIBUTE DESCRIPTION
segment

The matching source segment.

TYPE: TextSegment

score

Relevance score (e.g. cosine similarity, shared-word ratio).

TYPE: float

Example
from locisimiles.pipeline import Candidate
from locisimiles.document import TextSegment

candidate = Candidate(
    segment=TextSegment("Arma virumque cano", seg_id="verg. aen. 1.1"),
    score=0.85,
)
print(candidate.segment.id)  # "verg. aen. 1.1"
print(candidate.score)       # 0.85

CandidateJudge dataclass

CandidateJudge(
    segment: TextSegment,
    candidate_score: Optional[float],
    judgment_score: float,
)

A single scored candidate after the judgment (classification / filtering) stage.

ATTRIBUTE DESCRIPTION
segment

The matching source segment.

TYPE: TextSegment

candidate_score

Score from candidate generation (None when the generator is exhaustive, i.e. all pairs are candidates).

TYPE: Optional[float]

judgment_score

Final judgment value — e.g. a classification probability, a binary 1.0/0.0 decision, or a rule-based score.

TYPE: float

Example
from locisimiles.pipeline import CandidateJudge
from locisimiles.document import TextSegment

result = CandidateJudge(
    segment=TextSegment("Arma virumque cano", seg_id="verg. aen. 1.1"),
    candidate_score=0.85,
    judgment_score=0.95,
)
print(result.judgment_score)  # 0.95

CandidateGeneratorOutput module-attribute

CandidateGeneratorOutput = Dict[str, List[Candidate]]

Mapping from query segment IDs → ranked lists of Candidate objects.

This is the output type of every candidate-generation stage.

CandidateJudgeOutput module-attribute

CandidateJudgeOutput = Dict[str, List[CandidateJudge]]

Mapping from query segment IDs → lists of CandidateJudge objects.

This is the standard output type of every pipeline's run() method and is consumed by the evaluator.
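
Because the output is a plain dictionary of lists, post-processing needs no special API. The sketch below keeps only matches whose judgment_score reaches a cut-off. ToyJudged is a simplified, hypothetical stand-in for CandidateJudge (the segment is reduced to an ID string), keep_positive is not a library function, the 0.5 threshold is an arbitrary choice, and the second query ID is invented sample data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyJudged:  # simplified stand-in for CandidateJudge
    segment_id: str
    candidate_score: Optional[float]
    judgment_score: float


def keep_positive(results: Dict[str, List[ToyJudged]],
                  threshold: float = 0.5) -> Dict[str, List[ToyJudged]]:
    """Keep matches at or above the threshold; drop queries left with none."""
    filtered: Dict[str, List[ToyJudged]] = {}
    for query_id, matches in results.items():
        kept = [m for m in matches if m.judgment_score >= threshold]
        if kept:
            filtered[query_id] = kept
    return filtered


results = {
    "hier. adv. iovin. 1.41": [
        ToyJudged("verg. aen. 1.1", 0.823, 0.951),
        ToyJudged("verg. aen. 2.45", 0.654, 0.234),
    ],
    "hier. adv. iovin. 1.42": [  # invented ID, no match above the threshold
        ToyJudged("verg. aen. 2.45", 0.410, 0.120),
    ],
}
positives = keep_positive(results)
for query_id, matches in positives.items():
    print(query_id, [m.segment_id for m in matches])
```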

CandidateJudgeInput module-attribute

CandidateJudgeInput = CandidateGeneratorOutput

Alias: the judge receives exactly what the generator produced.

pretty_print

pretty_print(results: CandidateJudgeOutput) -> None

Print pipeline results in a human-readable format.

Displays each query segment and its candidate matches with candidate scores and judgment scores.

PARAMETER DESCRIPTION
results

Pipeline output in CandidateJudgeOutput format.

TYPE: CandidateJudgeOutput

Example
from locisimiles.pipeline import pretty_print

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
pretty_print(results)

# Output:
# ▶ Query segment 'hier. adv. iovin. 1.41':
#   verg. aen. 1.1              candidate=+0.823  judgment=0.951
#   verg. aen. 2.45             candidate=+0.654  judgment=0.234

results_to_csv

results_to_csv(
    results: CandidateJudgeOutput, path: Union[str, Path]
) -> None

Save pipeline results to a CSV file.

Writes one row per query-source match with the following columns:

  • query_id - identifier of the query segment.
  • source_id - identifier of the matching source segment.
  • source_text - raw text of the source segment.
  • candidate_score - score from the candidate-generation stage (empty when not available).
  • judgment_score - final judgment / classification score.

PARAMETER DESCRIPTION
results

Pipeline output in CandidateJudgeOutput format.

TYPE: CandidateJudgeOutput

path

Destination file path (e.g. "results.csv").

TYPE: Union[str, Path]

Example
from locisimiles.pipeline import results_to_csv

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
results_to_csv(results, "results.csv")
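
The column layout described above can be reproduced with the standard csv module. This sketch does not call the library; it writes two illustrative rows to an in-memory buffer and reads them back, assuming that a missing candidate_score is written as an empty cell, as the column description suggests. The second row's source text is a placeholder.

```python
import csv
import io

COLUMNS = ["query_id", "source_id", "source_text", "candidate_score", "judgment_score"]

# Illustrative matches: (query_id, source_id, source_text, candidate_score, judgment_score)
matches = [
    ("hier. adv. iovin. 1.41", "verg. aen. 1.1", "Arma virumque cano", 0.823, 0.951),
    ("hier. adv. iovin. 1.41", "verg. aen. 2.45", "(source text)", None, 0.234),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
for query_id, source_id, text, cand, judgment in matches:
    # A missing candidate score becomes an empty cell.
    writer.writerow([query_id, source_id, text, "" if cand is None else cand, judgment])

# Reading the file back: every value arrives as a string.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
for row in rows:
    cand = float(row["candidate_score"]) if row["candidate_score"] else None
    print(row["source_id"], cand, float(row["judgment_score"]))
```

When loading such a file, the empty cell has to be handled before converting to float, as the last loop does.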

results_to_json

results_to_json(
    results: CandidateJudgeOutput,
    path: Union[str, Path],
    *,
    indent: int = 2,
) -> None

Save pipeline results to a JSON file.

Produces a JSON object keyed by query segment ID. Each value is a list of match objects with source_id, source_text, candidate_score, and judgment_score.

PARAMETER DESCRIPTION
results

Pipeline output in CandidateJudgeOutput format.

TYPE: CandidateJudgeOutput

path

Destination file path (e.g. "results.json").

TYPE: Union[str, Path]

indent

JSON indentation level (default 2).

TYPE: int DEFAULT: 2

Example
from locisimiles.pipeline import results_to_json

results = pipeline.run(query=query_doc, source=source_doc, top_k=5)
results_to_json(results, "results.json")
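
The JSON layout described above can be illustrated with the standard json module. The payload below is hand-written sample data, not output from the library, and it assumes that a missing candidate_score is serialised as null; the second match's source text is a placeholder.

```python
import json

# One key per query segment ID, each holding a list of match objects.
payload = {
    "hier. adv. iovin. 1.41": [
        {
            "source_id": "verg. aen. 1.1",
            "source_text": "Arma virumque cano",
            "candidate_score": 0.823,
            "judgment_score": 0.951,
        },
        {
            "source_id": "verg. aen. 2.45",
            "source_text": "(source text)",
            "candidate_score": None,  # assumed to be written as null
            "judgment_score": 0.234,
        },
    ]
}

text = json.dumps(payload, indent=2, ensure_ascii=False)
print(text)

# Reading the data back and picking the strongest match per query.
loaded = json.loads(text)
best = {
    query_id: max(matches, key=lambda m: m["judgment_score"])
    for query_id, matches in loaded.items()
}
print(best["hier. adv. iovin. 1.41"]["source_id"])
```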