Custom Pipelines
Build your own pipeline by combining any generator with any judge.
Pipeline
locisimiles.pipeline.pipeline.Pipeline
Pipeline(
    generator: CandidateGeneratorBase,
    judge: CandidateJudgeBase,
)
Compose a candidate generator and a judge into a full pipeline.
This is the recommended way to build custom pipelines. Any
CandidateGeneratorBase can be paired with any CandidateJudgeBase.
| PARAMETER | DESCRIPTION |
|---|---|
| generator | Candidate-generation component. TYPE: CandidateGeneratorBase |
| judge | Scoring / classification component. TYPE: CandidateJudgeBase |
Example
from locisimiles.pipeline import Pipeline
from locisimiles.pipeline.generator import EmbeddingCandidateGenerator
from locisimiles.pipeline.judge import ClassificationJudge
from locisimiles.document import Document
# Load documents
query = Document("query.csv")
source = Document("source.csv")
# Build a custom pipeline
pipeline = Pipeline(
    generator=EmbeddingCandidateGenerator(device="cpu"),
    judge=ClassificationJudge(device="cpu"),
)
# Run pipeline
results = pipeline.run(query=query, source=source, top_k=10)
# Save results
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")
generate_candidates
generate_candidates(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateGeneratorOutput
Run only the candidate-generation stage.
| PARAMETER | DESCRIPTION |
|---|---|
| query | Query document. TYPE: Document |
| source | Source document. TYPE: Document |
| **kwargs | Forwarded to the generator. TYPE: Any |

| RETURNS | DESCRIPTION |
|---|---|
| CandidateGeneratorOutput | Mapping from query segment IDs to ranked lists of Candidate objects. |
judge_candidates
judge_candidates(
    *,
    query: Document,
    candidates: CandidateGeneratorOutput,
    **kwargs: Any,
) -> CandidateJudgeOutput
Run only the judgment stage on pre-generated candidates.
| PARAMETER | DESCRIPTION |
|---|---|
| query | Query document. TYPE: Document |
| candidates | Output from a candidate generator. TYPE: CandidateGeneratorOutput |
| **kwargs | Forwarded to the judge. TYPE: Any |

| RETURNS | DESCRIPTION |
|---|---|
| CandidateJudgeOutput | Mapping from query segment IDs to lists of CandidateJudge objects. |
run
run(
    *, query: Document, source: Document, **kwargs: Any
) -> CandidateJudgeOutput
Run both stages: generate candidates then judge them.
All kwargs are forwarded to both the generator and the judge; each component ignores keys it does not recognise.
| PARAMETER | DESCRIPTION |
|---|---|
| query | Query document. TYPE: Document |
| source | Source document. TYPE: Document |
| **kwargs | Forwarded to both stages. TYPE: Any |

| RETURNS | DESCRIPTION |
|---|---|
| CandidateJudgeOutput | Mapping from query segment IDs to lists of CandidateJudge objects. |
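The forwarding rule described for run() can be illustrated with stand-in components. Everything in the sketch below (ToyGenerator, ToyJudge, ToyPipeline, the word-overlap score, the threshold keyword) is hypothetical and not part of locisimiles. It assumes that each component accepts **kwargs and silently drops the keys it does not use, which is one possible way to get the "ignores keys it does not recognise" behaviour; the library itself may do this differently.

```python
from typing import Any, Dict, List, Tuple


class ToyGenerator:
    """Hypothetical stand-in for a candidate generator."""

    def generate(
        self,
        *,
        query: Dict[str, str],
        source: Dict[str, str],
        top_k: int = 2,
        **_ignored: Any,  # keys meant for the judge end up here and are dropped
    ) -> Dict[str, List[Tuple[str, float]]]:
        output = {}
        for qid, qtext in query.items():
            qwords = set(qtext.split())
            scored = []
            for sid, stext in source.items():
                swords = set(stext.split())
                # Word-overlap ratio (Jaccard) as a toy relevance score.
                score = len(qwords & swords) / max(len(qwords | swords), 1)
                scored.append((sid, score))
            scored.sort(key=lambda pair: pair[1], reverse=True)
            output[qid] = scored[:top_k]
        return output


class ToyJudge:
    """Hypothetical stand-in for a judge."""

    def judge(
        self,
        *,
        query: Dict[str, str],
        candidates: Dict[str, List[Tuple[str, float]]],
        threshold: float = 0.5,
        **_ignored: Any,  # keys meant for the generator end up here and are dropped
    ) -> Dict[str, List[Tuple[str, float, float]]]:
        return {
            qid: [(sid, score, 1.0 if score >= threshold else 0.0) for sid, score in cands]
            for qid, cands in candidates.items()
        }


class ToyPipeline:
    """Runs both stages and forwards the same kwargs to each."""

    def __init__(self, generator: ToyGenerator, judge: ToyJudge) -> None:
        self.generator = generator
        self.judge = judge

    def run(self, *, query: Dict[str, str], source: Dict[str, str], **kwargs: Any):
        candidates = self.generator.generate(query=query, source=source, **kwargs)
        return self.judge.judge(query=query, candidates=candidates, **kwargs)


query = {"q1": "arma virumque cano"}
source = {
    "s1": "arma virumque cano troiae",
    "s2": "gallia est omnis divisa",
    "s3": "cano arma",
}
# top_k is consumed by the generator, threshold by the judge.
results = ToyPipeline(ToyGenerator(), ToyJudge()).run(
    query=query, source=source, top_k=2, threshold=0.7
)
```

In this sketch a misspelled keyword would also be dropped silently, which is the usual trade-off of forwarding one kwargs dictionary to several components.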
to_csv
to_csv(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
) -> None
Save pipeline results to a CSV file.
If results is None, the results from the last run() call
are used.
Columns: query_id, source_id, source_text,
candidate_score, judgment_score.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Destination file path. TYPE: Union[str, Path] |
| results | Explicit results to save. Defaults to the results of the last run() call. TYPE: CandidateJudgeOutput \| None. DEFAULT: None |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no results are available. |
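The fallback behaviour (use explicit results if given, otherwise the last run() output, otherwise raise ValueError) can be sketched as follows. ToyPipeline, its _last_results attribute, and the tuple layout of a match are assumptions made for illustration; only the column names and the ValueError come from the description above.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical match layout: (source_id, source_text, candidate_score, judgment_score)
Match = Tuple[str, str, Optional[float], float]
Results = Dict[str, List[Match]]

COLUMNS = ["query_id", "source_id", "source_text", "candidate_score", "judgment_score"]


class ToyPipeline:
    """Stand-in that remembers the output of its most recent run()."""

    def __init__(self) -> None:
        self._last_results: Optional[Results] = None

    def run(self) -> Results:
        results: Results = {"q1": [("s1", "arma virumque cano", 0.75, 1.0)]}
        self._last_results = results
        return results

    def to_csv(self, path: Union[str, Path], results: Optional[Results] = None) -> None:
        # Explicit results win; otherwise fall back to the last run().
        if results is None:
            results = self._last_results
        if results is None:
            raise ValueError("No results available: call run() first or pass results.")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(COLUMNS)
            for query_id, matches in results.items():
                for source_id, text, cand, judgment in matches:
                    # A missing candidate score is written as an empty cell (assumption).
                    writer.writerow([query_id, source_id, text, "" if cand is None else cand, judgment])


pipeline = ToyPipeline()

# Saving before any run() fails, and no file is created.
try:
    pipeline.to_csv("unused.csv")
    raised = False
except ValueError:
    raised = True

# After run(), to_csv() picks up the stored results.
out_path = Path(tempfile.mkdtemp()) / "results.csv"
pipeline.run()
pipeline.to_csv(out_path)
header = out_path.read_text(encoding="utf-8").splitlines()[0]
```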
to_json
to_json(
    path: Union[str, Path],
    results: CandidateJudgeOutput | None = None,
    *,
    indent: int = 2,
) -> None
Save pipeline results to a JSON file.
If results is None, the results from the last run() call
are used.
Produces a JSON object keyed by query segment ID, where each value
is a list of match objects with source_id, source_text,
candidate_score, and judgment_score.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Destination file path. TYPE: Union[str, Path] |
| results | Explicit results to save. Defaults to the results of the last run() call. TYPE: CandidateJudgeOutput \| None. DEFAULT: None |
| indent | JSON indentation level (default 2). TYPE: int. DEFAULT: 2 |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no results are available. |
Type Definitions
Data classes and type aliases used across all pipelines.
locisimiles.pipeline._types
Shared type definitions and utilities for pipeline modules.
This module defines the common data structures used across all pipeline implementations for representing detection results.
Pipeline Architecture
Every pipeline follows a two-phase pattern:
- Candidate Generation: narrows the search space, producing a CandidateGeneratorOutput (mapping of query IDs → Candidate lists).
- Judgment: scores or classifies candidate pairs, producing a CandidateJudgeOutput (mapping of query IDs → CandidateJudge lists).
Candidate
dataclass
Candidate(segment: TextSegment, score: float)
A single candidate match produced by a candidate-generation stage.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| segment | The matching source segment. TYPE: TextSegment |
| score | Relevance score (e.g. cosine similarity, shared-word ratio). TYPE: float |
Example
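A minimal sketch of how Candidate objects might be built and ranked. The dataclasses are redefined locally so the snippet runs on its own; in particular, the TextSegment stand-in with id and text fields is an assumption, and the real class may look different.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextSegment:
    """Hypothetical stand-in; field names are assumed for illustration."""
    id: str
    text: str


@dataclass
class Candidate:
    """Local mirror of the documented dataclass: a segment plus a relevance score."""
    segment: TextSegment
    score: float


candidates: List[Candidate] = [
    Candidate(TextSegment("s2", "cano arma"), score=0.42),
    Candidate(TextSegment("s1", "arma virumque cano troiae"), score=0.91),
]

# Generator output maps each query ID to a ranked list, best score first.
ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
generator_output: Dict[str, List[Candidate]] = {"q1": ranked}
best = generator_output["q1"][0]
```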
CandidateJudge
dataclass
CandidateJudge(
    segment: TextSegment,
    candidate_score: Optional[float],
    judgment_score: float,
)
A single scored candidate after the judgment (classification / filtering) stage.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| segment | The matching source segment. TYPE: TextSegment |
| candidate_score | Score from candidate generation. TYPE: Optional[float] |
| judgment_score | Final judgment value, e.g. a classification probability, a binary 1.0/0.0 decision, or a rule-based score. TYPE: float |
Example
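A minimal sketch of working with judged matches, for instance keeping only those above a probability cut-off. The dataclasses are local mirrors of the documented fields; the TextSegment stand-in and the 0.5 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextSegment:
    """Hypothetical stand-in; field names are assumed for illustration."""
    id: str
    text: str


@dataclass
class CandidateJudge:
    """Local mirror of the documented dataclass."""
    segment: TextSegment
    candidate_score: Optional[float]  # may be None, per the Optional[float] annotation
    judgment_score: float


judged: List[CandidateJudge] = [
    CandidateJudge(TextSegment("s1", "arma virumque cano troiae"), candidate_score=0.91, judgment_score=0.88),
    CandidateJudge(TextSegment("s2", "cano arma"), candidate_score=None, judgment_score=0.12),
]

# Keep matches whose judgment score clears an (assumed) 0.5 cut-off.
accepted = [match for match in judged if match.judgment_score >= 0.5]
```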
CandidateGeneratorOutput
module-attribute
CandidateGeneratorOutput = Dict[str, List[Candidate]]
Mapping from query segment IDs → ranked lists of Candidate objects.
This is the output type of every candidate-generation stage.
CandidateJudgeOutput
module-attribute
CandidateJudgeOutput = Dict[str, List[CandidateJudge]]
Mapping from query segment IDs → lists of CandidateJudge objects.
This is the standard output type of every pipeline's run() method
and is consumed by the evaluator.
CandidateJudgeInput
module-attribute
CandidateJudgeInput = CandidateGeneratorOutput
Alias: the judge receives exactly what the generator produced.
pretty_print
pretty_print(results: CandidateJudgeOutput) -> None
Print pipeline results in a human-readable format.
Displays each query segment and its candidate matches with candidate scores and judgment scores.
| PARAMETER | DESCRIPTION |
|---|---|
| results | Pipeline output. TYPE: CandidateJudgeOutput |
results_to_csv
results_to_csv(
    results: CandidateJudgeOutput, path: Union[str, Path]
) -> None
Save pipeline results to a CSV file.
Writes one row per query-source match with the following columns:
- query_id: identifier of the query segment.
- source_id: identifier of the matching source segment.
- source_text: raw text of the source segment.
- candidate_score: score from the candidate-generation stage (empty when not available).
- judgment_score: final judgment / classification score.
| PARAMETER | DESCRIPTION |
|---|---|
| results | Pipeline output. TYPE: CandidateJudgeOutput |
| path | Destination file path. TYPE: Union[str, Path] |
results_to_json
results_to_json(
    results: CandidateJudgeOutput,
    path: Union[str, Path],
    *,
    indent: int = 2,
) -> None
Save pipeline results to a JSON file.
Produces a JSON object keyed by query segment ID. Each value is a
list of match objects with source_id, source_text,
candidate_score, and judgment_score.
| PARAMETER | DESCRIPTION |
|---|---|
| results | Pipeline output. TYPE: CandidateJudgeOutput |
| path | Destination file path. TYPE: Union[str, Path] |
| indent | JSON indentation level (default 2). TYPE: int. DEFAULT: 2 |
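The JSON layout described above (an object keyed by query segment ID, each value a list of match objects) can be sketched with the standard json module. The results_to_payload helper and the tuple layout of a match are hypothetical, and representing a missing candidate_score as null is an assumption rather than confirmed library behaviour.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical match layout: (source_id, source_text, candidate_score, judgment_score)
Match = Tuple[str, str, Optional[float], float]


def results_to_payload(results: Dict[str, List[Match]]) -> Dict[str, List[dict]]:
    """Convert toy results into the documented JSON shape."""
    return {
        query_id: [
            {
                "source_id": source_id,
                "source_text": text,
                "candidate_score": cand,  # None becomes null in JSON (assumption)
                "judgment_score": judgment,
            }
            for source_id, text, cand, judgment in matches
        ]
        for query_id, matches in results.items()
    }


results = {"q1": [("s1", "arma virumque cano", None, 1.0)]}

# indent=2 mirrors the documented default; ensure_ascii=False keeps non-ASCII text readable.
text = json.dumps(results_to_payload(results), indent=2, ensure_ascii=False)
decoded = json.loads(text)
```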