CLI Reference
LociSimiles provides a command-line interface for common workflows.
Installation
The CLI is installed automatically with the package.
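To confirm that the entry point is reachable after installing the package, a quick shell check can help. This sketch assumes the executable is named `locisimiles`, as the command names on this page suggest. It relies only on the shell's `command -v` builtin and makes no assumptions about the CLI's own flags.

```shell
# Check whether the `locisimiles` executable is reachable from this shell.
# Assumption: the installed entry point is named `locisimiles`.
if command -v locisimiles >/dev/null 2>&1; then
    cli_status="found"
else
    cli_status="missing"
fi
echo "locisimiles CLI: $cli_status"
```

If the status is `missing`, the package is probably installed in a different environment than the one the shell is using.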
Commands
locisimiles run
Run the intertextual detection pipeline on source and target documents.
Arguments
| Argument | Description |
|---|---|
| SOURCE | Path to the source CSV file |
| TARGET | Path to the target CSV file |
Options
| Option | Default | Description |
|---|---|---|
| --output, -o | results.csv | Output file path |
| --model, -m | sentence-transformers/all-MiniLM-L6-v2 | Model name or path |
| --top-k, -k | 10 | Number of candidates to retrieve |
| --threshold, -t | 0.85 | Classification threshold |
| --batch-size, -b | 32 | Batch size for processing |
Examples
Basic usage:
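A minimal invocation passes only the two positional arguments and leaves every option at its default. The file names `source.csv` and `target.csv` are placeholders, and the guard around the call is there only so the snippet can be pasted into a shell where the package is not installed yet.

```shell
# Placeholder file names; replace them with your own CSV paths.
SOURCE=source.csv
TARGET=target.csv

# With no options given, the defaults from the Options table apply
# (output: results.csv, top-k: 10, threshold: 0.85, batch size: 32).
if command -v locisimiles >/dev/null 2>&1; then
    locisimiles run "$SOURCE" "$TARGET"
else
    echo "locisimiles not found; would run: locisimiles run $SOURCE $TARGET"
fi
```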
With custom output and model:
locisimiles run source.csv target.csv \
    --output results.csv \
    --model bert-base-multilingual-cased \
    --top-k 20
locisimiles evaluate
Evaluate detection results against ground truth.
Arguments
| Argument | Description |
|---|---|
| PREDICTIONS | Path to predictions CSV file |
| GROUND_TRUTH | Path to ground truth CSV file |
Options
| Option | Default | Description |
|---|---|---|
| --output, -o | None | Output file for metrics (prints to stdout if not specified) |
Examples
Save metrics to file:
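Passing `--output` sends the metrics to a file instead of stdout. In this sketch all three file names are placeholders, including `metrics.txt`; the format of the metrics file is not specified here, so the extension is only a guess.

```shell
PREDICTIONS=results.csv        # e.g. the file written by `locisimiles run`
GROUND_TRUTH=ground_truth.csv  # placeholder name
METRICS=metrics.txt            # placeholder name; on-disk format is an unknown

# The guard lets the snippet run harmlessly where the CLI is not installed.
if command -v locisimiles >/dev/null 2>&1; then
    locisimiles evaluate "$PREDICTIONS" "$GROUND_TRUTH" --output "$METRICS"
else
    echo "locisimiles not found; would run: locisimiles evaluate $PREDICTIONS $GROUND_TRUTH --output $METRICS"
fi
```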
Input File Formats
Source/Target CSV
Each source or target CSV file should contain at least an ID column and a text column.
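As an illustration, the following writes a tiny source file with a heredoc. The column names `id` and `text` are assumptions made for this sketch, not confirmed headers, so check what the loader actually expects. Text fields that contain commas are quoted.

```shell
# Write a three-row example file.
# Assumption: the ID and text columns are named `id` and `text`.
cat > source.csv <<'EOF'
id,text
s1,"Arma virumque cano, Troiae qui primus ab oris"
s2,"Italiam fato profugus Laviniaque venit"
s3,"litora, multum ille et terris iactatus et alto"
EOF

# Show the header, then count the data rows.
head -n 1 source.csv
tail -n +2 source.csv | wc -l
```

A target file would follow the same layout with its own IDs.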
Ground Truth CSV
Ground truth files should contain query-reference pairs, each with a label: 1 for a true match and 0 for a non-match.
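A ground truth file along these lines might look as follows. The header names `query_id` and `reference_id` are assumptions borrowed from the output columns documented on this page, `label` follows the description above, and the IDs are invented for illustration.

```shell
# Assumed header: query_id, reference_id, label (1 = match, 0 = non-match).
cat > ground_truth.csv <<'EOF'
query_id,reference_id,label
s1,t1,1
s1,t2,0
s2,t7,1
EOF

# Count the positive pairs (label == 1), skipping the header row.
awk -F',' 'NR > 1 && $3 == 1 { n++ } END { print n + 0 }' ground_truth.csv
```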
Output Format
The pipeline outputs a CSV with the following columns:
| Column | Description |
|---|---|
| query_id | ID of the source text segment |
| reference_id | ID of the matched target segment |
| score | Similarity/classification score |
| above_threshold | Whether the score exceeds the threshold |
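Because the output is plain CSV, standard tools can post-process it. The sketch below builds a small sample file with made-up scores, then keeps only the rows whose `score` is greater than 0.85, the default threshold. It filters on the numeric `score` column rather than on `above_threshold`, since the literal values written to that column are not specified here (the `True`/`False` spelling in the sample is an assumption). It also assumes no field contains an embedded comma.

```shell
# Made-up sample in the documented column order (illustrative values only).
cat > sample_results.csv <<'EOF'
query_id,reference_id,score,above_threshold
s1,t1,0.93,True
s1,t2,0.41,False
s2,t7,0.88,True
EOF

# Keep the header plus every row whose score exceeds the threshold.
awk -F',' -v t=0.85 'NR == 1 || $3 + 0 > t + 0' sample_results.csv > matches.csv
cat matches.csv
```

Changing `t` allows a stricter or looser cut-off to be tried without rerunning the pipeline.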
Environment Variables
| Variable | Description |
|---|---|
| LOCISIMILES_CACHE_DIR | Directory for model caching |
| LOCISIMILES_DEVICE | Device for computation (cpu, cuda, mps) |
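Both variables can be exported in the shell before invoking the CLI. In this sketch the cache path is a placeholder under the temporary directory, and `cpu` is one of the three device values listed above. Creating the directory up front is a precaution, since it is not stated whether the tool creates it on demand.

```shell
# Placeholder cache location; any writable directory should do.
export LOCISIMILES_CACHE_DIR="${TMPDIR:-/tmp}/locisimiles-cache"
# One of: cpu, cuda, mps
export LOCISIMILES_DEVICE=cpu

# Create the cache directory if it does not exist yet.
mkdir -p "$LOCISIMILES_CACHE_DIR"
echo "cache: $LOCISIMILES_CACHE_DIR, device: $LOCISIMILES_DEVICE"
```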