CLI Reference
LociSimiles provides a single CLI entrypoint, `locisimiles`. It takes a query CSV, a source CSV, and a required output path (`-o`).
Installation
For Word2Vec retrieval, also install:
Arguments
| Argument | Description |
|---|---|
| `QUERY.csv` | Path to the query CSV file (columns: `seg_id`, `text`) |
| `SOURCE.csv` | Path to the source CSV file (columns: `seg_id`, `text`) |
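Both input files share the same two-column layout. As a minimal sketch, assuming the CLI expects a header row naming the `seg_id` and `text` columns (the table lists the columns but does not say whether a header is required), such a file could be produced with Python's standard `csv` module. The `write_segments` helper and the sample rows are hypothetical.

```python
import csv


def write_segments(path, segments):
    """Write (seg_id, text) pairs to `path` as CSV, with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumed header; the column names come from the arguments table.
        writer.writerow(["seg_id", "text"])
        writer.writerows(segments)


# Hypothetical sample data, for illustration only.
QUERY_SEGMENTS = [
    ("q1", "arma virumque cano"),
    ("q2", "Troiae qui primus ab oris"),
]

if __name__ == "__main__":
    write_segments("query.csv", QUERY_SEGMENTS)
```

Using `csv.writer` rather than manual string joins keeps segments that contain commas or quotes correctly escaped.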
Options
| Option | Default | Description |
|---|---|---|
| `-o, --output` | required | Output CSV path |
| `--pipeline` | `two-stage` | Pipeline type: `two-stage`, `word2vec-retrieval`, `latin-bert-retrieval`, or `latin-bert-two-stage` |
| `--classification-model` | `julian-schelb/xlm-roberta-large-class-lat-intertext-v1` | Classifier model (two-stage pipelines only) |
| `--embedding-model` | `julian-schelb/multilingual-e5-large-emb-lat-intertext-v1` | Embedding model (two-stage only) |
| `--latin-bert-model` | `xlm-roberta-base` | Hugging Face model for contextual Latin BERT retrieval |
| `--latin-bert-model-path` | none | Optional local model directory for Latin BERT |
| `--latin-bert-max-length` | `256` | Maximum tokenized sequence length for contextual retrieval |
| `--latin-bert-min-token-length` | `2` | Minimum token length for contextual scoring |
| `--latin-bert-disable-stopword-filter` | `False` | Disable built-in Latin stopword filtering |
| `--word2vec-model-path` | package default path | Local gensim `.model` path (Word2Vec pipeline) |
| `--word2vec-interval` | `0` | Maximum token gap for Word2Vec bigrams |
| `--word2vec-order-free` | `False` | Use order-insensitive bigrams |
| `-k, --top-k` | `10` | Number of retrieved candidates per query |
| `-t, --threshold` | `0.85` | Threshold for `above_threshold` in output |
| `--device` | `auto` | `auto`, `cuda`, `mps`, or `cpu` |
| `-v, --verbose` | `False` | Verbose logs |
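When a run is scripted rather than typed, the options above can be assembled programmatically. The following sketch is not part of LociSimiles: `build_command` and `run` are hypothetical helpers that map keyword arguments to the long flags in the table, on the assumption that boolean options such as `--verbose` are bare switches.

```python
import shlex
import subprocess


def build_command(query_csv, source_csv, output_csv, **options):
    """Assemble a locisimiles argument list from keyword options.

    Keyword names map to long flags: top_k=20 becomes `--top-k 20`.
    True emits a bare flag; False or None omits the option entirely.
    """
    command = ["locisimiles", query_csv, source_csv, "-o", output_csv]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            command.append(flag)
        elif value is not None and value is not False:
            command.extend([flag, str(value)])
    return command


def run(command):
    """Execute the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(command, check=True)


COMMAND = build_command(
    "query.csv", "source.csv", "results.csv",
    pipeline="two-stage", top_k=20, threshold=0.85, verbose=True,
)

if __name__ == "__main__":
    # Print the shell-quoted command instead of executing it.
    print(shlex.join(COMMAND))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.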
Two-Stage Flow
locisimiles query.csv source.csv -o results.csv \
  --pipeline two-stage \
  --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
  --embedding-model julian-schelb/multilingual-e5-large-emb-lat-intertext-v1 \
  --top-k 20 \
  --threshold 0.85
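As the option descriptions suggest, a two-stage run first retrieves the `--top-k` most similar source segments for each query and then scores every candidate with the classifier. The sketch below illustrates that control flow only. `two_stage` is a hypothetical function, and a toy token-overlap score stands in for both the embedding model and the classifier, so this is not the actual LociSimiles implementation.

```python
def token_overlap(a, b):
    """Toy Jaccard overlap on whitespace tokens; stands in for a neural model."""
    left, right = set(a.split()), set(b.split())
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def two_stage(query, sources, retrieve_score, classify, top_k=10, threshold=0.85):
    """Rank sources by retrieval score, keep top_k, then classify each candidate."""
    ranked = sorted(
        ((source_id, text, retrieve_score(query, text)) for source_id, text in sources),
        key=lambda item: item[2],
        reverse=True,
    )
    results = []
    for source_id, text, similarity in ranked[:top_k]:
        probability = classify(query, text)  # second-stage score
        results.append({
            "source_id": source_id,
            "similarity": similarity,
            "probability": probability,
            "above_threshold": "Yes" if probability >= threshold else "No",
        })
    return results


# Hypothetical source segments, for illustration only.
SOURCES = [
    ("s1", "arma virumque cano"),
    ("s2", "arma gravi numero"),
    ("s3", "gallia est omnis divisa"),
]
RESULTS = two_stage("arma virumque cano", SOURCES, token_overlap, token_overlap, top_k=2)
```

Only the `top_k` retrieved candidates reach the second stage, which is why a larger `--top-k` trades run time for recall.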
Word2Vec Flow
locisimiles query.csv source.csv -o results.csv \
  --pipeline word2vec-retrieval \
  --word2vec-model-path ./models/latin_w2v_bamman_lemma300_100_1.model \
  --word2vec-interval 2 \
  --word2vec-order-free \
  --top-k 20 \
  --threshold 0.85
If `--word2vec-model-path` is not set, the CLI falls back to this local default path: `models/latin_w2v_bamman_lemma300_100_1.model`.
The file must exist on disk. No automatic download is performed.
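Because nothing is downloaded automatically, it can help to verify the model file before starting a long run. A minimal sketch of such a pre-flight check follows. `resolve_word2vec_model` is a hypothetical helper, and the default path is assumed here to be relative to the current working directory, which the documentation does not state.

```python
from pathlib import Path

# Default path quoted above; assumed to be relative to the working directory.
DEFAULT_MODEL_PATH = Path("models/latin_w2v_bamman_lemma300_100_1.model")


def resolve_word2vec_model(model_path=None):
    """Return the model path to use, or raise if the file is missing."""
    path = Path(model_path) if model_path is not None else DEFAULT_MODEL_PATH
    if not path.is_file():
        raise FileNotFoundError(
            f"Word2Vec model not found at {path}; place the file there "
            "or pass --word2vec-model-path"
        )
    return path
```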
Word2Vec mode expects pre-lemmatized text in the CSV text column.
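The `--word2vec-interval` and `--word2vec-order-free` options control how bigrams are formed from the lemmatized tokens. The sketch below shows one plausible reading of those options, not the pipeline's confirmed behaviour: it assumes the interval is the maximum number of tokens allowed between the two members of a bigram, and that order-free mode treats `(a, b)` and `(b, a)` as the same pair. The `bigrams` function is hypothetical.

```python
def bigrams(tokens, interval=0, order_free=False):
    """Pair each token with later tokens at most `interval` tokens away."""
    pairs = []
    for i, first in enumerate(tokens):
        # j stops so that at most `interval` tokens sit between i and j.
        for j in range(i + 1, min(i + interval + 2, len(tokens))):
            pair = (first, tokens[j])
            if order_free:
                # Canonical ordering makes (a, b) and (b, a) compare equal.
                pair = tuple(sorted(pair))
            pairs.append(pair)
    return pairs


# Hypothetical lemmatized tokens, for illustration only.
LEMMAS = ["arma", "vir", "cano"]
ADJACENT = bigrams(LEMMAS)                            # interval 0: neighbours only
GAPPED = bigrams(LEMMAS, interval=1)                  # one intervening token allowed
UNORDERED = bigrams(LEMMAS, interval=1, order_free=True)
```

Under this reading, the default interval of `0` yields only adjacent pairs, and larger values add pairs that skip over intervening tokens.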
Latin BERT Retrieval Flow
Token-level contextual similarity using a BERT model (Gong-style approach):
locisimiles query.csv source.csv -o results.csv \
  --pipeline latin-bert-retrieval \
  --latin-bert-model ashleygong03/bamman-burns-latin-bert \
  --latin-bert-max-length 256 \
  --top-k 20 \
  --threshold 0.85
Or use a local model directory:
locisimiles query.csv source.csv -o results.csv \
  --pipeline latin-bert-retrieval \
  --latin-bert-model-path ./models/latinbert \
  --top-k 20
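The exact scoring used by the Latin BERT pipeline is not described here. To give a feel for what token-level similarity can look like, the sketch below uses a generic technique, greedy max-cosine matching: each query token is matched to its most similar source token and the matches are averaged. Hand-written toy vectors stand in for contextual BERT embeddings, and the `min_length` and `stopwords` parameters only mimic the role of `--latin-bert-min-token-length` and the stopword filter. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep(token, min_length, stopwords):
    """Drop tokens that are too short or listed as stopwords."""
    return len(token) >= min_length and token not in stopwords


def token_match_score(query, source, min_length=2, stopwords=frozenset()):
    """Average, over query tokens, of the best cosine match in the source.

    `query` and `source` are lists of (token, vector) pairs.
    """
    q_vecs = [vec for tok, vec in query if keep(tok, min_length, stopwords)]
    s_vecs = [vec for tok, vec in source if keep(tok, min_length, stopwords)]
    if not q_vecs or not s_vecs:
        return 0.0
    return sum(max(cosine(q, s) for s in s_vecs) for q in q_vecs) / len(q_vecs)


# Toy two-dimensional "embeddings", for illustration only.
QUERY = [("arma", [1.0, 0.0]), ("et", [0.0, 1.0])]
SOURCE = [("arma", [1.0, 0.0])]
UNFILTERED = token_match_score(QUERY, SOURCE)
FILTERED = token_match_score(QUERY, SOURCE, stopwords=frozenset({"et"}))
```

In this toy case, filtering out the function word `et` raises the score, because only the content token is left to be matched.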
Latin BERT Two-Stage Flow
Combines contextual token retrieval with classification:
locisimiles query.csv source.csv -o results.csv \
  --pipeline latin-bert-two-stage \
  --latin-bert-model ashleygong03/bamman-burns-latin-bert \
  --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
  --top-k 20 \
  --threshold 0.85
Output Format
The CLI writes the following columns:
| Column | Description |
|---|---|
| `query_id` | Query segment identifier |
| `query_text` | Query segment text |
| `source_id` | Source segment identifier |
| `source_text` | Source segment text |
| `similarity` | Candidate similarity score |
| `probability` | Final-stage score (classification or thresholded retrieval score) |
| `above_threshold` | `Yes` if score >= threshold, else `No` |
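The output file can be post-processed with any CSV reader. As a minimal sketch, assuming the file starts with a header row using the column names above and that `probability` is written as a plain decimal number, the following keeps only the rows flagged `Yes` and sorts them by score. `accepted_matches` and the sample rows are hypothetical.

```python
import csv
import io


def accepted_matches(handle):
    """Return rows flagged `Yes`, highest probability first."""
    rows = [row for row in csv.DictReader(handle) if row["above_threshold"] == "Yes"]
    return sorted(rows, key=lambda row: float(row["probability"]), reverse=True)


# Hypothetical output rows, for illustration only.
SAMPLE = io.StringIO(
    "query_id,query_text,source_id,source_text,similarity,probability,above_threshold\n"
    "q1,arma virumque cano,s1,arma virumque cano,0.97,0.99,Yes\n"
    "q1,arma virumque cano,s2,arma gravi numero,0.71,0.40,No\n"
    "q2,Troiae qui primus,s3,Troiae qui primus ab oris,0.93,0.91,Yes\n"
)

if __name__ == "__main__":
    for row in accepted_matches(SAMPLE):
        print(row["query_id"], row["source_id"], row["probability"])
```

For a real run, replace `SAMPLE` with a file handle such as `open("results.csv", newline="", encoding="utf-8")`.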
GUI Equivalent
The same Word2Vec settings are available in the GUI under:
- Pipeline Configuration
- Pipeline Type: Word2Vec Retrieval (Burns-Style)
- Word2Vec Model Path / Bigram Interval / Order-Free Bigrams