CLI Reference

LociSimiles provides a single CLI entrypoint:

locisimiles QUERY.csv SOURCE.csv -o RESULTS.csv [OPTIONS]

Installation

pip install locisimiles

For Word2Vec retrieval, also install:

pip install "locisimiles[word2vec]"

Arguments

Argument	Description
`QUERY.csv`	Path to query CSV file (`seg_id`, `text`)
`SOURCE.csv`	Path to source CSV file (`seg_id`, `text`)
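
Both input files share the same two-column layout. As a minimal sketch, they could be produced with Python's standard `csv` module as shown below. The helper name `write_segments`, the segment IDs, and the Latin snippets are illustrative assumptions and not part of LociSimiles; the sketch also assumes the CLI reads a header row with the column names listed above.

```python
import csv
from pathlib import Path


def write_segments(path, segments):
    """Write (seg_id, text) pairs to a CSV file with a `seg_id,text` header."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["seg_id", "text"])  # column names from the Arguments table
        writer.writerows(segments)


# Hypothetical example segments; replace them with your own data.
write_segments("query.csv", [("q1", "arma virumque cano")])
write_segments("source.csv", [("s1", "arma gravi numero violentaque bella parabam")])
```

The `csv` module handles quoting automatically, so segment text containing commas or quotation marks is still written as a single field.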

Options

Option	Default	Description
`-o, --output`	required	Output CSV path
`--pipeline`	`two-stage`	Pipeline type: `two-stage`, `word2vec-retrieval`, `latin-bert-retrieval`, or `latin-bert-two-stage`
`--classification-model`	`julian-schelb/xlm-roberta-large-class-lat-intertext-v1`	Classifier model (two-stage pipelines only)
`--embedding-model`	`julian-schelb/multilingual-e5-large-emb-lat-intertext-v1`	Embedding model (`two-stage` pipeline only)
`--latin-bert-model`	`xlm-roberta-base`	Hugging Face model for contextual Latin BERT retrieval
`--latin-bert-model-path`	none	Optional local model directory for Latin BERT
`--latin-bert-max-length`	`256`	Max tokenized sequence length for contextual retrieval
`--latin-bert-min-token-length`	`2`	Min token length for contextual scoring
`--latin-bert-disable-stopword-filter`	`False`	Disable built-in Latin stopword filtering
`--word2vec-model-path`	package default path	Local gensim `.model` path (Word2Vec pipeline)
`--word2vec-interval`	`0`	Max token gap for Word2Vec bigrams
`--word2vec-order-free`	`False`	Use order-insensitive bigrams
`-k, --top-k`	`10`	Number of retrieved candidates per query
`-t, --threshold`	`0.85`	Score threshold used to set the `above_threshold` output column
`--device`	`auto`	`auto`, `cuda`, `mps`, or `cpu`
`-v, --verbose`	`False`	Verbose logs
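
When the CLI is driven from a script, the options above can be assembled as an argument list, which avoids shell-quoting problems with paths that contain spaces. The following is a sketch only: `build_command` is a hypothetical helper, not part of the LociSimiles package, and it covers just a handful of the flags from the table.

```python
import shlex

# Pipeline names as listed in the Options table.
PIPELINES = {
    "two-stage",
    "word2vec-retrieval",
    "latin-bert-retrieval",
    "latin-bert-two-stage",
}


def build_command(query, source, output, pipeline="two-stage",
                  top_k=10, threshold=0.85, device="auto"):
    """Return an argv list for the locisimiles CLI (hypothetical helper)."""
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    return [
        "locisimiles", str(query), str(source),
        "-o", str(output),
        "--pipeline", pipeline,
        "--top-k", str(top_k),
        "--threshold", str(threshold),
        "--device", device,
    ]


command = build_command("query.csv", "source.csv", "results.csv", top_k=20)
print(shlex.join(command))  # shell-safe rendering of the argument list
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.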

Two-Stage Flow

locisimiles query.csv source.csv -o results.csv \
    --pipeline two-stage \
    --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
    --embedding-model julian-schelb/multilingual-e5-large-emb-lat-intertext-v1 \
    --top-k 20 \
    --threshold 0.85

Word2Vec Flow

locisimiles query.csv source.csv -o results.csv \
    --pipeline word2vec-retrieval \
    --word2vec-model-path ./models/latin_w2v_bamman_lemma300_100_1.model \
    --word2vec-interval 2 \
    --word2vec-order-free \
    --top-k 20 \
    --threshold 0.85

If `--word2vec-model-path` is not set, the CLI uses this local default path:

models/latin_w2v_bamman_lemma300_100_1.model

The file must exist on disk. No automatic download is performed.

Word2Vec mode expects pre-lemmatized text in the CSV `text` column.
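
Because the model is never downloaded automatically, it can help to verify the path before starting a long run. The sketch below assumes the default path is resolved relative to the current working directory, which the text above does not confirm; `resolve_word2vec_model` is a hypothetical helper, not a LociSimiles function.

```python
from pathlib import Path

# Default path quoted above; resolving it against the working directory is an assumption.
DEFAULT_W2V_MODEL = Path("models/latin_w2v_bamman_lemma300_100_1.model")


def resolve_word2vec_model(model_path=None):
    """Return the model path to use, or raise if the file is missing."""
    path = Path(model_path) if model_path is not None else DEFAULT_W2V_MODEL
    if not path.is_file():
        raise FileNotFoundError(
            f"Word2Vec model not found at {path}; "
            "place the file there or pass --word2vec-model-path"
        )
    return path


try:
    model = resolve_word2vec_model()
    print(f"Using Word2Vec model at {model}")
except FileNotFoundError as error:
    print(error)
```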

Latin BERT Retrieval Flow

This pipeline scores token-level contextual similarity using a BERT model (Gong-style approach):

locisimiles query.csv source.csv -o results.csv \
    --pipeline latin-bert-retrieval \
    --latin-bert-model ashleygong03/bamman-burns-latin-bert \
    --latin-bert-max-length 256 \
    --top-k 20 \
    --threshold 0.85

Or use a local model directory:

locisimiles query.csv source.csv -o results.csv \
    --pipeline latin-bert-retrieval \
    --latin-bert-model-path ./models/latinbert \
    --top-k 20

Latin BERT Two-Stage Flow

This pipeline combines contextual token retrieval with classification:

locisimiles query.csv source.csv -o results.csv \
    --pipeline latin-bert-two-stage \
    --latin-bert-model ashleygong03/bamman-burns-latin-bert \
    --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
    --top-k 20 \
    --threshold 0.85

Output Format

The CLI writes the following columns:

Column	Description
`query_id`	Query segment identifier
`query_text`	Query segment text
`source_id`	Source segment identifier
`source_text`	Source segment text
`similarity`	Candidate similarity score
`probability`	Final stage score (classification or thresholded retrieval score)
`above_threshold`	`Yes` if score >= threshold, else `No`
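
The results file can be post-processed with any CSV reader. As a minimal sketch, the snippet below keeps only rows flagged `Yes` in `above_threshold`. The helper name `confirmed_matches` and the sample rows are invented for illustration and are not real LociSimiles output; the sketch also assumes the file starts with a header row naming the columns above.

```python
import csv
import io


def confirmed_matches(handle):
    """Return the result rows whose `above_threshold` column is `Yes`."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["above_threshold"] == "Yes"]


# Hypothetical sample in the documented column layout (not real output).
sample = io.StringIO(
    "query_id,query_text,source_id,source_text,similarity,probability,above_threshold\n"
    "q1,arma virumque cano,s1,arma gravi numero,0.91,0.93,Yes\n"
    "q1,arma virumque cano,s2,gallia est omnis divisa,0.40,0.12,No\n"
)

for row in confirmed_matches(sample):
    print(row["query_id"], row["source_id"], row["probability"])
```

For a real run, replace the in-memory sample with `open("results.csv", newline="", encoding="utf-8")`.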

GUI Equivalent

The same Word2Vec settings are available in the GUI under:

Pipeline Configuration
Pipeline Type: Word2Vec Retrieval (Burns-Style)
Word2Vec Model Path / Bigram Interval / Order-Free Bigrams