Document Module

Classes for representing and loading text collections.

TextSegment

An individual unit of text with an identifier.

locisimiles.document.TextSegment

TextSegment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
)

Atomic unit of text inside a document.

A TextSegment represents a single passage, sentence, or verse from a larger document. Each segment has a unique identifier and optional metadata.

ATTRIBUTE DESCRIPTION
text

The raw text content of the segment.

TYPE: str

id

Unique identifier for the segment (e.g., "verg. aen. 1.1").

TYPE: ID

row_id

Position of the segment in the original document (0-indexed).

TYPE: int | None

meta

Optional dictionary of additional metadata.

TYPE: Dict[str, Any]

Example
segment = TextSegment(
    text="Arma virumque cano, Troiae qui primus ab oris",
    seg_id="verg. aen. 1.1",
    row_id=0,
    meta={"book": 1, "line": 1}
)
print(segment.text)  # "Arma virumque cano..."
print(segment.id)    # "verg. aen. 1.1"

Document

A collection of text segments with loading utilities.

locisimiles.document.Document

Document(
    path: str | Path,
    *,
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
    segment_delimiter: str = "\n",
)

Collection of text segments representing a document.

A Document is a container for TextSegments loaded from a file. It supports CSV/TSV files with 'seg_id' and 'text' columns, or plain text files where segments are separated by a delimiter.

ATTRIBUTE DESCRIPTION
path

Path to the source file.

TYPE: Path

author

Optional author name for the document.

TYPE: str | None

meta

Optional dictionary of document-level metadata.

TYPE: Dict[str, Any]

Example
from locisimiles.document import Document

# Load from CSV (must have 'seg_id' and 'text' columns)
vergil = Document("vergil_samples.csv", author="Vergil")

# Access segments
print(len(vergil))           # Number of segments
print(vergil.ids())          # List of segment IDs
print(vergil.get_text("verg. aen. 1.1"))  # Get text by ID

# Iterate over segments
for segment in vergil:
    print(f"{segment.id}: {segment.text[:50]}...")

# Add custom segments
vergil.add_segment(
    text="Custom text",
    seg_id="custom.1",
    meta={"source": "manual"}
)
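Since CSV input must carry 'seg_id' and 'text' columns, a minimal input file can be built with the standard library. This is a sketch of the expected format; the filename and sample rows are illustrative, not part of the library:

```python
import csv
from pathlib import Path

# Build a minimal CSV in the format Document expects:
# a header row with 'seg_id' and 'text' columns.
rows = [
    {"seg_id": "verg. aen. 1.1", "text": "Arma virumque cano, Troiae qui primus ab oris"},
    {"seg_id": "verg. aen. 1.2", "text": "Italiam, fato profugus, Laviniaque venit"},
]
path = Path("vergil_samples.csv")
with path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["seg_id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# The file can now be loaded as shown above:
# from locisimiles.document import Document
# vergil = Document(path, author="Vergil")
```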

ids

ids() -> List[ID]

Return segment IDs in original order.

get_text

get_text(seg_id: ID) -> str

Return raw text of a segment.

add_segment

add_segment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
) -> None

Add a new text segment to the document.

remove_segment

remove_segment(seg_id: ID) -> None

Delete a segment if present.

statistics

statistics() -> Dict[str, Any]

Return descriptive statistics (segment count, char/word totals, averages, min/max).
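The kind of figures statistics() reports can be sketched in plain Python over the segment texts. The key names below are hypothetical; inspect the returned dict for the actual ones:

```python
# Two sample segment texts standing in for a loaded Document.
texts = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Italiam, fato profugus, Laviniaque venit",
]

# Derive the same kind of descriptive statistics by hand.
word_counts = [len(t.split()) for t in texts]
summary = {
    "segments": len(texts),
    "total_chars": sum(len(t) for t in texts),
    "total_words": sum(word_counts),
    "avg_words": sum(word_counts) / len(texts),
    "min_words": min(word_counts),
    "max_words": max(word_counts),
}
print(summary)
```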

sentencize

sentencize(
    *,
    splitter: Optional[Callable[[str], List[str]]] = None,
    id_separator: str = ".",
) -> Document

Re-segment this document so that each segment contains exactly one sentence.

All segment texts are first joined (in row-id order) and then sentence-split as a single block. This correctly handles:

  • Segments containing multiple sentences → split into separate segments.
  • A single sentence spanning multiple rows → merged into one segment.

New segment IDs are derived from the original segment whose text starts the sentence, with a numeric suffix appended (e.g. "seg1.1", "seg1.2").

PARAMETER DESCRIPTION
splitter

A callable that takes a str and returns a list of sentence strings. When None, a simple punctuation-based splitter is used. To use spaCy:

import spacy
nlp = spacy.load("la_core_web_lg")
doc.sentencize(splitter=lambda t: [s.text for s in nlp(t).sents])

TYPE: Optional[Callable[[str], List[str]]] DEFAULT: None

id_separator

Separator inserted between the original segment ID and the sentence index when a segment is split (e.g. "seg1" → "seg1.1", "seg1.2").

TYPE: str DEFAULT: '.'

RETURNS DESCRIPTION
Document

The modified Document with one sentence per segment.

Example
doc = Document("mixed.csv")
doc.sentencize()
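A custom splitter is any callable from str to a list of sentence strings. Here is a minimal regex-based sketch, comparable to the default punctuation-based splitter described above:

```python
import re

def simple_splitter(text: str) -> list[str]:
    # Split after sentence-final punctuation followed by whitespace;
    # the lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Usage with a loaded Document:
# doc.sentencize(splitter=simple_splitter)

print(simple_splitter("Arma virumque cano. Troiae qui primus ab oris."))
```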

save_plain

save_plain(
    path: str | Path, *, delimiter: str = "\n"
) -> Path

Write all segment texts to a plain-text file.
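The resulting file is simply the segment texts joined by the delimiter. A sketch of the equivalent output, with illustrative texts and filename:

```python
from pathlib import Path

# Segment texts standing in for a loaded Document's contents.
texts = ["Arma virumque cano,", "Troiae qui primus ab oris"]
delimiter = "\n"

# Equivalent of doc.save_plain("vergil_plain.txt", delimiter="\n"):
out = Path("vergil_plain.txt")
out.write_text(delimiter.join(texts), encoding="utf-8")
print(out.read_text(encoding="utf-8"))
```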

save_csv

save_csv(path: str | Path) -> Path

Write all segments to a CSV file with seg_id and text columns.