Document Module

Classes for representing and loading text collections.

TextSegment

An individual unit of text with an identifier.

locisimiles.document.TextSegment

TextSegment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
)

Atomic unit of text inside a document.

A TextSegment represents a single passage, sentence, or verse from a larger document. Each segment has a unique identifier and optional metadata.

ATTRIBUTE DESCRIPTION
text

The raw text content of the segment.

TYPE: str

id

Unique identifier for the segment (e.g., "verg. aen. 1.1").

TYPE: ID

row_id

Position of the segment in the original document (0-indexed).

TYPE: int | None

meta

Optional dictionary of additional metadata.

TYPE: Dict[str, Any]

Example
segment = TextSegment(
    text="Arma virumque cano, Troiae qui primus ab oris",
    seg_id="verg. aen. 1.1",
    row_id=0,
    meta={"book": 1, "line": 1}
)
print(segment.text)  # "Arma virumque cano..."
print(segment.id)    # "verg. aen. 1.1"

Document

A collection of text segments with loading utilities.

locisimiles.document.Document

Document(
    path: str | Path,
    *,
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
    segment_delimiter: str = "\n",
)

Collection of text segments representing a document.

A Document is a container for TextSegments loaded from a file. It supports CSV/TSV files with 'seg_id' and 'text' columns, or plain text files where segments are separated by a delimiter.

ATTRIBUTE DESCRIPTION
path

Path to the source file.

TYPE: Path

author

Optional author name for the document.

TYPE: str | None

meta

Optional dictionary of document-level metadata.

TYPE: Dict[str, Any]

Example
from locisimiles.document import Document

# Load from CSV (must have 'seg_id' and 'text' columns)
vergil = Document("vergil_samples.csv", author="Vergil")

# Access segments
print(len(vergil))           # Number of segments
print(vergil.ids())          # List of segment IDs
print(vergil.get_text("verg. aen. 1.1"))  # Get text by ID

# Iterate over segments
for segment in vergil:
    print(f"{segment.id}: {segment.text[:50]}...")

# Add custom segments
vergil.add_segment(
    text="Custom text",
    seg_id="custom.1",
    meta={"source": "manual"}
)
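Since CSV input must carry 'seg_id' and 'text' columns, a minimal input file can be built with the standard library. This is a sketch of the expected format; the filename and sample rows are illustrative, not part of the library:

```python
import csv
from pathlib import Path

# Build a minimal CSV in the format Document expects:
# a header row with 'seg_id' and 'text' columns.
rows = [
    {"seg_id": "verg. aen. 1.1", "text": "Arma virumque cano, Troiae qui primus ab oris"},
    {"seg_id": "verg. aen. 1.2", "text": "Italiam, fato profugus, Laviniaque venit"},
]
path = Path("vergil_samples.csv")
with path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["seg_id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# The file can now be loaded as shown above:
# from locisimiles.document import Document
# vergil = Document(path, author="Vergil")
```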

ids

ids() -> List[ID]

Return segment IDs in original order.

get_text

get_text(seg_id: ID) -> str

Return raw text of a segment.

add_segment

add_segment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
) -> None

Add a new text segment to the document.

remove_segment

remove_segment(seg_id: ID) -> None

Delete a segment if present.

statistics

statistics() -> Dict[str, Any]

Return descriptive statistics (segment count, char/word totals, averages, min/max).
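The kind of figures statistics() reports can be sketched in plain Python over the segment texts. The key names below are hypothetical; inspect the returned dict for the actual ones:

```python
# Two sample segment texts standing in for a loaded Document.
texts = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Italiam, fato profugus, Laviniaque venit",
]

# Derive the same kind of descriptive statistics by hand.
word_counts = [len(t.split()) for t in texts]
summary = {
    "segments": len(texts),
    "total_chars": sum(len(t) for t in texts),
    "total_words": sum(word_counts),
    "avg_words": sum(word_counts) / len(texts),
    "min_words": min(word_counts),
    "max_words": max(word_counts),
}
print(summary)
```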

sentencize

sentencize(
    *,
    splitter: Optional[Callable[[str], List[str]]] = None,
    id_separator: str = ".",
) -> Document

Re-segment this document so that each segment contains exactly one sentence.

All segment texts are first joined (in row-id order) and then sentence-split as a single block. This correctly handles:

  • Segments containing multiple sentences → split into separate segments.
  • A single sentence spanning multiple rows → merged into one segment.

New segment IDs are derived from the original segment whose text starts the sentence, with a numeric suffix appended (e.g. "seg1.1", "seg1.2").

PARAMETER DESCRIPTION
splitter

A callable that takes a str and returns a list of sentence strings. When None, a simple punctuation-based splitter is used. To use spaCy:

import spacy
nlp = spacy.load("la_core_web_lg")
doc.sentencize(splitter=lambda t: [s.text for s in nlp(t).sents])

TYPE: Optional[Callable[[str], List[str]]] DEFAULT: None

id_separator

Separator inserted between the original segment ID and the sentence index when a segment is split (e.g. "seg1" → "seg1.1", "seg1.2").

TYPE: str DEFAULT: '.'

RETURNS DESCRIPTION
Document

The modified Document with one sentence per segment.

Example
doc = Document("mixed.csv")
doc.sentencize()
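A custom splitter is any callable from str to a list of sentence strings. Here is a minimal regex-based sketch, comparable to the default punctuation-based splitter described above:

```python
import re

def simple_splitter(text: str) -> list[str]:
    # Split after sentence-final punctuation followed by whitespace;
    # the lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Usage with a loaded Document:
# doc.sentencize(splitter=simple_splitter)

print(simple_splitter("Arma virumque cano. Troiae qui primus ab oris."))
```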

save_plain

save_plain(
    path: str | Path, *, delimiter: str = "\n"
) -> Path

Write all segment texts to a plain-text file.
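The resulting file is simply the segment texts joined by the delimiter. A sketch of the equivalent output, with illustrative texts and filename:

```python
from pathlib import Path

# Segment texts standing in for a loaded Document's contents.
texts = ["Arma virumque cano,", "Troiae qui primus ab oris"]
delimiter = "\n"

# Equivalent of doc.save_plain("vergil_plain.txt", delimiter="\n"):
out = Path("vergil_plain.txt")
out.write_text(delimiter.join(texts), encoding="utf-8")
print(out.read_text(encoding="utf-8"))
```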

save_csv

save_csv(path: str | Path) -> Path

Write all segments to a CSV file with seg_id and text columns.