Document Module
Classes for representing and loading text collections.
TextSegment
An individual unit of text with an identifier.
locisimiles.document.TextSegment
TextSegment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
)
Atomic unit of text inside a document.
A TextSegment represents a single passage, sentence, or verse from a larger document. Each segment has a unique identifier and optional metadata.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `text` | The raw text content of the segment. |
| `id` | Unique identifier for the segment (e.g., "verg. aen. 1.1"). |
| `row_id` | Position of the segment in the original document (0-indexed). |
| `meta` | Optional dictionary of additional metadata. |
Example
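A minimal sketch of how a segment might be built and read. It uses a stand-in class that only mirrors the constructor signature and attributes documented above, so it runs without the library installed. `SegmentSketch` is hypothetical, and two details are assumptions: `ID` is treated as a plain string, and a missing `meta` becomes an empty dict.

```python
from typing import Any, Dict, Optional


class SegmentSketch:
    """Hypothetical stand-in that mirrors the documented TextSegment shape."""

    def __init__(
        self,
        text: str,
        seg_id: str,  # assumption: ID behaves like a plain string
        *,
        row_id: Optional[int] = None,
        meta: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.text = text
        self.id = seg_id  # the constructor argument seg_id is exposed as .id
        self.row_id = row_id
        self.meta = meta if meta is not None else {}  # assumption: None -> {}


seg = SegmentSketch(
    "Arma virumque cano, Troiae qui primus ab oris",
    "verg. aen. 1.1",
    row_id=0,
    meta={"book": 1},
)
print(f"{seg.id} (row {seg.row_id}): {seg.text}")
```

Note that `row_id` and `meta` are keyword-only, matching the `*` in the signature.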
Document
A collection of text segments with loading utilities.
locisimiles.document.Document
Document(
    path: str | Path,
    *,
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
    segment_delimiter: str = "\n",
)
Collection of text segments representing a document.
A Document is a container for TextSegments loaded from a file. It supports CSV/TSV files with 'seg_id' and 'text' columns, or plain text files where segments are separated by a delimiter.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `path` | Path to the source file. |
| `author` | Optional author name for the document. |
| `meta` | Optional dictionary of document-level metadata. |
Example
from locisimiles.document import Document

# Load from CSV (must have 'seg_id' and 'text' columns)
vergil = Document("vergil_samples.csv", author="Vergil")

# Access segments
print(len(vergil))                        # Number of segments
print(vergil.ids())                       # List of segment IDs
print(vergil.get_text("verg. aen. 1.1"))  # Get text by ID

# Iterate over segments
for segment in vergil:
    print(f"{segment.id}: {segment.text[:50]}...")

# Add custom segments
vergil.add_segment(
    text="Custom text",
    seg_id="custom.1",
    meta={"source": "manual"},
)
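The two input formats described for `Document` can be sketched with the standard library alone. The helper names `read_tabular` and `read_plain` are hypothetical. The text above does not say how the library detects the format, assigns IDs to plain-text segments, or treats empty segments, so those choices below are assumptions.

```python
import csv
import io
from typing import List, TextIO, Tuple


def read_tabular(stream: TextIO, delimiter: str = ",") -> List[Tuple[str, str]]:
    """Read (seg_id, text) pairs from CSV/TSV content with 'seg_id' and 'text' columns."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [(row["seg_id"], row["text"]) for row in reader]


def read_plain(content: str, segment_delimiter: str = "\n") -> List[Tuple[str, str]]:
    """Split plain text on the delimiter and number the resulting segments."""
    parts = [p.strip() for p in content.split(segment_delimiter)]
    kept = [p for p in parts if p]  # assumption: empty segments are dropped
    return [(f"seg_{i}", p) for i, p in enumerate(kept)]  # assumption: generated IDs


csv_content = 'seg_id,text\nverg. aen. 1.1,"Arma virumque cano, Troiae qui primus ab oris"\n'
rows = read_tabular(io.StringIO(csv_content))
plain = read_plain("first line\nsecond line\n")
print(rows)
print(plain)
```

For a TSV file, the same reader would be called with `delimiter="\t"`.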
add_segment
add_segment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
) -> None
Add a new text segment to the document.
statistics
Return descriptive statistics (segment count, char/word totals, averages, min/max).
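As a rough illustration of the quantities listed, the sketch below computes them over a plain list of segment texts. The function name `describe`, the dictionary keys, and whitespace-based word counting are all assumptions; the actual return type and key names of `statistics` are not documented here.

```python
from statistics import mean
from typing import Dict, List, Union


def describe(texts: List[str]) -> Dict[str, Union[int, float]]:
    """Compute segment count plus char/word totals, averages, and min/max."""
    if not texts:
        return {"segments": 0}
    chars = [len(t) for t in texts]
    words = [len(t.split()) for t in texts]  # assumption: whitespace tokenization
    return {
        "segments": len(texts),
        "total_chars": sum(chars),
        "total_words": sum(words),
        "avg_chars": mean(chars),
        "avg_words": mean(words),
        "min_chars": min(chars),
        "max_chars": max(chars),
        "min_words": min(words),
        "max_words": max(words),
    }


stats = describe(["arma virumque cano", "Troiae qui primus ab oris"])
print(stats)
```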
sentencize
sentencize(
    *,
    splitter: Optional[Callable[[str], List[str]]] = None,
    id_separator: str = ".",
) -> Document
Re-segment this document so that each segment contains exactly one sentence.
All segment texts are first joined (in row-id order) and then sentence-split as a single block. This correctly handles:
- Segments containing multiple sentences → split into separate segments.
- A single sentence spanning multiple rows → merged into one segment.
New segment IDs are derived from the original segment whose text
starts the sentence, with a numeric suffix appended (e.g.
"seg1.1", "seg1.2").
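| PARAMETER | DESCRIPTION |
|---|---|
| `splitter` | A callable that takes a string and returns a list of sentence strings. TYPE: `Optional[Callable[[str], List[str]]]` DEFAULT: `None` |
| `id_separator` | Separator inserted between the original segment ID and the sentence index when a segment is split. TYPE: `str` DEFAULT: `"."` |

| RETURNS | DESCRIPTION |
|---|---|
| `Document` | The modified document. |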
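The join-then-split behavior can be sketched over plain `(seg_id, text)` pairs. This is an illustration under stated assumptions, not the library's implementation: `sentencize_sketch` and `naive_split` are hypothetical names, segments are joined with a single space, the fallback splitter is a simple punctuation regex, and a 1-based suffix is always appended (the real method may add it only when a segment is actually split).

```python
import bisect
import re
from typing import Callable, Dict, List, Optional, Tuple


def naive_split(text: str) -> List[str]:
    """Hypothetical fallback: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentencize_sketch(
    segments: List[Tuple[str, str]],  # (seg_id, text), already in row-id order
    splitter: Optional[Callable[[str], List[str]]] = None,
    id_separator: str = ".",
) -> List[Tuple[str, str]]:
    splitter = splitter or naive_split

    # Join all texts into one block and remember where each segment starts.
    starts: List[int] = []
    pieces: List[str] = []
    offset = 0
    for _, text in segments:
        text = text.strip()
        starts.append(offset)
        pieces.append(text)
        offset += len(text) + 1  # +1 for the joining space
    block = " ".join(pieces)

    result: List[Tuple[str, str]] = []
    counts: Dict[str, int] = {}
    cursor = 0
    for sentence in splitter(block):
        pos = block.find(sentence, cursor)  # where the sentence begins in the block
        if pos < 0:
            pos = cursor  # splitter altered the text; fall back to the cursor
        cursor = pos + len(sentence)
        # The owning segment is the last one starting at or before pos.
        owner = segments[bisect.bisect_right(starts, pos) - 1][0]
        counts[owner] = counts.get(owner, 0) + 1
        result.append((f"{owner}{id_separator}{counts[owner]}", sentence))
    return result


rows = [
    ("seg1", "First sentence. Second sentence starts"),
    ("seg2", "and ends here. Third sentence."),
]
for new_id, sentence in sentencize_sketch(rows):
    print(new_id, "|", sentence)
```

The sample input exercises both cases: `seg1` holds two sentence starts and is split into `seg1.1` and `seg1.2`, while the sentence that runs across the row boundary is merged and keeps the ID of the segment where it begins.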
save_plain
Write all segment texts to a plain-text file.
save_csv
Write all segments to a CSV file with seg_id and text columns.
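The signatures of `save_plain` and `save_csv` are not shown above, so the following is only a sketch of the two output shapes using the standard library. The helper names `write_plain` and `write_csv`, the stream arguments, and the newline separator for plain text are assumptions.

```python
import csv
import io
from typing import List, TextIO, Tuple


def write_csv(segments: List[Tuple[str, str]], stream: TextIO) -> None:
    """Write a header row followed by one (seg_id, text) row per segment."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["seg_id", "text"])
    writer.writerows(segments)


def write_plain(segments: List[Tuple[str, str]], stream: TextIO, delimiter: str = "\n") -> None:
    """Write only the segment texts, joined by the delimiter (assumption: newline)."""
    stream.write(delimiter.join(text for _, text in segments))


segments = [("a.1", "Hello, world"), ("a.2", "Second segment")]

csv_out = io.StringIO()
write_csv(segments, csv_out)

plain_out = io.StringIO()
write_plain(segments, plain_out)

print(csv_out.getvalue())
print(plain_out.getvalue())
```

The `csv` module quotes the first text automatically because it contains a comma, which keeps the file readable by any reader that expects `seg_id` and `text` columns.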