Document Module
Classes for representing and loading text collections.
TextSegment
An individual unit of text with an identifier.
locisimiles.document.TextSegment
TextSegment(
text: str,
seg_id: ID,
*,
row_id: int | None = None,
meta: Dict[str, Any] | None = None,
)
Atomic unit of text inside a document.
A TextSegment represents a single passage, sentence, or verse from a larger document. Each segment has a unique identifier and optional metadata.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| text | The raw text content of the segment. |
| id | Unique identifier for the segment (e.g., "verg. aen. 1.1"). |
| row_id | Position of the segment in the original document (0-indexed). |
| meta | Optional dictionary of additional metadata. |
Example
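A minimal sketch of building and inspecting a segment. To stay self-contained it defines a stand-in class that mirrors the constructor signature and attributes documented above; it is not the library's implementation, and the `ID` alias is assumed to accept strings or integers. With the package installed, `TextSegment` would presumably be imported from `locisimiles.document` instead.

```python
from typing import Any, Dict, Optional, Union

# Assumption: the library's ID alias is not shown in this reference.
ID = Union[str, int]


class TextSegment:
    """Stand-in mirroring the documented signature; not the library class."""

    def __init__(
        self,
        text: str,
        seg_id: ID,
        *,
        row_id: Optional[int] = None,
        meta: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.text = text
        self.id = seg_id  # the constructor takes seg_id, the attribute is id
        self.row_id = row_id
        self.meta = meta


segment = TextSegment(
    "Arma virumque cano, Troiae qui primus ab oris",
    "verg. aen. 1.1",
    row_id=0,
    meta={"book": 1, "line": 1},
)
print(f"{segment.id}: {segment.text}")
```

Note that `row_id` and `meta` are keyword-only, so they cannot be passed positionally.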
Document
A collection of text segments with loading utilities.
locisimiles.document.Document
Document(
path: str | Path,
*,
author: str | None = None,
meta: Dict[str, Any] | None = None,
segment_delimiter: str = "\n",
)
Collection of text segments representing a document.
A Document is a container for TextSegments loaded from a file. It supports CSV/TSV files with 'seg_id' and 'text' columns, or plain text files where segments are separated by a delimiter.
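The two input layouts can be illustrated with the standard library alone. This is a sketch of the file formats described above, not the loader that `Document` uses internally; the helper names `read_delimited` and `read_plain` and the `seg_<n>` IDs generated for plain text are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def read_delimited(source: str, delimiter: str = ",") -> List[Tuple[str, str]]:
    """Parse CSV/TSV text with 'seg_id' and 'text' columns into (id, text) pairs."""
    reader = csv.DictReader(io.StringIO(source), delimiter=delimiter)
    return [(row["seg_id"], row["text"]) for row in reader]


def read_plain(source: str, segment_delimiter: str = "\n") -> List[Tuple[str, str]]:
    """Split plain text on the delimiter, skipping empty pieces.

    The 'seg_<n>' IDs are an assumption made for this sketch.
    """
    pieces = [piece.strip() for piece in source.split(segment_delimiter)]
    kept = [piece for piece in pieces if piece]
    return [(f"seg_{i}", piece) for i, piece in enumerate(kept)]


csv_text = (
    "seg_id,text\n"
    "verg. aen. 1.1,Arma virumque cano\n"
    "verg. aen. 1.2,Italiam fato profugus\n"
)
tsv_text = csv_text.replace(",", "\t")
plain_text = "Arma virumque cano\nItaliam fato profugus\n"

from_csv = read_delimited(csv_text)
from_tsv = read_delimited(tsv_text, delimiter="\t")
from_plain = read_plain(plain_text)
print(from_csv)
print(from_plain)
```

CSV and TSV differ only in the field delimiter, so both yield the same pairs here.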
| ATTRIBUTE | DESCRIPTION |
|---|---|
| path | Path to the source file. |
| author | Optional author name for the document. |
| meta | Optional dictionary of document-level metadata. |
Example
from locisimiles.document import Document

# Load from CSV (must have 'seg_id' and 'text' columns)
vergil = Document("vergil_samples.csv", author="Vergil")

# Access segments
print(len(vergil))  # Number of segments
print(vergil.ids())  # List of segment IDs
print(vergil.get_text("verg. aen. 1.1"))  # Get text by ID

# Iterate over segments
for segment in vergil:
    print(f"{segment.id}: {segment.text[:50]}...")

# Add custom segments
vergil.add_segment(
    text="Custom text",
    seg_id="custom.1",
    meta={"source": "manual"},
)
head
head(n: int = 5) -> List[TextSegment]
Return the first n segments in document order.
This is a lightweight convenience method for inspection and mirrors
the semantics of common tabular head() helpers.
add_segment
add_segment(
text: str,
seg_id: ID,
*,
row_id: int | None = None,
meta: Dict[str, Any] | None = None,
) -> None
Add a new text segment to the document.
statistics
Return descriptive statistics (segment count, char/word totals, averages, min/max).
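As a rough illustration of the kind of summary listed above, the sketch below computes those figures from a list of segment texts. The function name `describe`, the key names, and the whitespace-based word count are assumptions; the keys returned by `statistics` itself are not specified in this reference.

```python
from typing import Dict, List, Union


def describe(texts: List[str]) -> Dict[str, Union[int, float]]:
    """Summarize segment texts; key names are hypothetical."""
    char_counts = [len(text) for text in texts]
    word_counts = [len(text.split()) for text in texts]  # whitespace tokens
    n = len(texts)
    return {
        "num_segments": n,
        "total_chars": sum(char_counts),
        "total_words": sum(word_counts),
        "avg_chars": sum(char_counts) / n if n else 0.0,
        "avg_words": sum(word_counts) / n if n else 0.0,
        "min_chars": min(char_counts, default=0),
        "max_chars": max(char_counts, default=0),
    }


stats = describe(["Arma virumque cano", "Italiam fato profugus"])
print(stats)
```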
to_dataframe
Return document segments as a pandas DataFrame in document order.
The resulting DataFrame contains one row per segment with the
columns seg_id, text, row_id, and meta.
to_dict
Return the document as a plain Python dictionary.
The result contains document-level metadata and an ordered list of
segment records that can be passed to from_dict.
from_dataframe
classmethod
from_dataframe(
df: DataFrame,
*,
path: str | Path = "<memory>",
author: str | None = None,
meta: Dict[str, Any] | None = None,
) -> Document
Construct a document from a pandas DataFrame.
The DataFrame must contain seg_id and text columns. Optional
row_id and meta columns are used when present.
from_dict
classmethod
from_dict(
data: Dict[str, Any],
*,
path: str | Path | None = None,
author: str | None = None,
meta: Dict[str, Any] | None = None,
) -> Document
Construct a document from a dictionary produced by to_dict.
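The round trip between the dictionary form and a document can be pictured with plain data. The sketch below uses stand-in functions, and the key names (`author`, `segments`) and the rule that an explicit `author` argument overrides the stored value are assumptions; the actual dictionary layout is not documented here.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Segment = Dict[str, Any]


def to_dict(author: Optional[str], segments: List[Segment]) -> Dict[str, Any]:
    """Stand-in: document-level metadata plus an ordered list of records."""
    return {"author": author, "segments": [dict(seg) for seg in segments]}


def from_dict(
    data: Dict[str, Any], *, author: Optional[str] = None
) -> Tuple[Optional[str], List[Segment]]:
    """Stand-in: an explicit author overrides the stored one (assumed)."""
    resolved = author if author is not None else data.get("author")
    return resolved, [dict(seg) for seg in data["segments"]]


segments = [
    {"seg_id": "verg. aen. 1.1", "text": "Arma virumque cano", "row_id": 0},
    {"seg_id": "verg. aen. 1.2", "text": "Italiam fato profugus", "row_id": 1},
]

# A plain dictionary survives JSON serialization, which is the usual
# reason to prefer it over a richer in-memory object.
payload = json.dumps(to_dict("Vergil", segments))
restored_author, restored_segments = from_dict(json.loads(payload))
print(restored_author, len(restored_segments))
```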
clean
clean(
*,
normalize_unicode: bool = True,
normalize_quotes: bool = True,
normalize_dashes: bool = True,
collapse_whitespace: bool = True,
strip_whitespace: bool = True,
drop_empty: bool = True,
) -> Document
Apply conservative text cleanup to all segments in place.
The default behavior is designed to be safe for contextual retrieval: it normalizes Unicode composition, quote and dash variants, and whitespace without changing case, punctuation, or Latin orthography.
| PARAMETER | DESCRIPTION |
|---|---|
| normalize_unicode | Normalize text to Unicode NFC. |
| normalize_quotes | Replace curly and angled quotation variants with plain ASCII quotes/apostrophes. |
| normalize_dashes | Normalize Unicode dash variants. |
| collapse_whitespace | Collapse repeated whitespace to single spaces. |
| strip_whitespace | Remove leading and trailing whitespace. |
| drop_empty | Remove segments that become empty after cleaning. |
| RETURNS | DESCRIPTION |
|---|---|
| Document | The same Document, cleaned in place. |
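The cleanup steps can be approximated on plain strings as follows. This is a sketch under stated assumptions rather than the method's actual code: the character tables are illustrative and may not match the library's, the ASCII hyphen as the dash replacement is assumed, and `clean_text` and `clean_all` are hypothetical helper names.

```python
import re
import unicodedata
from typing import List

# Illustrative mappings; the library's exact character sets are not documented.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u00ab": '"', "\u00bb": '"',
})
# Assumption: dash variants map to the ASCII hyphen-minus.
DASH_TABLE = str.maketrans({
    "\u2010": "-", "\u2011": "-", "\u2012": "-",
    "\u2013": "-", "\u2014": "-", "\u2015": "-",
})


def clean_text(
    text: str,
    *,
    normalize_unicode: bool = True,
    normalize_quotes: bool = True,
    normalize_dashes: bool = True,
    collapse_whitespace: bool = True,
    strip_whitespace: bool = True,
) -> str:
    """Apply each enabled step in turn; case and punctuation are untouched."""
    if normalize_unicode:
        text = unicodedata.normalize("NFC", text)
    if normalize_quotes:
        text = text.translate(QUOTE_TABLE)
    if normalize_dashes:
        text = text.translate(DASH_TABLE)
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text)
    if strip_whitespace:
        text = text.strip()
    return text


def clean_all(texts: List[str], *, drop_empty: bool = True) -> List[str]:
    """Clean every text and optionally drop the ones left empty."""
    cleaned = [clean_text(text) for text in texts]
    return [text for text in cleaned if text] if drop_empty else cleaned


sample = "  \u201cArma\u201d   virumque \u2014 cano "
print(clean_text(sample))
```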
sentencize
sentencize(
*,
splitter: Optional[Callable[[str], List[str]]] = None,
id_separator: str = ".",
) -> Document
Re-segment this document so that each segment contains exactly one sentence.
All segment texts are first joined (in row-id order) and then sentence-split as a single block. This correctly handles:
- Segments containing multiple sentences → split into separate segments.
- A single sentence spanning multiple rows → merged into one segment.
New segment IDs are derived from the original segment whose text
starts the sentence, with a numeric suffix appended (e.g.
"seg1.1", "seg1.2").
| PARAMETER | DESCRIPTION |
|---|---|
| splitter | A callable that takes a string and returns a list of sentence strings. |
| id_separator | Separator inserted between the original segment ID and the sentence index when a segment is split (e.g. "seg1.1" with the default "."). |
| RETURNS | DESCRIPTION |
|---|---|
| Document | The modified Document. |
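The join-then-split behaviour described above can be sketched on (id, text) pairs. This is an illustration, not the library's implementation: the regex-based `naive_split` is a hypothetical stand-in for a real sentence splitter, texts are assumed to be joined with a single space, and for simplicity a suffix is appended to every new ID.

```python
import re
from bisect import bisect_right
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (segment ID, text)


def naive_split(text: str) -> List[str]:
    """Split after runs of '.', '!' or '?'. Abbreviations will confuse it."""
    return [m.group().strip() for m in re.finditer(r"\S[^.!?]*[.!?]*", text)]


def sentencize(
    segments: List[Pair],
    splitter: Callable[[str], List[str]] = naive_split,
    id_separator: str = ".",
) -> List[Pair]:
    # Record where each original segment starts inside the joined text.
    starts: List[int] = []
    offset = 0
    for _, text in segments:
        starts.append(offset)
        offset += len(text) + 1  # +1 for the joining space
    joined = " ".join(text for _, text in segments)

    result: List[Pair] = []
    counters: Dict[str, int] = {}
    cursor = 0
    for sentence in splitter(joined):
        pos = joined.find(sentence, cursor)
        if pos < 0:  # the splitter changed the text; fall back to the cursor
            pos = cursor
        cursor = pos + len(sentence)
        # The sentence belongs to the segment in which it starts.
        owner_id = segments[bisect_right(starts, pos) - 1][0]
        counters[owner_id] = counters.get(owner_id, 0) + 1
        result.append((f"{owner_id}{id_separator}{counters[owner_id]}", sentence))
    return result


rows = [
    ("seg1", "Arma virumque cano. Troiae qui"),
    ("seg2", "primus ab oris."),
    ("seg3", "Italiam fato profugus."),
]
sentences = sentencize(rows)
for seg_id, text in sentences:
    print(seg_id, text)
```

In this input the second sentence starts in `seg1` and ends in `seg2`, so it is merged and takes its ID from `seg1`; no output segment is derived from `seg2`.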
save_plain
Write all segment texts to a plain-text file.
save_csv
Write all segments to a CSV file with seg_id and text columns.
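The two output layouts can be reproduced with the standard library, which may help when checking exported files. The helper functions below are stand-ins operating on (id, text) pairs, and the newline separator used for the plain-text file is an assumption; the method signatures are not shown in this reference.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Tuple

Pair = Tuple[str, str]  # (segment ID, text)


def save_csv(segments: List[Pair], path: Path) -> None:
    """Write a header row followed by one row per segment."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["seg_id", "text"])
        writer.writerows(segments)


def save_plain(segments: List[Pair], path: Path, delimiter: str = "\n") -> None:
    """Write only the texts, joined by the delimiter (assumed newline)."""
    path.write_text(delimiter.join(text for _, text in segments), encoding="utf-8")


segments = [
    ("verg. aen. 1.1", "Arma virumque cano"),
    ("verg. aen. 1.2", "Italiam fato profugus"),
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "vergil.csv"
    txt_path = Path(tmp) / "vergil.txt"
    save_csv(segments, csv_path)
    save_plain(segments, txt_path)
    # Read both files back to confirm what was written.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    plain = txt_path.read_text(encoding="utf-8")

print(rows)
print(plain)
```

The plain-text form drops the segment IDs, so only the CSV form preserves enough information to rebuild the same segments.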