Document Module¶

Classes for representing and loading text collections.

TextSegment¶

An individual unit of text with an identifier.

locisimiles.document.TextSegment ¶

TextSegment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
)

Atomic unit of text inside a document.

A TextSegment represents a single passage, sentence, or verse from a larger document. Each segment has a unique identifier and optional metadata.

ATTRIBUTE	DESCRIPTION
`text`	The raw text content of the segment. TYPE: `str`
`id`	Unique identifier for the segment (e.g., "verg. aen. 1.1"). TYPE: `ID`
`row_id`	Position of the segment in the original document (0-indexed). TYPE: `int \| None`
`meta`	Optional dictionary of additional metadata. TYPE: `Dict[str, Any]`

Example

from locisimiles.document import TextSegment

segment = TextSegment(
    text="Arma virumque cano, Troiae qui primus ab oris",
    seg_id="verg. aen. 1.1",
    row_id=0,
    meta={"book": 1, "line": 1}
)
print(segment.text)  # "Arma virumque cano..."
print(segment.id)    # "verg. aen. 1.1"

Document¶

A collection of text segments with loading utilities.

locisimiles.document.Document ¶

Document(
    path: str | Path,
    *,
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
    segment_delimiter: str = "\n",
)

Collection of text segments representing a document.

A Document is a container for TextSegments loaded from a file. It supports CSV/TSV files with 'seg_id' and 'text' columns, or plain text files where segments are separated by a delimiter.

ATTRIBUTE	DESCRIPTION
`path`	Path to the source file. TYPE: `Path`
`author`	Optional author name for the document. TYPE: `str \| None`
`meta`	Optional dictionary of document-level metadata. TYPE: `Dict[str, Any]`
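As a rough illustration of the CSV layout described above, the following sketch writes a two-row file with `seg_id` and `text` columns using only the standard library. The `write_segments_csv` helper, the temporary file location, and the second segment ID are assumptions made for this example and are not part of `locisimiles`.

```python
import csv
import tempfile
from pathlib import Path


def write_segments_csv(path, rows):
    """Write (seg_id, text) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["seg_id", "text"])  # the two required column names
        writer.writerows(rows)
    return Path(path)


rows = [
    ("verg. aen. 1.1", "Arma virumque cano, Troiae qui primus ab oris"),
    ("verg. aen. 1.2", "Italiam fato profugus Laviniaque venit"),
]
csv_path = write_segments_csv(Path(tempfile.gettempdir()) / "vergil_samples.csv", rows)

# Read the file back to confirm the header and the row count.
with open(csv_path, newline="", encoding="utf-8") as handle:
    records = list(csv.DictReader(handle))
print(list(records[0].keys()), len(records))
```

A file shaped like this should be loadable with `Document(csv_path, author="Vergil")`.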

Example

from locisimiles.document import Document

# Load from CSV (must have 'seg_id' and 'text' columns)
vergil = Document("vergil_samples.csv", author="Vergil")

# Access segments
print(len(vergil))           # Number of segments
print(vergil.ids())          # List of segment IDs
print(vergil.get_text("verg. aen. 1.1"))  # Get text by ID

# Iterate over segments
for segment in vergil:
    print(f"{segment.id}: {segment.text[:50]}...")

# Add custom segments
vergil.add_segment(
    text="Custom text",
    seg_id="custom.1",
    meta={"source": "manual"}
)

ids ¶

ids() -> List[ID]

Return segment IDs in original order.

get_text ¶

get_text(seg_id: ID) -> str

Return raw text of a segment.

head ¶

head(n: int = 5) -> List[TextSegment]

Return the first n segments in document order.

This is a lightweight convenience method for inspection, analogous to the head() helpers of tabular libraries such as pandas.

add_segment ¶

add_segment(
    text: str,
    seg_id: ID,
    *,
    row_id: int | None = None,
    meta: Dict[str, Any] | None = None,
) -> None

Add a new text segment to the document.

remove_segment ¶

remove_segment(seg_id: ID) -> None

Delete a segment if present.

statistics ¶

statistics() -> Dict[str, Any]

Return descriptive statistics (segment count, char/word totals, averages, min/max).
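To make the listed quantities concrete, here is a minimal sketch that computes them for a plain list of strings. It is not the library's implementation: the `describe_segments` function and every key name in the returned dictionary are hypothetical, and words are simply counted by splitting on whitespace.

```python
from statistics import mean


def describe_segments(texts):
    """Compute simple length statistics for a list of segment texts.

    The key names are illustrative and may differ from what
    Document.statistics() actually returns.
    """
    if not texts:
        return {"num_segments": 0}
    char_counts = [len(t) for t in texts]
    word_counts = [len(t.split()) for t in texts]  # whitespace-delimited words
    return {
        "num_segments": len(texts),
        "total_chars": sum(char_counts),
        "total_words": sum(word_counts),
        "avg_chars": mean(char_counts),
        "avg_words": mean(word_counts),
        "min_chars": min(char_counts),
        "max_chars": max(char_counts),
    }


stats = describe_segments(["Arma virumque cano", "Troiae qui primus ab oris"])
print(stats)
```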

to_dataframe ¶

to_dataframe() -> DataFrame

Return document segments as a pandas DataFrame in document order.

The resulting DataFrame contains one row per segment with the columns seg_id, text, row_id, and meta.

to_dict ¶

to_dict() -> Dict[str, Any]

Return the document as a plain Python dictionary.

The result contains document-level metadata and an ordered list of segment records that can be passed to `from_dict`.

from_dataframe `classmethod` ¶

from_dataframe(
    df: DataFrame,
    *,
    path: str | Path = "<memory>",
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
) -> Document

Construct a document from a pandas DataFrame.

The DataFrame must contain seg_id and text columns. Optional row_id and meta columns are used when present.

from_dict `classmethod` ¶

from_dict(
    data: Dict[str, Any],
    *,
    path: str | Path | None = None,
    author: str | None = None,
    meta: Dict[str, Any] | None = None,
) -> Document

Construct a document from a dictionary produced by `to_dict`.

clean ¶

clean(
    *,
    normalize_unicode: bool = True,
    normalize_quotes: bool = True,
    normalize_dashes: bool = True,
    collapse_whitespace: bool = True,
    strip_whitespace: bool = True,
    drop_empty: bool = True,
) -> Document

Apply conservative text cleanup to all segments in-place.

The default behavior is designed to be safe for contextual retrieval: it normalizes Unicode composition, quote and dash variants, and whitespace without changing case, punctuation, or Latin orthography.

PARAMETER	DESCRIPTION
`normalize_unicode`	Normalize text to Unicode NFC. TYPE: `bool` DEFAULT: `True`
`normalize_quotes`	Replace curly and angled quotation variants with plain ASCII quotes/apostrophes. TYPE: `bool` DEFAULT: `True`
`normalize_dashes`	Replace Unicode dash variants with `-`. TYPE: `bool` DEFAULT: `True`
`collapse_whitespace`	Collapse repeated whitespace to single spaces. TYPE: `bool` DEFAULT: `True`
`strip_whitespace`	Remove leading and trailing whitespace. TYPE: `bool` DEFAULT: `True`
`drop_empty`	Remove segments that become empty after cleaning. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Document`	The same `Document` instance after in-place cleaning.
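The cleanup steps in the parameter table can be approximated with the standard library alone. The sketch below is an assumption about how such a pipeline might look, not the code behind `clean()`: the `conservative_clean` function is hypothetical, and the quote and dash mappings cover only a few common variants.

```python
import re
import unicodedata

# Illustrative mappings only; the library may cover more variants.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u00ab": '"', "\u00bb": '"',
    "\u2018": "'", "\u2019": "'",
})
DASH_MAP = str.maketrans({"\u2010": "-", "\u2013": "-", "\u2014": "-"})


def conservative_clean(text):
    """Apply the cleanup steps in the order of the parameter table above."""
    text = unicodedata.normalize("NFC", text)   # normalize_unicode
    text = text.translate(QUOTE_MAP)            # normalize_quotes
    text = text.translate(DASH_MAP)             # normalize_dashes
    text = re.sub(r"\s+", " ", text)            # collapse_whitespace
    return text.strip()                         # strip_whitespace


raw_segments = ["  \u201cArma\u201d   virumque \u2014 cano  ", "   "]

# drop_empty: discard segments that are empty after cleaning.
cleaned = [c for c in map(conservative_clean, raw_segments) if c]
print(cleaned)
```

Case and punctuation other than quotes and dashes are left untouched, which matches the conservative intent described above.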

sentencize ¶

sentencize(
    *,
    splitter: Optional[Callable[[str], List[str]]] = None,
    id_separator: str = ".",
) -> Document

Re-segment this document so that each segment contains exactly one sentence.

All segment texts are first joined (in row-id order) and then sentence-split as a single block. This correctly handles:

- Segments containing multiple sentences → split into separate segments.
- A single sentence spanning multiple rows → merged into one segment.

New segment IDs are derived from the original segment whose text starts the sentence, with a numeric suffix appended (e.g. "seg1.1", "seg1.2").

PARAMETER	DESCRIPTION
`splitter`	A callable that takes a `str` and returns a list of sentence strings. When `None`, a simple punctuation-based splitter is used. To use spaCy, load a pipeline with `nlp = spacy.load("la_core_web_lg")` and pass `splitter=lambda t: [s.text for s in nlp(t).sents]`. TYPE: `Optional[Callable[[str], List[str]]]` DEFAULT: `None`
`id_separator`	Separator inserted between the original segment ID and the sentence index when a segment is split (e.g. `"seg1"` → `"seg1.1"`, `"seg1.2"`). TYPE: `str` DEFAULT: `'.'`

RETURNS	DESCRIPTION
`Document`	The modified `Document` with one sentence per segment.

Example

from locisimiles.document import Document

doc = Document("mixed.csv")
doc.sentencize()
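The join-then-split behavior and the ID derivation described above can be sketched without the library. Everything here is an assumption made for illustration: the `resegment` and `simple_split` functions are hypothetical, segments are joined with a single space, and a numeric suffix is appended to every new ID, which may not match how `sentencize()` treats segments that were not split.

```python
import bisect
import re


def simple_split(text):
    """Split after ., ! or ? followed by whitespace (a naive stand-in splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def resegment(segments, splitter=simple_split, id_separator="."):
    """Join segment texts, split into sentences, and derive new IDs.

    segments: list of (seg_id, text) pairs in document order.
    Assumes the splitter returns sentences that occur verbatim, in order,
    in the joined text.
    """
    starts, offset = [], 0
    for _, text in segments:
        starts.append(offset)        # where this segment begins in the joined text
        offset += len(text) + 1      # +1 for the joining space
    joined = " ".join(text for _, text in segments)

    result, counters, cursor = [], {}, 0
    for sentence in splitter(joined):
        pos = joined.find(sentence, cursor)           # locate the sentence start
        if pos < 0:
            raise ValueError("splitter returned text not found in the joined document")
        cursor = pos + len(sentence)
        owner = bisect.bisect_right(starts, pos) - 1  # segment containing that offset
        base_id = segments[owner][0]
        counters[base_id] = counters.get(base_id, 0) + 1
        result.append((f"{base_id}{id_separator}{counters[base_id]}", sentence))
    return result


rows = [
    ("seg1", "Alea iacta est. Veni, vidi,"),
    ("seg2", "vici."),
    ("seg3", "Carpe diem."),
]
for new_id, sentence in resegment(rows):
    print(new_id, "->", sentence)
```

In this toy input, the sentence that begins in `seg1` and ends in `seg2` takes its ID from `seg1`, so no new segment carries the `seg2` prefix.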

save_plain ¶

save_plain(
    path: str | Path, *, delimiter: str = "\n"
) -> Path

Write all segment texts to a plain-text file.
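The plain-text format amounts to joining segment texts with the delimiter. The sketch below shows that idea with the standard library; the `write_plain` helper and the output file name are hypothetical, and whether `save_plain` writes a trailing delimiter is not specified here, so this version writes none.

```python
import tempfile
from pathlib import Path


def write_plain(texts, path, delimiter="\n"):
    """Write segment texts to a file, separated by the delimiter."""
    path = Path(path)
    path.write_text(delimiter.join(texts), encoding="utf-8")
    return path


texts = ["Arma virumque cano", "Troiae qui primus ab oris"]
out = write_plain(texts, Path(tempfile.gettempdir()) / "segments.txt")

# Splitting on the same delimiter recovers the segments, provided
# no segment text itself contains the delimiter.
restored = out.read_text(encoding="utf-8").split("\n")
print(restored == texts)
```

A file written this way lines up with the `segment_delimiter` parameter of the `Document` constructor, which defaults to the same `"\n"`.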

save_csv ¶

save_csv(path: str | Path) -> Path

Write all segments to a CSV file with seg_id and text columns.
