Cache and chunking internals¶

This page documents the small helpers under lmsyz_genai_ie_rfs.dataframe. Two concerns live together here: caching (SqliteCache, the SQLite-backed results store that makes extract_df resumable) and chunking (DataFrameIterator, the iterator that splits a DataFrame into per-call input blocks).

For a conceptual walkthrough of the results database, see Results database. For hands-on inspection of a live cache file, see Inspect the results database.

Caching¶

SqliteCache is the on-disk results store used by extract_df. Every completed row is written to it as it finishes. The cache is also prompt-hash aware: rows produced under a different prompt are invisible by default, so changing the prompt automatically re-executes affected rows.

SQLite-backed get/put cache for resuming interrupted runs.

Each row is stored with the hash of the prompt that produced it. On read, callers can pass prompt_hash= to restrict the result set to rows produced by that prompt. If the prompt has changed, the cached rows are invisible and extract_df re-runs them.
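The hash-gated lookup described above can be sketched with the standard sqlite3 module. This is a minimal illustration, not the library's implementation: the class name SketchCache, the table name results, the column layout, and JSON serialization of the row dict are all assumptions.

```python
import json
import sqlite3
from typing import Any, Optional


class SketchCache:
    """Hypothetical stand-in for SqliteCache; schema and names are assumptions."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "row_id TEXT PRIMARY KEY, result TEXT NOT NULL, prompt_hash TEXT)"
        )
        self.conn.commit()

    def put(self, row_id: str, result: dict, prompt_hash: Optional[str] = None) -> None:
        # Upsert: a later write for the same row_id replaces the earlier one.
        self.conn.execute(
            "INSERT OR REPLACE INTO results (row_id, result, prompt_hash) "
            "VALUES (?, ?, ?)",
            (row_id, json.dumps(result), prompt_hash),
        )
        self.conn.commit()

    def get(self, row_id: str, prompt_hash: Optional[str] = None) -> Optional[dict]:
        if prompt_hash is None:
            # No hash given: return the row regardless of which prompt produced it.
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ?", (row_id,)
            )
        else:
            # Hash given: the row is visible only if it was stored under this hash.
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ? AND prompt_hash = ?",
                (row_id, prompt_hash),
            )
        found = cur.fetchone()
        return json.loads(found[0]) if found else None


cache = SketchCache(":memory:")
cache.put("r1", {"label": "positive"}, prompt_hash="aaaa")

same_prompt = cache.get("r1", prompt_hash="aaaa")     # hit
changed_prompt = cache.get("r1", prompt_hash="bbbb")  # miss, so the row would be re-run
any_prompt = cache.get("r1")                          # hit, the hash is ignored
```

In this sketch a changed prompt never deletes anything. The old row stays on disk and is simply filtered out of the lookup.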

Backward compatible: caches created before the prompt_hash column existed are migrated with ALTER TABLE on first open. Rows whose hash is NULL are treated as not matching any specific hash.
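One way such a migration could work is to inspect the table's columns and add prompt_hash only when it is missing. The sketch below assumes a table named results and a helper named open_with_migration, both hypothetical. PRAGMA table_info and ALTER TABLE ... ADD COLUMN are standard SQLite.

```python
import sqlite3


def open_with_migration(conn: sqlite3.Connection) -> None:
    """Create the table if absent, then add prompt_hash to a legacy schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "row_id TEXT PRIMARY KEY, result TEXT NOT NULL, prompt_hash TEXT)"
    )
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
    if "prompt_hash" not in columns:
        # Existing rows receive NULL in the new column.
        conn.execute("ALTER TABLE results ADD COLUMN prompt_hash TEXT")
    conn.commit()


# Simulate a cache file written before the prompt_hash column existed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (row_id TEXT PRIMARY KEY, result TEXT NOT NULL)")
conn.execute("INSERT INTO results VALUES ('old', '{}')")
conn.commit()

open_with_migration(conn)

legacy_hash = conn.execute(
    "SELECT prompt_hash FROM results WHERE row_id = 'old'"
).fetchone()[0]

# In SQL, NULL = 'abc' is not true, so a legacy row never matches a concrete hash.
matching_rows = conn.execute(
    "SELECT COUNT(*) FROM results WHERE prompt_hash = ?", ("abc",)
).fetchone()[0]
```

The NULL comparison rule is what makes legacy rows invisible to hash-filtered reads without any extra branching.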

`__init__(db_path: Path) -> None` ¶

Create the DB file and table if absent; migrate legacy schema.

Parameters:

Name	Type	Description	Default
`db_path`	`Path`	Path to the SQLite file.	required

`get(row_id: str, prompt_hash: str | None = None) -> dict[str, Any] | None` ¶

Return the cached row dict, or None if no matching row is cached.

Parameters:

Name	Type	Description	Default
`row_id`	`str`	Row identifier.	required
`prompt_hash`	`str | None`	If given, return the row only if it was stored under this exact prompt hash. If None, return the cached row regardless of hash.	`None`

Returns:

Type	Description
`dict[str, Any] | None`	The cached dict, or None.

`put(row_id: str, result: dict[str, Any], prompt_hash: str | None = None) -> None` ¶

Upsert a result for row_id under prompt_hash.

Parameters:

Name	Type	Description	Default
`row_id`	`str`	Row identifier.	required
`result`	`dict[str, Any]`	The row dict to cache.	required
`prompt_hash`	`str | None`	Hash of the prompt that produced `result`.	`None`

`all_ids(prompt_hash: str | None = None) -> set[str]` ¶

Return cached row IDs, optionally filtered by prompt hash.

Parameters:

Name	Type	Description	Default
`prompt_hash`	`str | None`	If given, return only IDs whose cached row was stored under this hash. If None, return all IDs regardless.	`None`

Returns:

Type	Description
`set[str]`	Set of row IDs (possibly empty).
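A typical use of a hash-filtered ID set is deciding which input rows still need work when a run resumes. The sketch below is an assumption about how that could look, not extract_df's actual logic. The free function all_ids, the results table, and the sample hashes h1 and h2 are hypothetical.

```python
import sqlite3
from typing import Optional


def all_ids(conn: sqlite3.Connection, prompt_hash: Optional[str] = None) -> set:
    """Return cached row IDs, optionally restricted to one prompt hash."""
    if prompt_hash is None:
        rows = conn.execute("SELECT row_id FROM results")
    else:
        rows = conn.execute(
            "SELECT row_id FROM results WHERE prompt_hash = ?", (prompt_hash,)
        )
    return {row[0] for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (row_id TEXT PRIMARY KEY, result TEXT, prompt_hash TEXT)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("a", "{}", "h1"), ("b", "{}", "h2"), ("c", "{}", None)],
)
conn.commit()

input_ids = ["a", "b", "c", "d"]

# Only rows cached under the current prompt hash count as done.
done = all_ids(conn, prompt_hash="h1")
pending = [row_id for row_id in input_ids if row_id not in done]

# Without a hash, every cached ID is returned, including the legacy NULL row.
everything = all_ids(conn)
```

Here row b (cached under another prompt) and row c (legacy, NULL hash) both land in pending alongside the never-seen row d.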

compute_prompt_hash¶

extract_df stamps every cached row with compute_prompt_hash(prompt). Lookups are gated on this hash so a prompt change produces a cache miss and re-runs the row.

Return a short stable hash of prompt for cache invalidation.

Uses SHA-256 truncated to 16 hex chars. Stable across Python versions and machines, unlike the built-in hash().

Parameters:

Name	Type	Description	Default
`prompt`	`str`	Prompt text.	required

Returns:

Type	Description
`str`	16-character lowercase hex digest.
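A function with the documented behavior can be written in a few lines with hashlib. The name sketch_prompt_hash is hypothetical, and the UTF-8 encoding of the prompt is an assumption, since the source does not say how the text is converted to bytes.

```python
import hashlib


def sketch_prompt_hash(prompt: str) -> str:
    """SHA-256 of the prompt, truncated to the first 16 hex characters."""
    # Assumption: the prompt is encoded as UTF-8 before hashing.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


original = sketch_prompt_hash("Extract the company name from each text.")
repeated = sketch_prompt_hash("Extract the company name from each text.")
edited = sketch_prompt_hash("Extract the company name from each text!")
```

The same prompt always maps to the same 16-character digest, while a one-character edit produces a different one, which is what turns a prompt change into a cache miss.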

Chunking¶

DataFrameIterator is an internal helper that splits a DataFrame into fixed-size chunks of {input_id, input_text} dicts, ready to be serialized as the user message for each LLM call. Most users never touch it directly; it is documented here because it shows up in the public attribute chunk_size on extract_df.
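To make the chunk shape concrete, the sketch below builds one chunk by hand and serializes it. The source does not state the wire format, so JSON is an assumption here, and the sample IDs and texts are made up for illustration.

```python
import json

# One chunk in the documented default shape: a list of {input_id, input_text} dicts.
chunk = [
    {"input_id": "r1", "input_text": "Revenue grew 12% year over year."},
    {"input_id": "r2", "input_text": "The board approved a new buyback."},
]

# Assumption: the chunk is sent to the model as a JSON array in the user message.
user_message = json.dumps(chunk, ensure_ascii=False)

# Round-tripping shows the IDs survive, so responses can be matched back to rows.
round_trip_ids = [item["input_id"] for item in json.loads(user_message)]
```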

Chunk a DataFrame into formatted dicts for LLM input.

Attributes:

Name	Type	Description
`dataframe`	`DataFrame`	Source DataFrame.
`chunk_size`	`int`	Rows per chunk.
`id_col`	`str`	Source column name for identifiers.
`text_col`	`str`	Source column name for text.
`formatted_id_col`	`str`	Key used in each output dict for the identifier.
`formatted_text_col`	`str`	Key used in each output dict for the text.

`__init__(dataframe: pd.DataFrame, id_col: str, text_col: str, chunk_size: int = 5, formatted_id_col: str = 'input_id', formatted_text_col: str = 'input_text') -> None` ¶

Store the chunking configuration.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to iterate.	required
`id_col`	`str`	Column name with row identifiers.	required
`text_col`	`str`	Column name with text content.	required
`chunk_size`	`int`	Rows per chunk. Default 5.	`5`
`formatted_id_col`	`str`	Output-dict key for the identifier. Default "input_id".	`'input_id'`
`formatted_text_col`	`str`	Output-dict key for the text. Default "input_text".	`'input_text'`

`__iter__() -> DataFrameIterator` ¶

Reset and return self.

`__next__() -> list[dict[str, str]]` ¶

Return the next chunk of formatted dicts.

Raises:

Type	Description
`StopIteration`	When all rows have been yielded.

`__len__() -> int` ¶

Number of chunks (ceiling division).
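The iterator protocol documented above can be sketched without pandas by chunking a list of record dicts, such as the output of DataFrame.to_dict("records"). The class name SketchChunker is hypothetical, and converting values with str() is an assumption based on the list[dict[str, str]] return type.

```python
class SketchChunker:
    """Hypothetical stand-in for DataFrameIterator, operating on record dicts."""

    def __init__(self, records, id_col, text_col, chunk_size=5,
                 formatted_id_col="input_id", formatted_text_col="input_text"):
        self.records = records
        self.id_col = id_col
        self.text_col = text_col
        self.chunk_size = chunk_size
        self.formatted_id_col = formatted_id_col
        self.formatted_text_col = formatted_text_col
        self._pos = 0

    def __iter__(self):
        # Reset so the same object can be iterated more than once.
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self.records):
            raise StopIteration
        block = self.records[self._pos:self._pos + self.chunk_size]
        self._pos += self.chunk_size
        # Rename the source columns to the formatted output keys.
        return [
            {
                self.formatted_id_col: str(rec[self.id_col]),
                self.formatted_text_col: str(rec[self.text_col]),
            }
            for rec in block
        ]

    def __len__(self):
        # Ceiling division without floats: 7 rows at 3 per chunk gives 3 chunks.
        return -(-len(self.records) // self.chunk_size)


records = [{"doc_id": i, "body": f"text {i}"} for i in range(7)]
chunker = SketchChunker(records, id_col="doc_id", text_col="body", chunk_size=3)

chunks = list(chunker)
sizes = [len(chunk) for chunk in chunks]
```

With 7 records and a chunk size of 3, the last chunk is short rather than padded, and the chunk count matches the number of chunks actually yielded.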
