Cache and chunking internals

This page documents the small helpers under lmsyz_genai_ie_rfs.dataframe. Two concerns live together here: caching (SqliteCache, the SQLite-backed results store that makes extract_df resumable) and chunking (DataFrameIterator, the iterator that splits a DataFrame into per-call input blocks).

For a conceptual walkthrough of the results database, see Results database. For hands-on inspection of a live cache file, see Inspect the results database.


Caching

SqliteCache is the on-disk results store used by extract_df. Every completed row is written to it as it finishes, so an interrupted run can resume without repeating work. The cache is also prompt-hash aware: extract_df looks rows up by the hash of the current prompt, so rows produced under a different prompt are invisible and are re-run automatically after a prompt change.

SQLite-backed get/put cache for resuming interrupted runs.

Each row is stored with the hash of the prompt that produced it. On read, callers can pass prompt_hash= to restrict the result set to rows produced by the same prompt: if the prompt changed, cached rows are invisible, and extract_df will re-run them.

Backward compatible: caches created before the prompt_hash column existed are migrated with ALTER TABLE on first open. Rows whose hash is NULL are treated as not matching any specific hash.
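
The migration step described above can be sketched with the standard sqlite3 module. This is a minimal illustration, not the library's implementation: the table name results, the column names row_id and result, and the helper open_with_migration are all hypothetical. Only the prompt_hash column and the use of ALTER TABLE come from the description above.

```python
import os
import sqlite3
import tempfile


def open_with_migration(db_path: str) -> sqlite3.Connection:
    """Open a cache file, creating the table or adding prompt_hash if missing."""
    conn = sqlite3.connect(db_path)
    # Hypothetical layout: the real table and column names may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "row_id TEXT PRIMARY KEY, result TEXT NOT NULL, prompt_hash TEXT)"
    )
    # PRAGMA table_info yields one row per column; index 1 holds the name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
    if "prompt_hash" not in columns:
        # Legacy file: add the column. Existing rows get NULL in it.
        conn.execute("ALTER TABLE results ADD COLUMN prompt_hash TEXT")
    conn.commit()
    return conn


# Build a legacy-style file (no prompt_hash column), then reopen it.
path = os.path.join(tempfile.mkdtemp(), "legacy.db")
legacy = sqlite3.connect(path)
legacy.execute("CREATE TABLE results (row_id TEXT PRIMARY KEY, result TEXT NOT NULL)")
legacy.execute("INSERT INTO results VALUES ('r1', '{}')")
legacy.commit()
legacy.close()

conn = open_with_migration(path)
columns_after = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
legacy_hash = conn.execute(
    "SELECT prompt_hash FROM results WHERE row_id = 'r1'"
).fetchone()[0]
conn.close()
```

In this sketch the migrated legacy row ends up with a NULL hash (None in Python), which is why a lookup for any specific hash does not match it.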

__init__(db_path: Path) -> None

Create the DB file and table if absent; migrate legacy schema.

Parameters:

db_path (Path, required): Path to the SQLite file.

get(row_id: str, prompt_hash: str | None = None) -> dict[str, Any] | None

Return the cached row dict, or None.

Parameters:

row_id (str, required): Row identifier.
prompt_hash (str | None, default None): If given, only return the row if it was stored under this exact prompt hash. If None, return whatever is there regardless of hash.

Returns:

dict[str, Any] | None: The cached dict, or None.

put(row_id: str, result: dict[str, Any], prompt_hash: str | None = None) -> None

Upsert a result for row_id under prompt_hash.

Parameters:

row_id (str, required): Row identifier.
result (dict[str, Any], required): The row dict to cache.
prompt_hash (str | None, default None): Hash of the prompt that produced result.

all_ids(prompt_hash: str | None = None) -> set[str]

Return cached row IDs, optionally filtered by prompt hash.

Parameters:

prompt_hash (str | None, default None): If given, return only IDs whose cached row was stored under this hash. If None, return all IDs regardless.

Returns:

set[str]: Set of row IDs (possibly empty).
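
To make the get/put/all_ids semantics concrete, here is a small stand-in class that follows the documented behavior. It is a sketch under assumptions, not SqliteCache itself: the class name SketchCache, the table layout, the JSON serialization of results, and the way pending rows are computed at the end are all hypothetical.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Set


class SketchCache:
    """Illustrative stand-in for SqliteCache; not the library's code."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        # Hypothetical schema; the real column names may differ.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "row_id TEXT PRIMARY KEY, result TEXT NOT NULL, prompt_hash TEXT)"
        )

    def put(self, row_id: str, result: Dict[str, Any],
            prompt_hash: Optional[str] = None) -> None:
        # INSERT OR REPLACE acts as an upsert keyed on row_id.
        self.conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (row_id, json.dumps(result), prompt_hash),
        )
        self.conn.commit()

    def get(self, row_id: str,
            prompt_hash: Optional[str] = None) -> Optional[Dict[str, Any]]:
        if prompt_hash is None:
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ?", (row_id,)
            )
        else:
            # In SQL, NULL = 'x' is never true, so NULL-hash rows never match.
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ? AND prompt_hash = ?",
                (row_id, prompt_hash),
            )
        row = cur.fetchone()
        return json.loads(row[0]) if row else None

    def all_ids(self, prompt_hash: Optional[str] = None) -> Set[str]:
        if prompt_hash is None:
            cur = self.conn.execute("SELECT row_id FROM results")
        else:
            cur = self.conn.execute(
                "SELECT row_id FROM results WHERE prompt_hash = ?", (prompt_hash,)
            )
        return {row[0] for row in cur}


cache = SketchCache(":memory:")
cache.put("a", {"label": "x"}, prompt_hash="hash_v1")
cache.put("b", {"label": "y"}, prompt_hash="hash_v1")
cache.put("c", {"label": "z"})  # stored with a NULL hash

all_rows = {"a", "b", "c", "d"}
# Resuming under the same prompt: only uncached rows remain.
pending_same = all_rows - cache.all_ids("hash_v1")
# After a prompt change nothing matches, so every row is pending again.
pending_new = all_rows - cache.all_ids("hash_v2")
```

The set difference at the end is one plausible way a caller could decide which rows still need an LLM call; whether extract_df does exactly this is an assumption.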


compute_prompt_hash

extract_df stamps every cached row with compute_prompt_hash(prompt). Lookups are gated on this hash, so a prompt change produces a cache miss and the row is re-run.

Return a short stable hash of prompt for cache invalidation.

Uses SHA-256 truncated to 16 hex chars. Stable across Python versions and machines, unlike the built-in hash().

Parameters:

prompt (str, required): Prompt text.

Returns:

str: 16-character lowercase hex digest.
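
A hash with these properties can be sketched with hashlib. The function name sketch_prompt_hash is hypothetical, and encoding the prompt as UTF-8 is an assumption; the description above only specifies SHA-256 truncated to 16 hex characters.

```python
import hashlib


def sketch_prompt_hash(prompt: str) -> str:
    """Return the first 16 hex characters of the SHA-256 digest of prompt."""
    # UTF-8 is an assumption; the documented contract does not name an encoding.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


h1 = sketch_prompt_hash("Extract the company name.")
h2 = sketch_prompt_hash("Extract the company name and ticker.")
h1_again = sketch_prompt_hash("Extract the company name.")
```

The same prompt always produces the same 16-character string, while even a small edit to the prompt produces a different one, which is what makes the hash usable as a cache key.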


Chunking

DataFrameIterator is an internal helper that splits a DataFrame into chunks of at most chunk_size {input_id, input_text} dicts (the last chunk may be shorter), ready to be serialized as the user message for each LLM call. Most users never touch it directly; it is documented here because its chunk size is exposed through the public chunk_size parameter of extract_df.

Chunk a DataFrame into formatted dicts for LLM input.

Attributes:

dataframe: Source DataFrame.
chunk_size: Rows per chunk.
id_col: Source column name for identifiers.
text_col: Source column name for text.
formatted_id_col: Key used in each output dict for the identifier.
formatted_text_col: Key used in each output dict for the text.

__init__(dataframe: pd.DataFrame, id_col: str, text_col: str, chunk_size: int = 5, formatted_id_col: str = 'input_id', formatted_text_col: str = 'input_text') -> None

Store the chunking configuration.

Parameters:

dataframe (DataFrame, required): The DataFrame to iterate.
id_col (str, required): Column name with row identifiers.
text_col (str, required): Column name with text content.
chunk_size (int, default 5): Rows per chunk.
formatted_id_col (str, default 'input_id'): Output-dict key for the identifier.
formatted_text_col (str, default 'input_text'): Output-dict key for the text.

__iter__() -> DataFrameIterator

Reset and return self.

__next__() -> list[dict[str, str]]

Return the next chunk of formatted dicts.

Raises:

StopIteration: When all rows have been yielded.

__len__() -> int

Number of chunks (ceiling division).
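
The iterator protocol above can be sketched without pandas by treating the input as a list of row dicts (for a real DataFrame, df.to_dict("records") produces that shape). SketchChunker is a hypothetical stand-in, not DataFrameIterator itself, and converting values with str() is an assumption based on the list[dict[str, str]] return type.

```python
from typing import Any, Dict, List


class SketchChunker:
    """Illustrative stand-in for DataFrameIterator over plain row dicts."""

    def __init__(self, rows: List[Dict[str, Any]], id_col: str, text_col: str,
                 chunk_size: int = 5, formatted_id_col: str = "input_id",
                 formatted_text_col: str = "input_text") -> None:
        self.rows = rows
        self.id_col = id_col
        self.text_col = text_col
        self.chunk_size = chunk_size
        self.formatted_id_col = formatted_id_col
        self.formatted_text_col = formatted_text_col
        self._pos = 0

    def __iter__(self) -> "SketchChunker":
        # Reset the cursor so the object can be iterated more than once.
        self._pos = 0
        return self

    def __next__(self) -> List[Dict[str, str]]:
        if self._pos >= len(self.rows):
            raise StopIteration
        block = self.rows[self._pos:self._pos + self.chunk_size]
        self._pos += self.chunk_size
        # Rename the source columns to the output keys; str() is an assumption.
        return [
            {
                self.formatted_id_col: str(row[self.id_col]),
                self.formatted_text_col: str(row[self.text_col]),
            }
            for row in block
        ]

    def __len__(self) -> int:
        # Ceiling division without floats: -(-n // k).
        return -(-len(self.rows) // self.chunk_size)


rows = [{"doc_id": i, "body": f"text {i}"} for i in range(12)]
chunker = SketchChunker(rows, id_col="doc_id", text_col="body")
chunks = list(chunker)
chunk_sizes = [len(chunk) for chunk in chunks]
```

With 12 rows and the default chunk_size of 5, the sketch yields chunks of 5, 5, and 2 rows, and len(chunker) is 3, matching the ceiling division noted above.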


See also