Cache and chunking internals
This page documents the small helpers under lmsyz_genai_ie_rfs.dataframe. Two concerns live together here: caching (SqliteCache, the SQLite-backed results store that makes extract_df resumable) and chunking (DataFrameIterator, the iterator that splits a DataFrame into per-call input blocks).
For a conceptual walkthrough of the results database, see Results database. For hands-on inspection of a live cache file, see Inspect the results database.
Caching
SqliteCache is the on-disk results store used by extract_df. Every completed row is written to it as it finishes. The cache is also prompt-hash aware: rows produced under a different prompt are invisible by default, so changing the prompt automatically re-executes affected rows.
SQLite-backed get/put cache for resuming interrupted runs.
Each row is stored with the hash of the prompt that produced it. On
read, callers can pass prompt_hash= to restrict the result set to
rows produced by the same prompt: if the prompt changed, cached rows
are invisible, and extract_df will re-run them.
The cache is backward compatible: caches created before the prompt_hash
column existed are migrated with ALTER TABLE on first open. Rows whose
hash is NULL are treated as not matching any specific hash.
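The migration described above can be sketched with the standard sqlite3 module. This is a minimal illustration, not the library's actual code: the table name `results`, the column layout, and the helper `ensure_prompt_hash_column` are assumptions made for the example.

```python
import sqlite3


def ensure_prompt_hash_column(conn: sqlite3.Connection, table: str = "results") -> bool:
    """Add a prompt_hash column if the table lacks one. Return True if migrated."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "prompt_hash" in columns:
        return False
    # Pre-existing rows receive NULL, so they match no specific hash.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN prompt_hash TEXT")
    conn.commit()
    return True


# Simulate a legacy cache file (hypothetical schema) and migrate it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (row_id TEXT PRIMARY KEY, result TEXT)")
conn.execute("INSERT INTO results VALUES ('a', '{}')")
migrated = ensure_prompt_hash_column(conn)
legacy_hash = conn.execute(
    "SELECT prompt_hash FROM results WHERE row_id = 'a'"
).fetchone()[0]
```

Because legacy rows end up with a NULL hash, a lookup gated on any concrete hash skips them, which matches the behaviour stated above.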
__init__(db_path: Path) -> None
Create the DB file and table if absent; migrate legacy schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| db_path | Path | Path to the SQLite file. | required |
get(row_id: str, prompt_hash: str | None = None) -> dict[str, Any] | None
Return the cached row dict, or None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row_id | str | Row identifier. | required |
| prompt_hash | str \| None | If given, only return the row if it was stored under this exact prompt hash. If None, return whatever is there regardless of hash. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] \| None | The cached dict, or None. |
put(row_id: str, result: dict[str, Any], prompt_hash: str | None = None) -> None
Upsert a result for row_id under prompt_hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| row_id | str | Row identifier. | required |
| result | dict[str, Any] | The row dict to cache. | required |
| prompt_hash | str \| None | Hash of the prompt that produced the result. | None |
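To make the get/put semantics concrete, here is a minimal sketch of a prompt-hash-aware cache. `TinyCache`, its table layout, and the JSON serialization of results are assumptions for illustration only; the real SqliteCache may store rows differently.

```python
import json
import sqlite3
from typing import Any, Optional


class TinyCache:
    """Hypothetical stand-in for a prompt-hash-aware cache (not the library's code)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "row_id TEXT PRIMARY KEY, result TEXT NOT NULL, prompt_hash TEXT)"
        )

    def put(self, row_id: str, result: dict[str, Any],
            prompt_hash: Optional[str] = None) -> None:
        # INSERT OR REPLACE overwrites any existing row with the same primary key.
        self.conn.execute(
            "INSERT OR REPLACE INTO results (row_id, result, prompt_hash) "
            "VALUES (?, ?, ?)",
            (row_id, json.dumps(result), prompt_hash),
        )
        self.conn.commit()

    def get(self, row_id: str,
            prompt_hash: Optional[str] = None) -> Optional[dict[str, Any]]:
        if prompt_hash is None:
            # No hash given: return whatever is stored for this row.
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ?", (row_id,)
            )
        else:
            # Hash given: a NULL or different stored hash never equals it.
            cur = self.conn.execute(
                "SELECT result FROM results WHERE row_id = ? AND prompt_hash = ?",
                (row_id, prompt_hash),
            )
        row = cur.fetchone()
        return json.loads(row[0]) if row else None


cache = TinyCache()
cache.put("r1", {"label": "positive"}, prompt_hash="aaaa")
hit = cache.get("r1", prompt_hash="aaaa")    # stored under the same hash
miss = cache.get("r1", prompt_hash="bbbb")   # stored under a different hash
any_hash = cache.get("r1")                   # no hash filter
```

In this sketch a lookup under a different hash behaves exactly like a missing row, which is what lets a prompt change trigger a re-run.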
all_ids(prompt_hash: str | None = None) -> set[str]
Return cached row IDs, optionally filtered by prompt hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_hash | str \| None | If given, return only IDs whose cached row was stored under this hash. If None, return all IDs regardless. | None |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of row IDs (possibly empty). |
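One plausible use of the returned set is deciding which rows still need processing on resume. The helper below is a hypothetical sketch of that pattern, not a description of what extract_df does internally; `pending_ids` and the sample IDs are invented for the example.

```python
def pending_ids(all_row_ids: list[str], cached_ids: set[str]) -> list[str]:
    """Return the IDs that are not cached yet, preserving input order."""
    return [rid for rid in all_row_ids if rid not in cached_ids]


# `cached` stands in for the set returned by all_ids(prompt_hash=...).
cached = {"r1", "r3"}
todo = pending_ids(["r1", "r2", "r3", "r4"], cached)
```

Set membership checks are constant time on average, so this filter stays cheap even for large inputs.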
compute_prompt_hash
extract_df stamps every cached row with compute_prompt_hash(prompt). Lookups are gated on this hash so a prompt change produces a cache miss and re-runs the row.
Return a short stable hash of prompt for cache invalidation.
Uses SHA-256 truncated to 16 hex chars. Stable across Python versions
and machines, unlike the built-in hash().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Prompt text. | required |
Returns:
| Type | Description |
|---|---|
| str | 16-character lowercase hex digest. |
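A truncated SHA-256 of this kind can be sketched in a few lines with hashlib. The function name `short_prompt_hash` and the UTF-8 encoding step are assumptions; the source only states the algorithm and the 16-character length.

```python
import hashlib


def short_prompt_hash(prompt: str) -> str:
    """Return the first 16 hex characters of the SHA-256 digest of `prompt`."""
    # Assumption: the prompt is encoded as UTF-8 before hashing.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


h1 = short_prompt_hash("Classify the sentiment of each review.")
h2 = short_prompt_hash("Classify the sentiment of each review!")
```

Unlike the built-in hash(), which is randomized per process for strings, the digest is identical on every run, so a cache written on one machine remains valid on another.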
Chunking
DataFrameIterator is an internal helper that splits a DataFrame into fixed-size chunks of {input_id, input_text} dicts, ready to be serialized as the user message for each LLM call. Most users never touch it directly; it is documented here because its chunk size is exposed publicly as chunk_size on extract_df.
Chunk a DataFrame into formatted dicts for LLM input.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataframe | | Source DataFrame. |
| chunk_size | | Rows per chunk. |
| id_col | | Source column name for identifiers. |
| text_col | | Source column name for text. |
| formatted_id_col | | Key used in each output dict for the identifier. |
| formatted_text_col | | Key used in each output dict for the text. |
__init__(dataframe: pd.DataFrame, id_col: str, text_col: str, chunk_size: int = 5, formatted_id_col: str = 'input_id', formatted_text_col: str = 'input_text') -> None
Store the chunking configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataframe | DataFrame | The DataFrame to iterate. | required |
| id_col | str | Column name with row identifiers. | required |
| text_col | str | Column name with text content. | required |
| chunk_size | int | Rows per chunk. Default 5. | 5 |
| formatted_id_col | str | Output-dict key for the identifier. Default "input_id". | 'input_id' |
| formatted_text_col | str | Output-dict key for the text. Default "input_text". | 'input_text' |
__iter__() -> DataFrameIterator
Reset and return self.
__next__() -> list[dict[str, str]]
Return the next chunk of formatted dicts.
Raises:
| Type | Description |
|---|---|
| StopIteration | When all rows have been yielded. |
__len__() -> int
Number of chunks (ceiling division).
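The iterator protocol described above (reset on `__iter__`, fixed-size blocks from `__next__`, ceiling division in `__len__`) can be sketched as follows. `ChunkSketch` is a hypothetical stand-in that works on a plain list of row dicts, such as the output of pandas `df.to_dict("records")`, so the example runs without pandas. Converting values to `str` is also an assumption, chosen to match the documented `list[dict[str, str]]` return type.

```python
from typing import Any


class ChunkSketch:
    """Hypothetical chunker over row records; not the library's DataFrameIterator."""

    def __init__(self, rows: list[dict[str, Any]], id_col: str, text_col: str,
                 chunk_size: int = 5) -> None:
        self.rows = rows
        self.id_col = id_col
        self.text_col = text_col
        self.chunk_size = chunk_size
        self._pos = 0

    def __iter__(self) -> "ChunkSketch":
        self._pos = 0  # reset so the object can be iterated again
        return self

    def __next__(self) -> list[dict[str, str]]:
        if self._pos >= len(self.rows):
            raise StopIteration
        block = self.rows[self._pos:self._pos + self.chunk_size]
        self._pos += self.chunk_size
        return [
            {"input_id": str(r[self.id_col]), "input_text": str(r[self.text_col])}
            for r in block
        ]

    def __len__(self) -> int:
        # Ceiling division without floats: a partial final chunk still counts.
        return (len(self.rows) + self.chunk_size - 1) // self.chunk_size


rows = [{"doc": i, "body": f"text {i}"} for i in range(5)]
chunker = ChunkSketch(rows, id_col="doc", text_col="body", chunk_size=2)
chunks = list(chunker)
```

With five rows and a chunk size of two, the sketch yields chunks of two, two, and one row, and `len(chunker)` reports three.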
See also
- Results database: the conceptual picture, schema, hash gating.
- Inspect the results database: SQL recipes and Python patterns.
- Resume after a crash: the most common reason to care about the cache.