extract_df
extract_df is the primary entry point for concurrent extraction. It chunks a DataFrame,
sends each chunk to OpenAI or Anthropic in parallel via a ThreadPoolExecutor, and
returns a flat DataFrame of results. Every completed row is written to a SQLite file as
it finishes. An interrupted run resumes from where it stopped on the next call with the
same cache_path.
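The chunk-and-fan-out flow described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: rows are plain dicts instead of a DataFrame, and `chunk_rows`, `fake_llm_call`, and `run_concurrently` are hypothetical names, with `fake_llm_call` standing in for a real provider request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunk_rows(rows, chunk_size):
    """Split a list of row dicts into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def fake_llm_call(chunk):
    """Hypothetical stand-in for one provider request: one result per row."""
    return [{"id": row["id"], "length": len(row["text"])} for row in chunk]


def run_concurrently(rows, chunk_size=5, max_workers=20):
    """Submit every chunk to a thread pool and flatten the per-chunk results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_llm_call, c) for c in chunk_rows(rows, chunk_size)]
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception as exc:
                # A failed chunk is reported and skipped; other chunks still return.
                print(f"chunk failed: {exc}")
    # Completion order is nondeterministic, so sort for a stable output.
    return sorted(results, key=lambda r: r["id"])


rows = [{"id": i, "text": "x" * i} for i in range(12)]
out = run_concurrently(rows, chunk_size=5, max_workers=4)
```

With 12 rows and a chunk size of 5, the sketch issues three requests (5, 5, and 2 rows) and returns one result per input row.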
Run a prompt over each row of df concurrently and return a DataFrame.
The schema argument is optional. When omitted, the prompt alone
defines the output shape and the model returns free-form JSON. When
supplied, it enforces structure on both providers: OpenAI via
response_format={"type": "json_schema", ...}, Anthropic via forced
tool_use with the same schema.
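To make the two enforcement mechanisms concrete, the sketch below builds the request arguments each provider expects for one JSON schema. The helper names, the tool name `extraction`, the tool description, and the use of `strict` are assumptions for illustration; the library may construct these payloads differently.

```python
# One JSON schema, reused for both providers.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
    "additionalProperties": False,
}


def openai_kwargs(schema, name="extraction"):
    """Arguments for an OpenAI chat completion using structured outputs."""
    return {
        "response_format": {
            "type": "json_schema",
            # "strict" is an assumption here; it is optional in the API.
            "json_schema": {"name": name, "schema": schema, "strict": True},
        }
    }


def anthropic_kwargs(schema, name="extraction"):
    """Arguments for an Anthropic message that forces a single tool call."""
    return {
        "tools": [
            {
                "name": name,
                "description": "Return the extracted fields.",
                "input_schema": schema,
            }
        ],
        # Forcing this tool makes the model answer through the schema.
        "tool_choice": {"type": "tool", "name": name},
    }


openai_args = openai_kwargs(schema)
anthropic_args = anthropic_kwargs(schema)
```

Both payloads carry the identical schema object, which is what lets a single schema argument drive either backend.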
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame. Must contain | required |
| prompt | str | System prompt sent with every chunk. | required |
| schema | SchemaInput | Optional JSON schema. Accepts | None |
| backend | str | | 'openai' |
| model | str | Model identifier (e.g., | required |
| id_col | str | Column in | 'id' |
| text_col | str | Column in | 'text' |
| chunk_size | int | Rows per LLM request. | 5 |
| max_workers | int | ThreadPoolExecutor size. | 20 |
| cache_path | str \| Path | Required path to a SQLite file. Every completed row is written to this file as it finishes, so an interrupted run resumes without re-spending tokens. Each row is stamped with a hash of the prompt that produced it; a later run with a different prompt re-executes those rows automatically (override with | required |
| fresh | bool | If True, re-process every row regardless of cache contents. | False |
| ignore_prompt_hash | bool | If True, reuse cached rows even when the prompt has changed since they were written. Default False: changing the prompt invalidates the cache, which is usually what you want during iteration. Set True when resuming a run where you deliberately edited the prompt in a non-semantic way (e.g., typo fix). | False |
| api_key | str \| None | Override the API key from settings / env. | None |
| base_url | str \| None | Override the OpenAI base URL (for OpenRouter, Gemini compat). | None |
| client | Any | A pre-built SDK client. When given, | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame of result dicts, one row per input row. Rows whose chunks fail are logged and omitted; remaining chunks' results are returned. |
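The interaction between cache_path, fresh, and ignore_prompt_hash can be illustrated with a small SQLite sketch. The table layout, the SHA-256 hash, and the helper names (`prompt_hash`, `open_cache`, `save_row`, `pending_ids`) are all assumptions chosen for illustration; the library's actual cache schema is not documented here.

```python
import hashlib
import sqlite3


def prompt_hash(prompt):
    """Fingerprint a prompt so cached rows can be matched to it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def open_cache(path):
    """Open (or create) the cache file with one row per input id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "row_id TEXT PRIMARY KEY, prompt_hash TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def save_row(conn, row_id, prompt, payload):
    """Persist one completed row immediately, stamped with the prompt hash."""
    conn.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        (str(row_id), prompt_hash(prompt), payload),
    )
    conn.commit()


def pending_ids(conn, ids, prompt, fresh=False, ignore_prompt_hash=False):
    """Return the ids that still need an LLM call under the given flags."""
    if fresh:
        return list(ids)  # ignore the cache entirely
    if ignore_prompt_hash:
        cur = conn.execute("SELECT row_id FROM results")
    else:
        cur = conn.execute(
            "SELECT row_id FROM results WHERE prompt_hash = ?",
            (prompt_hash(prompt),),
        )
    done = {r[0] for r in cur}
    return [i for i in ids if str(i) not in done]


conn = open_cache(":memory:")  # an in-memory database, for the example only
save_row(conn, 1, "Extract the city.", '{"city": "Oslo"}')
```

In this sketch, a call with the original prompt skips row 1, a call with an edited prompt reprocesses it, and `ignore_prompt_hash=True` restores the skip despite the edit.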
Temperature rules
The _requires_temp_one function returns True for model families that only accept
temperature=1.0. All other models use temperature=0.0.
| Model family | Temperature enforced |
|---|---|
| o1, o1-mini, o1-preview, ... | 1.0 |
| o3, o3-mini, ... | 1.0 |
| gpt-5, gpt-5-mini, ... | 1.0 |
| Everything else | 0.0 (deterministic) |
The check uses the model name string: lower.startswith(("o1", "o3")) or "gpt-5" in lower.
You cannot override this in the concurrent path; it is automatic. In the batch path,
create_batch_jsonl accepts a temperature argument but overrides it to 1.0 for the
affected families.
See also
- Resume after a crash: how cache_path enables zero-loss restarts.
- Change the prompt safely: how prompt-hash invalidation works and when to use ignore_prompt_hash=True.
- Switch providers: switching between OpenAI, Anthropic, and Gemini.