extract_df

`extract_df` is the primary entry point for concurrent extraction. It chunks a DataFrame, sends each chunk to OpenAI or Anthropic in parallel via a `ThreadPoolExecutor`, and returns a flat DataFrame of results. Every completed row is written to a SQLite file as it finishes. An interrupted run resumes from where it stopped on the next call with the same `cache_path`.

Run a prompt over each row of df concurrently and return a DataFrame.

The `schema` argument is optional. When omitted, the prompt alone defines the output shape and the model returns free-form JSON. When supplied, it enforces structure on both providers: OpenAI via `response_format={"type": "json_schema", ...}`, Anthropic via forced `tool_use` with the same schema.
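As an illustration of the two enforcement routes, the sketch below builds the request keyword arguments each provider expects for a structured response. The row schema, the `all_results` wrapper shape, and the helper names `openai_kwargs` and `anthropic_kwargs` are assumptions made for this example; they are not taken from the library's source, and no API call is made.

```python
# Hypothetical row schema; the field names are illustrative only.
row_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["id", "sentiment"],
    "additionalProperties": False,
}

# Assumed shape of the wrapped response schema: one object per input row,
# collected under an `all_results` array.
response_schema = {
    "type": "object",
    "properties": {"all_results": {"type": "array", "items": row_schema}},
    "required": ["all_results"],
    "additionalProperties": False,
}


def openai_kwargs(schema: dict, name: str = "extraction") -> dict:
    """Keyword arguments for an OpenAI chat completion with a JSON schema."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema, "strict": True},
        }
    }


def anthropic_kwargs(schema: dict, name: str = "extraction") -> dict:
    """Keyword arguments that force Anthropic to answer through one tool."""
    return {
        "tools": [
            {
                "name": name,
                "description": "Return the extracted rows.",
                "input_schema": schema,
            }
        ],
        # Forcing the tool means the reply arrives as a tool_use block
        # whose input must conform to the schema.
        "tool_choice": {"type": "tool", "name": name},
    }
```

Both helpers receive the same `response_schema`, which is the point of the design: one schema definition, two provider-specific envelopes.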

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame. Must contain `id_col` and `text_col`.	required
`prompt`	`str`	System prompt sent with every chunk.	required
`schema`	`SchemaInput`	Optional JSON schema. Accepts `None`, a dict (full OpenAI wrapper, a response schema with `all_results`, or a row schema that will be auto-wrapped), or a `str`/`Path` pointing to a JSON file containing any of the above.	`None`
`backend`	`str`	`"openai"` or `"anthropic"`.	`'openai'`
`model`	`str`	Model identifier (e.g., `"gpt-4.1-mini"`).	required
`id_col`	`str`	Column in `df` holding row identifiers.	`'id'`
`text_col`	`str`	Column in `df` holding text content.	`'text'`
`chunk_size`	`int`	Rows per LLM request.	`5`
`max_workers`	`int`	ThreadPoolExecutor size.	`20`
`cache_path`	`str \| Path`	Required path to a SQLite file. Every completed row is written to this file as it finishes, so an interrupted run resumes without re-spending tokens. Each row is stamped with a hash of the prompt that produced it; a later run with a different prompt re-executes those rows automatically (override with `ignore_prompt_hash=True`). The file is never auto-deleted.	required
`fresh`	`bool`	If True, re-process every row regardless of cache contents.	`False`
`ignore_prompt_hash`	`bool`	If True, reuse cached rows even when the prompt has changed since they were written. Default False: changing the prompt invalidates the cache, which is usually what you want during iteration. Set True when resuming a run where you deliberately edited the prompt in a non-semantic way (e.g., typo fix).	`False`
`api_key`	`str \| None`	Override the API key from settings / env.	`None`
`base_url`	`str \| None`	Override the OpenAI base URL (e.g., for OpenRouter or Gemini's OpenAI-compatible endpoint).	`None`
`client`	`Any`	A pre-built SDK client. When given, `backend`, `api_key`, and `base_url` are only used to pick the call helper.	`None`
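The interaction between `cache_path`, `fresh`, and `ignore_prompt_hash` can be pictured with a small SQLite sketch. The table layout, the SHA-256 hash, and the helper names (`open_cache`, `save_row`, `pending_ids`) are assumptions for illustration; the library's actual schema and hashing scheme may differ.

```python
import hashlib
import sqlite3


def prompt_hash(prompt: str) -> str:
    """Stable fingerprint of the prompt (SHA-256 is an assumption here)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) the cache file with a hypothetical results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "row_id TEXT PRIMARY KEY, prompt_hash TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def save_row(conn: sqlite3.Connection, row_id: str, prompt: str, payload: str) -> None:
    """Persist one completed row immediately, stamped with the prompt hash."""
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (row_id, prompt_hash(prompt), payload),
        )


def pending_ids(conn, ids, prompt, fresh=False, ignore_prompt_hash=False):
    """Return the ids that still need an LLM call under the given flags."""
    if fresh:
        return list(ids)
    if ignore_prompt_hash:
        rows = conn.execute("SELECT row_id FROM results")
    else:
        rows = conn.execute(
            "SELECT row_id FROM results WHERE prompt_hash = ?",
            (prompt_hash(prompt),),
        )
    done = {r[0] for r in rows}
    return [i for i in ids if i not in done]
```

With one row cached under `"prompt v1"`, calling `pending_ids` with the same prompt skips that row, while a changed prompt schedules it again unless `ignore_prompt_hash=True` is passed.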

Returns:

Type	Description
`DataFrame`	DataFrame of result dicts, one row per successfully processed input row. Rows whose chunks fail are logged and omitted; remaining chunks' results are returned.
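The chunk-level behavior described here (split into chunks, run them in a thread pool, drop chunks that raise) can be sketched as follows. To stay dependency-free the example uses plain lists of dicts instead of a DataFrame, and `fake_llm_call` is a hypothetical stand-in for the provider request; none of this is the library's actual code.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def fake_llm_call(chunk: list[dict]) -> list[dict]:
    """Hypothetical stand-in for one LLM request; fails on empty text."""
    if any(not row["text"] for row in chunk):
        raise ValueError("empty text in chunk")
    return [{"id": row["id"], "n_chars": len(row["text"])} for row in chunk]


def run_chunks(rows: list[dict], chunk_size: int = 5, max_workers: int = 20) -> list[dict]:
    """Process rows chunk by chunk in parallel and flatten the results."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_llm_call, chunk) for chunk in chunks]
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception:
                # A failed chunk is logged and skipped; other chunks still count.
                logger.exception("chunk failed; its rows are omitted")
    return results
```

Because `as_completed` yields futures in completion order, the output order is not guaranteed to match the input order; callers that need a stable order would sort by the id column.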

Temperature rules

The `_requires_temp_one` function returns True for model families that only accept `temperature=1.0`. All other models use `temperature=0.0`.

Model family	Temperature enforced
`o1`, `o1-mini`, `o1-preview`, ...	1.0
`o3`, `o3-mini`, ...	1.0
`gpt-5`, `gpt-5-mini`, ...	1.0
Everything else	0.0 (deterministic)

The check uses the model name string: `lower.startswith(("o1", "o3")) or "gpt-5" in lower`. You cannot override this in the concurrent path; it is automatic. In the batch path, `create_batch_jsonl` accepts a `temperature` argument but overrides it to 1.0 for the affected families.
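A minimal re-creation of the rule, assuming `lower` is simply the lowercased model name, makes the edge cases easy to test. The function names below are hypothetical stand-ins, not the library's `_requires_temp_one` or `create_batch_jsonl`.

```python
def requires_temp_one(model: str) -> bool:
    """Mirror of the documented check, assuming `lower = model.lower()`."""
    lower = model.lower()
    return lower.startswith(("o1", "o3")) or "gpt-5" in lower


def effective_temperature(model: str, requested: float = 0.0) -> float:
    """Temperature that would be sent: forced to 1.0 for affected families."""
    return 1.0 if requires_temp_one(model) else requested


for name in ["o1-mini", "O3-mini", "gpt-5-mini", "gpt-4.1-mini", "openai/o3-mini"]:
    print(name, effective_temperature(name))
```

If the check is exactly as written, two consequences follow: the `o1`/`o3` test is a prefix match, so a provider-prefixed name such as `openai/o3-mini` would not trigger it, whereas the `gpt-5` test is a substring match and would still fire for `openai/gpt-5`.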
