extract_df

extract_df is the primary entry point for concurrent extraction. It chunks a DataFrame, sends each chunk to OpenAI or Anthropic in parallel via a ThreadPoolExecutor, and returns a flat DataFrame of results. Every completed row is written to a SQLite file as it finishes. An interrupted run resumes from where it stopped on the next call with the same cache_path.

Run a prompt over each row of df concurrently and return a DataFrame.

The schema argument is optional. When omitted, the prompt alone defines the output shape and the model returns free-form JSON. When supplied, it enforces structure on both providers: OpenAI via response_format={"type": "json_schema", ...}, Anthropic via forced tool_use with the same schema.
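To make the two enforcement paths concrete, here is a minimal sketch of how one row schema could be wrapped in an all_results envelope and then handed to each provider. The helper names (wrap_row_schema, openai_response_format, anthropic_tool_kwargs), the "extraction" name, and the use of "strict": True are assumptions made for illustration; the library's internal helpers may differ. Only the request shapes themselves (response_format with type json_schema, and tools plus tool_choice for a forced tool call) come from the provider APIs.

```python
from typing import Any


def wrap_row_schema(row_schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap a per-row schema in an object holding an ``all_results`` array.

    Hypothetical helper: it mirrors the "row schema that will be auto-wrapped"
    case described above, not the library's actual code.
    """
    return {
        "type": "object",
        "properties": {"all_results": {"type": "array", "items": row_schema}},
        "required": ["all_results"],
        "additionalProperties": False,
    }


def openai_response_format(schema: dict[str, Any], name: str = "extraction") -> dict[str, Any]:
    """Build the value for OpenAI's ``response_format`` parameter."""
    return {
        "type": "json_schema",
        # "strict" is an assumption here; the library may or may not set it.
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }


def anthropic_tool_kwargs(schema: dict[str, Any], name: str = "extraction") -> dict[str, Any]:
    """Build ``tools`` and ``tool_choice`` so the model must call one tool."""
    return {
        "tools": [
            {
                "name": name,
                "description": "Return the extracted rows.",
                "input_schema": schema,
            }
        ],
        "tool_choice": {"type": "tool", "name": name},
    }


# Example row schema (illustrative fields only).
row_schema = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "sentiment": {"type": "string"}},
    "required": ["id", "sentiment"],
    "additionalProperties": False,
}

response_schema = wrap_row_schema(row_schema)
openai_kwargs = {"response_format": openai_response_format(response_schema)}
anthropic_kwargs = anthropic_tool_kwargs(response_schema)
```

Both providers end up constrained by the same response_schema object, which is why a single schema argument can serve either backend.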

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | Input DataFrame. Must contain id_col and text_col. | required |
| prompt | str | System prompt sent with every chunk. | required |
| schema | SchemaInput | Optional JSON schema. Accepts None, a dict (full OpenAI wrapper, a response schema with all_results, or a row schema that will be auto-wrapped), or a str/Path pointing to a JSON file containing any of the above. | None |
| backend | str | "openai" or "anthropic". | 'openai' |
| model | str | Model identifier (e.g., "gpt-4.1-mini"). | required |
| id_col | str | Column in df holding row identifiers. | 'id' |
| text_col | str | Column in df holding text content. | 'text' |
| chunk_size | int | Rows per LLM request. | 5 |
| max_workers | int | ThreadPoolExecutor size. | 20 |
| cache_path | str \| Path | Required path to a SQLite file. Every completed row is written to this file as it finishes, so an interrupted run resumes without re-spending tokens. Each row is stamped with a hash of the prompt that produced it; a later run with a different prompt re-executes those rows automatically (override with ignore_prompt_hash=True). The file is never auto-deleted. | required |
| fresh | bool | If True, re-process every row regardless of cache contents. | False |
| ignore_prompt_hash | bool | If True, reuse cached rows even when the prompt has changed since they were written. Default False: changing the prompt invalidates the cache, which is usually what you want during iteration. Set True when resuming a run where you deliberately edited the prompt in a non-semantic way (e.g., typo fix). | False |
| api_key | str \| None | Override the API key from settings / env. | None |
| base_url | str \| None | Override the OpenAI base URL (for OpenRouter, Gemini compat). | None |
| client | Any | A pre-built SDK client. When given, backend, api_key, and base_url are only used to pick the call helper. | None |
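The interaction between cache_path, fresh, and ignore_prompt_hash can be pictured with a small SQLite sketch. Everything below is an assumption made for illustration: the table layout, the SHA-256 hash, and the function names are hypothetical and are not taken from the library. The sketch only reproduces the documented behaviour, namely that a row counts as done when it is cached under the current prompt's hash, unless fresh forces a re-run or ignore_prompt_hash accepts any cached row.

```python
import hashlib
import json
import sqlite3


def prompt_hash(prompt: str) -> str:
    """Stable fingerprint of the prompt text (SHA-256 is an assumption)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def open_cache(path: str) -> sqlite3.Connection:
    """Open the cache and create the (hypothetical) results table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "row_id TEXT PRIMARY KEY, prompt_hash TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def save_row(conn: sqlite3.Connection, row_id: str, prompt: str, result: dict) -> None:
    """Persist one finished row immediately, stamped with the prompt hash."""
    conn.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        (row_id, prompt_hash(prompt), json.dumps(result)),
    )
    conn.commit()


def pending_ids(conn, ids, prompt, *, fresh=False, ignore_prompt_hash=False):
    """Return the ids that still need an LLM call under the given flags."""
    if fresh:
        return list(ids)  # ignore the cache entirely
    if ignore_prompt_hash:
        rows = conn.execute("SELECT row_id FROM results").fetchall()
    else:
        rows = conn.execute(
            "SELECT row_id FROM results WHERE prompt_hash = ?",
            (prompt_hash(prompt),),
        ).fetchall()
    done = {row[0] for row in rows}
    return [i for i in ids if i not in done]


# An in-memory database keeps the example self-contained; a real run would
# pass a file path so results survive an interruption.
conn = open_cache(":memory:")
save_row(conn, "a", "Classify sentiment.", {"id": "a", "sentiment": "pos"})

ids = ["a", "b"]
same_prompt = pending_ids(conn, ids, "Classify sentiment.")
new_prompt = pending_ids(conn, ids, "Classify the sentiment.")
reuse_anyway = pending_ids(conn, ids, "Classify the sentiment.", ignore_prompt_hash=True)
forced = pending_ids(conn, ids, "Classify sentiment.", fresh=True)
```

With the same prompt only row "b" is pending; after editing the prompt both rows are pending again unless ignore_prompt_hash=True is passed.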

Returns:

| Type | Description |
| --- | --- |
| DataFrame | DataFrame of result dicts, one row per input row. Rows whose chunks fail are logged and omitted; remaining chunks' results are returned. |
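The chunk, fan-out, and omit-on-failure flow can be sketched with the standard library alone. This is an illustrative outline, not the library's implementation: rows are plain dicts rather than a DataFrame (something like df.to_dict("records") would produce them), and fake_llm_call is a hypothetical stand-in for the provider request.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def chunked(rows, size):
    """Yield consecutive slices of ``rows`` with at most ``size`` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def fake_llm_call(chunk):
    """Hypothetical stand-in for one provider request covering a chunk."""
    if any(row["text"] == "" for row in chunk):
        raise ValueError("empty text")
    return [{"id": row["id"], "length": len(row["text"])} for row in chunk]


def run_chunks(rows, call, chunk_size=5, max_workers=20):
    """Process chunks in parallel; log and skip any chunk that raises."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(call, chunk): chunk
            for chunk in chunked(rows, chunk_size)
        }
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception:
                failed_ids = [row["id"] for row in futures[future]]
                logger.exception("chunk failed, omitting ids %s", failed_ids)
    return results


# Row 0 has empty text, so the chunk containing ids 0 and 1 fails as a unit.
rows = [{"id": i, "text": "x" * i} for i in range(6)]
results = run_chunks(rows, fake_llm_call, chunk_size=2)
returned_ids = sorted(r["id"] for r in results)
```

Because failure is handled per chunk, one bad row takes its chunk-mates with it. Comparing the returned ids against the input ids is a simple way to find the rows that need a retry.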


Temperature rules

The _requires_temp_one function returns True for model families that only accept temperature=1.0. All other models use temperature=0.0.

| Model family | Temperature enforced |
| --- | --- |
| o1, o1-mini, o1-preview, ... | 1.0 |
| o3, o3-mini, ... | 1.0 |
| gpt-5, gpt-5-mini, ... | 1.0 |
| Everything else | 0.0 (deterministic) |

The check uses the model name string: lower.startswith(("o1", "o3")) or "gpt-5" in lower. You cannot override this in the concurrent path; it is automatic. In the batch path, create_batch_jsonl accepts a temperature argument but overrides it to 1.0 for the affected families.
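The rule is small enough to re-create for experimentation. The sketch below is a re-implementation built from the expression quoted above, not the library's source; requires_temp_one and effective_temperature are illustrative names, and effective_temperature only approximates the described batch-path behaviour, where a requested temperature is overridden for the affected families.

```python
def requires_temp_one(model: str) -> bool:
    """Re-creation of the quoted check on the lower-cased model name."""
    lower = model.lower()
    return lower.startswith(("o1", "o3")) or "gpt-5" in lower


def effective_temperature(model: str, requested: float = 0.0) -> float:
    """Return 1.0 for the affected families, else the requested value."""
    return 1.0 if requires_temp_one(model) else requested


examples = {
    name: effective_temperature(name)
    for name in ["o1-preview", "O3-mini", "gpt-5-mini", "gpt-4.1-mini", "openai/o3-mini"]
}
```

One consequence of the string test as quoted: the o1/o3 branch is a prefix match while the gpt-5 branch is a substring match, so a provider-prefixed name such as "openai/o3-mini" would not match and would fall into the 0.0 group, whereas "openai/gpt-5" would still match.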


See also