Scoring¶

The scoring kernel turns a document-level corpus and an expanded dictionary into document-level scores. Three base methods are supported: TF (raw counts), TFIDF (tf * log(N/df)), and WFIDF ((1 + log tf) * log(N/df)), where tf is a term's count in the document, df is the number of documents containing the term, and N is the total number of documents. TFIDF and WFIDF can each be combined with the similarity_weights kernel to produce TFIDF+SIMWEIGHT and WFIDF+SIMWEIGHT. The design is streaming and avoids materializing the full corpus in memory: document frequencies and per-document text are built in a single pass over the sentence file.
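
The three base formulas above can be written out as a small function. This is a minimal sketch, not the library's code: the name `term_weight` is hypothetical, and the natural logarithm is an assumption, since the text does not state the log base.

```python
import math


def term_weight(tf: int, df: int, n_docs: int, method: str = "TFIDF") -> float:
    """Weight of a single term in a single document (illustrative only)."""
    if tf <= 0:
        return 0.0  # a term that does not occur contributes nothing
    if method == "TF":
        return float(tf)
    idf = math.log(n_docs / df)  # assumption: natural log
    if method == "TFIDF":
        return tf * idf
    if method == "WFIDF":
        return (1.0 + math.log(tf)) * idf
    raise ValueError(f"unknown method: {method!r}")
```

WFIDF dampens repeated occurrences: going from one to three occurrences triples the TFIDF weight but only roughly doubles the WFIDF weight.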

Pipeline.score chains these primitives together; the same functions are exported for users who want to score a corpus directly.

ScoringMethod¶

The string literal accepted by score_document and score_documents. Valid values are "TF", "TFIDF", "WFIDF", "TFIDF+SIMWEIGHT", and "WFIDF+SIMWEIGHT".

iter_doc_level_corpus¶

Folds a sentence-level corpus back to document level by grouping consecutive sentence lines whose IDs share the same docID_ prefix.

Yield (doc_id, document_text) by folding sentences on their doc prefix.

The CoreNLP pass produces sentence IDs shaped docID_sentenceN. This function groups consecutive sentence lines with matching doc prefixes and yields one document at a time.

Parameters:

Name	Type	Description	Default
`sent_corpus_path`	`Path \| str`	File of cleaned sentences, one per line.	required
`sent_id_path`	`Path \| str`	Matching sentence IDs.	required

Yields:

Type	Description
`tuple[str, str]`	`(doc_id, concatenated_document_text)` pairs.
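
The folding step can be illustrated on in-memory lists. The sketch below is hypothetical (`fold_sentences` is not part of the package) and assumes that the document ID is everything before the last underscore and that sentences are joined with a single space.

```python
from itertools import groupby
from typing import Iterable, Iterator


def fold_sentences(
    sentences: Iterable[str], sentence_ids: Iterable[str]
) -> Iterator[tuple[str, str]]:
    """Group consecutive sentences that share a doc prefix (illustrative only)."""
    pairs = zip(sentence_ids, sentences)
    # groupby only merges *consecutive* items, which matches the streaming design:
    # a doc ID that reappears later in the file would start a new group.
    for doc_id, group in groupby(pairs, key=lambda p: p[0].rsplit("_", 1)[0]):
        yield doc_id, " ".join(text for _, text in group)


ids = ["docA_0", "docA_1", "docB_0"]
sents = ["we value integrity", "and teamwork", "quality matters"]
folded = list(fold_sentences(sents, ids))
```

Because the function is a generator, only one document's sentences are held in memory at a time.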

document_frequencies¶

Computes the document-frequency dictionary and total document count needed by every non-TF scoring method.

Compute document frequency for every token.

Parameters:

Name	Type	Description	Default
`documents`	`Iterable[str]`	Iterable of whitespace-tokenized documents.	required
`show_progress`	`bool`	Show a tqdm progress bar.	`True`

Returns:

Type	Description
`tuple[dict[str, int], int]`	`(df_dict, n_documents)`.
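
As a rough illustration of what a document-frequency pass computes, the sketch below counts each token at most once per document. The name `count_document_frequencies` is hypothetical, and the progress bar is omitted.

```python
from collections import Counter
from typing import Iterable


def count_document_frequencies(documents: Iterable[str]) -> tuple[dict[str, int], int]:
    """Return (df_dict, n_documents) for whitespace-tokenized documents."""
    df: Counter[str] = Counter()
    n_docs = 0
    for doc in documents:
        # set() ensures a token repeated within one document counts once
        df.update(set(doc.split()))
        n_docs += 1
    return dict(df), n_docs


df_dict, n_docs = count_document_frequencies(["a b a", "b c"])
```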

score_document¶

Scores one document across all dimensions.

Parameters:

Name	Type	Description	Default
`document`	`str`	Whitespace-tokenized text.	required
`expanded_words`	`dict[str, list[str] \| set[str]]`	Expanded dictionary per dimension.	required
`method`	`ScoringMethod`	One of TF, TFIDF, WFIDF, TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT.	`'TF'`
`df_dict`	`dict[str, int] \| None`	Document frequencies. Required for non-TF methods.	`None`
`n_docs`	`int \| None`	Total document count. Required for non-TF methods.	`None`
`word_weights`	`dict[str, float] \| None`	Per-word weights. Required for SIMWEIGHT methods.	`None`

Returns:

Type	Description
`tuple[list[float], int]`	`(scores_sorted_by_dim, document_length)`.

Raises:

Type	Description
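`ValueError`	On inconsistent arguments, for example a non-TF `method` without `df_dict` and `n_docs`, or a SIMWEIGHT `method` without `word_weights`.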
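
To make the return shape concrete, here is a minimal sketch of the TF case only. It is not the library's implementation: `score_tf` is a hypothetical name, and the sketch assumes every dictionary entry is a single whitespace-delimited token.

```python
from collections import Counter


def score_tf(document: str, expanded_words: dict) -> tuple[list[float], int]:
    """Raw-count score per dimension, with dimensions in sorted order."""
    tokens = document.split()
    counts = Counter(tokens)
    scores = [
        # set() guards against duplicate entries in a list-valued dimension
        float(sum(counts[word] for word in set(expanded_words[dim])))
        for dim in sorted(expanded_words)
    ]
    return scores, len(tokens)


scores, length = score_tf(
    "honest team honest work honest",
    {"teamwork": ["team", "work"], "integrity": {"honest"}},
)
```

The scores come back in sorted dimension order (`integrity` before `teamwork` here), which is why the returned list carries no keys.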

score_documents¶

Scores an iterable of (doc_id, text) pairs and returns a DataFrame with one row per document.

Parameters:

Name	Type	Description	Default
`documents`	`Iterable[tuple[str, str]]`	Iterable of `(doc_id, text)` pairs.	required
`expanded_words`	`dict[str, list[str] \| set[str]]`	Expanded dictionary.	required
`method`	`ScoringMethod`	Scoring method.	`'TFIDF'`
`df_dict`	`dict[str, int] \| None`	Document frequencies (non-TF methods).	`None`
`n_docs`	`int \| None`	Total document count (non-TF methods).	`None`
`word_weights`	`dict[str, float] \| None`	Per-word weights (SIMWEIGHT methods).	`None`
`normalize`	`bool`	L2-normalize score vector per document.	`False`
`show_progress`	`bool`	Show a tqdm progress bar.	`True`

Returns:

Type	Description
`DataFrame`	DataFrame with `Doc_ID`, the sorted dimensions, and `document_length`.
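
The `normalize` option rescales each document's score vector to unit Euclidean length. A minimal sketch of that operation follows; `l2_normalize` is a hypothetical helper, and returning an all-zero vector unchanged is an assumption about the edge case rather than documented behavior.

```python
import math


def l2_normalize(scores: list[float]) -> list[float]:
    """Divide a score vector by its L2 norm (illustrative only)."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        return list(scores)  # assumption: leave all-zero vectors untouched
    return [s / norm for s in scores]
```

After normalization, scores describe the relative emphasis across dimensions within a document and no longer reflect its overall level.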

aggregate_to_firm_year¶

Joins document-level scores to a firm-year mapping, normalizes each dimension by document length (per 100 tokens), and averages within each firm-year cell.

Aggregate document-level scores to firm-year means.

Each dimension is first divided by document length and multiplied by 100 to put units in "per 100 tokens."

Parameters:

Name	Type	Description	Default
`scores`	`DataFrame`	DataFrame from `score_documents`.	required
`id_to_firm`	`DataFrame`	DataFrame with the document-id to firm-year mapping.	required
`dims`	`list[str]`	Dimension column names to normalize.	required
`doc_id_col`	`str`	Document-ID column in `scores`.	`'Doc_ID'`
`id_col`	`str`	Document-ID column in `id_to_firm`.	`'document_id'`
`firm_col`	`str`	Firm-ID column in `id_to_firm`.	`'firm_id'`
`time_col`	`str`	Time column in `id_to_firm`.	`'time'`

Returns:

Type	Description
`DataFrame`	Firm-year DataFrame sorted by `firm_id, time`.
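
The join, normalize, and average steps can be sketched with plain pandas. This is an illustration under assumptions, not the package's code: `aggregate_sketch` is a hypothetical name, the default column names are hard-coded, and an inner join is assumed (the join type is not stated above).

```python
import pandas as pd


def aggregate_sketch(scores: pd.DataFrame, id_to_firm: pd.DataFrame, dims: list[str]) -> pd.DataFrame:
    """Per-100-token scores averaged within each (firm_id, time) cell."""
    merged = scores.merge(
        id_to_firm, left_on="Doc_ID", right_on="document_id", how="inner"
    )
    for dim in dims:
        # divide by document length, then scale to "per 100 tokens"
        merged[dim] = merged[dim] / merged["document_length"] * 100
    return (
        merged.groupby(["firm_id", "time"], as_index=False)[dims]
        .mean()
        .sort_values(["firm_id", "time"])
        .reset_index(drop=True)
    )


scores_df = pd.DataFrame(
    {"Doc_ID": ["d1", "d2", "d3"], "integrity": [2, 3, 4], "document_length": [100, 50, 200]}
)
mapping_df = pd.DataFrame(
    {"document_id": ["d1", "d2", "d3"], "firm_id": ["A", "A", "B"], "time": [2020, 2020, 2021]}
)
firm_year = aggregate_sketch(scores_df, mapping_df, ["integrity"])
```

In this toy input, documents `d1` and `d2` score 2.0 and 6.0 per 100 tokens, so firm `A` in 2020 averages 4.0.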

zca_whiten¶

Applies ZCA whitening to a matrix of document-level scores so that each dimension has unit variance and the dimensions are decorrelated. Useful when the seed dimensions overlap in embedding space.

Apply ZCA whitening to the dimension columns of a scores DataFrame.

ZCA (zero-phase component analysis) whitening is a linear transform that decorrelates the columns (makes the covariance the identity) while staying as close as possible to the original axes. Unlike PCA whitening, it does not rotate the data into a new basis, so after whitening the column named integrity still measures something close to integrity (not "principal component 1"). This matters when downstream analysis interprets each dimension by name.

This is a post-scoring transform. Input columns retain their names; output values are the whitened coordinates. Non-dimension columns (Doc_ID, document_length) pass through unchanged.

The whitening transform is fit on scores[dims] itself. If you plan to score new documents later and want them on the same whitened scale, compute and persist the transform separately (see the "Notes" section of the docs page on whitening).

This step is similar in spirit to the post-processing step in the Marketing Measures package: https://github.com/Marketing-Measures/marketing-measures.

Parameters:

Name	Type	Description	Default
`scores`	`DataFrame`	DataFrame from `score_documents`, with one column per dimension plus `Doc_ID` and `document_length`.	required
`dims`	`list[str]`	List of dimension column names to whiten.	required
`epsilon`	`float`	Eigenvalue floor for numerical stability. Raise if the covariance is near-singular (small corpora, highly correlated dimensions).	`1e-06`

Returns:

Type	Description
`DataFrame`	A new DataFrame with the same shape. Dimension columns are whitened; every other column is copied through.
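
For readers who want to persist the transform and apply it to new documents, the following NumPy sketch shows one common way to fit and reuse a ZCA transform. It is an illustration under assumptions, not the package's code: `fit_zca` and `apply_zca` are hypothetical names, the data is mean-centered before whitening, and `epsilon` is applied as a lower bound on the eigenvalues. It expects at least two dimension columns.

```python
import numpy as np


def fit_zca(X: np.ndarray, epsilon: float = 1e-6) -> tuple[np.ndarray, np.ndarray]:
    """Return (mean, W) such that (X - mean) @ W has roughly identity covariance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    eigvals = np.maximum(eigvals, epsilon)  # assumption: epsilon acts as a floor
    # ZCA: rotate into the eigenbasis, rescale, then rotate back to the original axes
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return mean, W


def apply_zca(X: np.ndarray, mean: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply a previously fitted transform, e.g. to newly scored documents."""
    return (X - mean) @ W
```

With a DataFrame, `X` would be `scores[dims].to_numpy(dtype=float)`. Saving `mean` and `W` (for example with `np.save`) lets later batches be placed on the same whitened scale. The final rotation back by `eigvecs.T` is what distinguishes ZCA from PCA whitening and keeps each output column close to its named input dimension.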