Scoring

The scoring kernel turns a document-level corpus and an expanded dictionary into document-level scores. Three base methods are supported: TF (raw counts), TFIDF (tf * log(N/df)), and WFIDF ((1 + log tf) * log(N/df)), where tf is a term's count in the document, df is the number of documents containing the term, and N is the total number of documents. TFIDF and WFIDF can each be combined with the similarity_weights kernel to produce TFIDF+SIMWEIGHT and WFIDF+SIMWEIGHT. The streaming design avoids materializing the full corpus in memory: document frequencies and per-document text are built in a single pass over the sentence file.
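As a rough illustration of the three base formulas, the sketch below computes the weight of a single term. The helper name term_weight is hypothetical and not part of the package; the natural logarithm and the zero weight for absent terms are assumptions, since the text above does not state the log base or the tf = 0 case.

```python
import math


def term_weight(tf: int, df: int, n_docs: int, method: str = "TFIDF") -> float:
    """Weight of one term in one document (hypothetical helper, not the package API)."""
    if tf <= 0:
        # Assumption: a term that does not occur contributes nothing.
        return 0.0
    if method == "TF":
        return float(tf)
    # Assumption: natural log; the source only writes "log".
    idf = math.log(n_docs / df)
    if method == "TFIDF":
        return tf * idf
    if method == "WFIDF":
        return (1.0 + math.log(tf)) * idf
    raise ValueError(f"unknown method: {method}")
```

WFIDF dampens repeated occurrences: going from tf = 1 to tf = 10 multiplies the TFIDF weight by 10 but the WFIDF weight by only about 3.3.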

Pipeline.score chains these primitives together; the same functions are exported for users who want to score a corpus directly.

ScoringMethod

The string-literal type accepted by the method argument of score_document and score_documents. Valid values are "TF", "TFIDF", "WFIDF", "TFIDF+SIMWEIGHT", and "WFIDF+SIMWEIGHT".

iter_doc_level_corpus

Folds a sentence-level corpus back to document level. The CoreNLP pass produces sentence IDs shaped docID_sentenceN; this function groups consecutive sentence lines whose IDs share the same docID_ prefix and yields one (doc_id, document_text) pair at a time.

Parameters:

Name Type Description Default
sent_corpus_path Path | str

File of cleaned sentences, one per line.

required
sent_id_path Path | str

Matching sentence IDs.

required

Yields:

Type Description
tuple[str, str]

(doc_id, concatenated_document_text) pairs.
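The folding step can be sketched with itertools.groupby over in-memory lists. This is an illustration only: the name fold_sentences is hypothetical, and it assumes that the document ID is everything before the last underscore and that sentences are joined with a single space.

```python
from itertools import groupby
from typing import Iterable, Iterator


def fold_sentences(
    sent_ids: Iterable[str], sentences: Iterable[str]
) -> Iterator[tuple[str, str]]:
    """Group consecutive sentences by document prefix (hypothetical sketch)."""

    def doc_prefix(pair: tuple[str, str]) -> str:
        # Assumption: IDs look like "docID_sentenceN"; split on the last "_".
        return pair[0].rsplit("_", 1)[0]

    for doc_id, group in groupby(zip(sent_ids, sentences), key=doc_prefix):
        # Assumption: sentences of one document are joined with spaces.
        yield doc_id, " ".join(text for _, text in group)


ids = ["a_0", "a_1", "b_0"]
sents = ["x y", "z", "w"]
folded = list(fold_sentences(ids, sents))
```

Because groupby only merges adjacent items, sentences of the same document must be consecutive in the input, which matches the "consecutive sentence lines" wording above.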

document_frequencies

Computes the document frequency of every token, together with the total document count. Both values are needed by every non-TF scoring method.

Parameters:

Name Type Description Default
documents Iterable[str]

Iterable of whitespace-tokenized documents.

required
show_progress bool

Show a tqdm progress bar.

True

Returns:

Type Description
tuple[dict[str, int], int]

(df_dict, n_documents).
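A minimal sketch of the same computation, assuming each document is a whitespace-tokenized string and a token counts once per document no matter how often it repeats. The function name count_document_frequencies is hypothetical and the progress bar is omitted.

```python
from collections import Counter
from typing import Iterable


def count_document_frequencies(documents: Iterable[str]) -> tuple[dict[str, int], int]:
    """Return (df_dict, n_documents) in one pass (hypothetical sketch)."""
    df: Counter[str] = Counter()
    n_documents = 0
    for doc in documents:
        # set() ensures each token is counted at most once per document.
        df.update(set(doc.split()))
        n_documents += 1
    return dict(df), n_documents


df_dict, n_docs = count_document_frequencies(["a b a", "b c"])
```

Since the input is consumed once, a generator such as the output of iter_doc_level_corpus could be streamed through it without holding the corpus in memory.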

score_document

Scores one document across all dimensions.

Parameters:

Name Type Description Default
document str

Whitespace-tokenized text.

required
expanded_words dict[str, list[str] | set[str]]

Expanded dictionary per dimension.

required
method ScoringMethod

One of TF, TFIDF, WFIDF, TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT.

'TF'
df_dict dict[str, int] | None

Document frequencies. Required for non-TF methods.

None
n_docs int | None

Total document count. Required for non-TF methods.

None
word_weights dict[str, float] | None

Per-word weights. Required for SIMWEIGHT methods.

None

Returns:

Type Description
tuple[list[float], int]

(scores_sorted_by_dim, document_length).

Raises:

Type Description
ValueError

On inconsistent arguments.

score_documents

Scores an iterable of (doc_id, text) pairs and returns a DataFrame with one row per document.

Parameters:

Name Type Description Default
documents Iterable[tuple[str, str]]

Iterable of (doc_id, text) pairs.

required
expanded_words dict[str, list[str] | set[str]]

Expanded dictionary.

required
method ScoringMethod

Scoring method.

'TFIDF'
df_dict dict[str, int] | None

Document frequencies (non-TF methods).

None
n_docs int | None

Total document count (non-TF methods).

None
word_weights dict[str, float] | None

Per-word weights (SIMWEIGHT methods).

None
normalize bool

L2-normalize score vector per document.

False
show_progress bool

Show a tqdm progress bar.

True

Returns:

Type Description
DataFrame

DataFrame with Doc_ID, the sorted dimensions, and document_length.
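The normalize option rescales each document's score vector to unit Euclidean length. A minimal sketch of that operation on one vector follows; the name l2_normalize is hypothetical, and leaving an all-zero vector unchanged is an assumption, since the handling of that case is not documented here.

```python
import math


def l2_normalize(scores: list[float]) -> list[float]:
    """Divide a score vector by its Euclidean norm (hypothetical sketch)."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        # Assumption: a document with no dictionary hits stays all zeros.
        return list(scores)
    return [s / norm for s in scores]


unit = l2_normalize([3.0, 4.0])
```

After normalization only the relative mix of dimensions within a document remains; differences in overall magnitude between documents are removed.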

aggregate_to_firm_year

Joins document-level scores to a firm-year mapping, normalizes each dimension by document length (per 100 tokens), and averages within each firm-year cell.

Each dimension is first divided by document length and multiplied by 100 to put units in "per 100 tokens."

Parameters:

Name Type Description Default
scores DataFrame

DataFrame from score_documents.

required
id_to_firm DataFrame

DataFrame with the document-id to firm-year mapping.

required
dims list[str]

Dimension column names to normalize.

required
doc_id_col str

Document-ID column in scores.

'Doc_ID'
id_col str

Document-ID column in id_to_firm.

'document_id'
firm_col str

Firm-ID column in id_to_firm.

'firm_id'
time_col str

Time column in id_to_firm.

'time'

Returns:

Type Description
DataFrame

Firm-year DataFrame sorted by firm_id, time.
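The join, per-100-token normalization, and averaging can be sketched in pandas as below. The function name firm_year_means is hypothetical, the column names are the documented defaults, and the inner join (documents without a firm-year mapping are dropped) is an assumption.

```python
import pandas as pd


def firm_year_means(
    scores: pd.DataFrame, id_to_firm: pd.DataFrame, dims: list[str]
) -> pd.DataFrame:
    """Join, scale to per-100-token units, and average (hypothetical sketch)."""
    # Assumption: inner join on the default ID columns.
    merged = scores.merge(
        id_to_firm, left_on="Doc_ID", right_on="document_id", how="inner"
    )
    # Divide each dimension by document length, then express per 100 tokens.
    merged[dims] = merged[dims].div(merged["document_length"], axis=0) * 100
    out = merged.groupby(["firm_id", "time"], as_index=False)[dims].mean()
    return out.sort_values(["firm_id", "time"]).reset_index(drop=True)


scores = pd.DataFrame(
    {
        "Doc_ID": ["d1", "d2", "d3"],
        "integrity": [2.0, 8.0, 1.0],
        "document_length": [100, 200, 50],
    }
)
id_to_firm = pd.DataFrame(
    {
        "document_id": ["d1", "d2", "d3"],
        "firm_id": ["A", "A", "B"],
        "time": [2020, 2020, 2021],
    }
)
firm_year = firm_year_means(scores, id_to_firm, ["integrity"])
```

In this toy input, firm A's two documents score 2.0 and 4.0 per 100 tokens, so its 2020 cell averages to 3.0. Each document counts equally in the mean regardless of its length.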

zca_whiten

Applies ZCA whitening to a matrix of document-level scores so that each dimension has unit variance and the dimensions are decorrelated. Useful when the seed dimensions overlap in embedding space.

ZCA (zero-phase component analysis) whitening is a linear transform that decorrelates the columns (makes the covariance the identity) while staying as close as possible to the original axes. Unlike PCA whitening, it does not rotate the data into a new basis, so after whitening the column named integrity still measures something close to integrity (not "principal component 1"). This matters when downstream analysis interprets each dimension by name.

This is a post-scoring transform. Input columns retain their names; output values are the whitened coordinates. Non-dimension columns (Doc_ID, document_length) pass through unchanged.

The whitening transform is fit on scores[dims] itself. If you plan to score new documents later and want them on the same whitened scale, compute and persist the transform separately (see the "Notes" section of the docs page on whitening).

Similar in spirit to the post-processing step in the Marketing Measures package: https://github.com/Marketing-Measures/marketing-measures.
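For readers who want to see the linear algebra, a minimal NumPy sketch of ZCA whitening follows. It is not the package's implementation: zca_whiten_matrix is a hypothetical name, the data are mean-centered first, and epsilon is added to each eigenvalue (the package may apply its floor differently). The transform matrix is returned alongside the whitened data so that it could be persisted and reused on new documents.

```python
import numpy as np


def zca_whiten_matrix(x: np.ndarray, epsilon: float = 1e-6) -> tuple[np.ndarray, np.ndarray]:
    """Return (whitened data, transform W) for rows = documents (hypothetical sketch)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)  # assumption: center columns before whitening
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric
    # W = V diag(1 / sqrt(lambda + eps)) V^T; rotating back with V^T keeps
    # the result aligned with the original axes, unlike PCA whitening.
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return centered @ w, w


rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
x = base @ np.array([[1.0, 0.8], [0.0, 0.6]])  # two correlated columns
z, w = zca_whiten_matrix(x)
```

The covariance of z is approximately the identity, and W is symmetric, which is the property that distinguishes ZCA from PCA whitening.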

Parameters:

Name Type Description Default
scores DataFrame

DataFrame from score_documents, with one column per dimension plus Doc_ID and document_length.

required
dims list[str]

List of dimension column names to whiten.

required
epsilon float

Eigenvalue floor for numerical stability. Raise if the covariance is near-singular (small corpora, highly correlated dimensions).

1e-06

Returns:

Type Description
DataFrame

A new DataFrame with the same shape. Dimension columns are whitened; every other column is copied through.