Scoring
The scoring kernel turns a document-level corpus and an expanded
dictionary into document-level scores. Three base
methods are supported: TF (raw counts), TFIDF (tf * log(N/df)), and
WFIDF ((1 + log tf) * log(N/df)), where tf is a term's count in the
document, df is the number of documents containing the term, and N is the
total number of documents. TFIDF and WFIDF can each be combined with the
similarity_weights kernel to produce TFIDF+SIMWEIGHT and WFIDF+SIMWEIGHT.
The streaming design avoids materializing the full corpus in memory: document
frequencies and per-document text are built in a single pass over the sentence
file.
Pipeline.score chains these primitives together; the same functions are
exported for users who want to score a corpus directly.
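As a rough illustration of the three base formulas, the sketch below computes the weight of a single term. The function name term_weight is hypothetical and is not part of the package; the natural logarithm is also an assumption, since the log base is not stated above.

```python
import math


def term_weight(tf: int, df: int, n_docs: int, method: str = "TFIDF") -> float:
    """Weight of one term in one document (hypothetical helper, not the package API)."""
    if tf <= 0:
        return 0.0  # a term that does not occur contributes nothing
    if method == "TF":
        return float(tf)  # raw count
    idf = math.log(n_docs / df)  # assumption: natural log
    if method == "TFIDF":
        return tf * idf
    if method == "WFIDF":
        return (1.0 + math.log(tf)) * idf  # sublinear damping of the count
    raise ValueError(f"unknown method: {method!r}")
```

WFIDF grows with the logarithm of the count, so a term repeated many times in one document adds less than it would under TFIDF.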
ScoringMethod
The string literal accepted by score_document and score_documents. Valid
values are "TF", "TFIDF", "WFIDF", "TFIDF+SIMWEIGHT", and
"WFIDF+SIMWEIGHT".
iter_doc_level_corpus
Folds a sentence-level corpus back to document level.
The CoreNLP pass produces sentence IDs shaped docID_sentenceN.
This function groups consecutive sentence lines whose IDs share the same
docID_ prefix and yields one (doc_id, document_text) pair at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sent_corpus_path | Path \| str | File of cleaned sentences, one per line. | required |
| sent_id_path | Path \| str | Matching sentence IDs. | required |
Yields:
| Type | Description |
|---|---|
| tuple[str, str] | (doc_id, document_text), one document at a time. |
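The folding step can be sketched with itertools.groupby, which groups consecutive items that share a key. This is a minimal sketch, not the package's implementation: the name fold_sentences is hypothetical, it takes iterables of lines rather than paths (open file handles work too), and it assumes the document ID is everything before the last underscore in the sentence ID.

```python
from itertools import groupby
from typing import Iterable, Iterator


def fold_sentences(
    sentences: Iterable[str], sentence_ids: Iterable[str]
) -> Iterator[tuple[str, str]]:
    """Yield (doc_id, document_text) from parallel sentence and ID lines."""
    pairs = zip(
        (sid.strip() for sid in sentence_ids),
        (sent.strip() for sent in sentences),
    )

    # Assumption: "docID_sentenceN" splits on the LAST underscore.
    def doc_key(pair: tuple[str, str]) -> str:
        return pair[0].rsplit("_", 1)[0]

    # groupby only merges consecutive lines, so one document is held at a time.
    for doc_id, group in groupby(pairs, key=doc_key):
        yield doc_id, " ".join(text for _, text in group)
```

Because only consecutive lines are grouped, a document whose sentences are interleaved with another document's would be yielded more than once in this sketch.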
document_frequencies
Computes the document frequency of every token, along with the total document count. Both are needed by every non-TF scoring method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| documents | Iterable[str] | Iterable of whitespace-tokenized documents. | required |
| show_progress | bool | Print a tqdm bar. | True |
Returns:
| Type | Description |
|---|---|
| tuple[dict[str, int], int] | The document-frequency dictionary and the total document count. |
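A document frequency counts each token at most once per document, regardless of how often it appears there. The sketch below shows that single-pass computation; the name count_document_frequencies is hypothetical, and progress reporting is omitted.

```python
from collections import Counter
from typing import Iterable


def count_document_frequencies(documents: Iterable[str]) -> tuple[dict[str, int], int]:
    """Return (df_dict, n_docs) for whitespace-tokenized documents."""
    df: Counter[str] = Counter()
    n_docs = 0
    for doc in documents:
        # set() ensures a token counts once per document, however often it repeats.
        df.update(set(doc.split()))
        n_docs += 1
    return dict(df), n_docs
```

Since the input is consumed once, a generator of documents can be passed without loading the corpus into memory.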
score_document
Scores one document across all dimensions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | str | Whitespace-tokenized text. | required |
| expanded_words | dict[str, list[str] \| set[str]] | Expanded dictionary per dimension. | required |
| method | ScoringMethod | One of TF, TFIDF, WFIDF, TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT. | 'TF' |
| df_dict | dict[str, int] \| None | Document frequencies. Required for non-TF methods. | None |
| n_docs | int \| None | Total document count. Required for non-TF methods. | None |
| word_weights | dict[str, float] \| None | Per-word weights. Required for SIMWEIGHT methods. | None |
Returns:
| Type | Description |
|---|---|
| tuple[list[float], int] | |
Raises:
| Type | Description |
|---|---|
| ValueError | On inconsistent arguments. |
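The parameter table implies which optional arguments each method needs. The sketch below encodes those requirements as a standalone check; check_scoring_args is a hypothetical helper, and the exact conditions under which the package raises ValueError are an assumption drawn from the "Required for" notes above.

```python
VALID_METHODS = {"TF", "TFIDF", "WFIDF", "TFIDF+SIMWEIGHT", "WFIDF+SIMWEIGHT"}


def check_scoring_args(method, df_dict=None, n_docs=None, word_weights=None):
    """Raise ValueError when the arguments do not match the scoring method."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown scoring method: {method!r}")
    # Every method except TF uses inverse document frequency.
    if method != "TF" and (df_dict is None or n_docs is None):
        raise ValueError(f"{method} requires df_dict and n_docs")
    # The +SIMWEIGHT variants additionally need per-word weights.
    if method.endswith("+SIMWEIGHT") and word_weights is None:
        raise ValueError(f"{method} requires word_weights")
```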
score_documents
Scores an iterable of (doc_id, text) pairs and returns a DataFrame with one
row per document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| documents | Iterable[tuple[str, str]] | Iterable of (doc_id, text) pairs. | required |
| expanded_words | dict[str, list[str] \| set[str]] | Expanded dictionary. | required |
| method | ScoringMethod | Scoring method. | 'TFIDF' |
| df_dict | dict[str, int] \| None | Document frequencies (non-TF methods). | None |
| n_docs | int \| None | Total document count (non-TF methods). | None |
| word_weights | dict[str, float] \| None | Per-word weights (SIMWEIGHT methods). | None |
| normalize | bool | L2-normalize score vector per document. | False |
| show_progress | bool | Print a tqdm bar. | True |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with |
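The normalize option is described as L2-normalizing each document's score vector, that is, dividing the vector by its Euclidean length. A minimal sketch of that operation follows; l2_normalize is a hypothetical name, and leaving an all-zero vector unchanged is an assumption about the edge case.

```python
import math


def l2_normalize(scores: list[float]) -> list[float]:
    """Scale a score vector to unit Euclidean length."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        # Assumption: a document matching no dictionary words stays all zeros.
        return list(scores)
    return [s / norm for s in scores]
```

After normalization only the relative mix of dimensions within a document remains; the overall magnitude is discarded.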
aggregate_to_firm_year
Aggregates document-level scores to firm-year means. The scores are joined to a firm-year mapping, each dimension is divided by document length and multiplied by 100 to put it in units of "per 100 tokens", and the results are averaged within each firm-year cell.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scores | DataFrame | DataFrame from | required |
| id_to_firm | DataFrame | DataFrame with the document-id to firm-year mapping. | required |
| dims | list[str] | Dimension column names to normalize. | required |
| doc_id_col | str | Document-ID column in | 'Doc_ID' |
| id_col | str | Document-ID column in | 'document_id' |
| firm_col | str | Firm-ID column in | 'firm_id' |
| time_col | str | Time column in | 'time' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Firm-year DataFrame sorted by |
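The join, length normalization, and averaging steps can be sketched in pandas as follows. This is an illustration under stated assumptions, not the package's code: sketch_aggregate is a hypothetical name, the length column is assumed to be called document_length, and documents without a firm-year match are dropped by an inner join.

```python
import pandas as pd


def sketch_aggregate(
    scores: pd.DataFrame,
    id_to_firm: pd.DataFrame,
    dims: list[str],
    doc_id_col: str = "Doc_ID",
    id_col: str = "document_id",
    firm_col: str = "firm_id",
    time_col: str = "time",
) -> pd.DataFrame:
    """Average length-normalized scores within each firm-year cell."""
    # Assumption: unmatched documents are dropped (inner join).
    merged = scores.merge(id_to_firm, left_on=doc_id_col, right_on=id_col, how="inner")
    # Assumption: document length lives in a column named "document_length".
    merged[dims] = merged[dims].div(merged["document_length"], axis=0) * 100
    # One row per (firm, time) with the mean of each dimension.
    return merged.groupby([firm_col, time_col], as_index=False)[dims].mean()
```

Normalizing before averaging means each document contributes equally to its firm-year cell, regardless of its length.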
zca_whiten
Applies ZCA whitening to a matrix of document-level scores so that each dimension has unit variance and the dimensions are decorrelated. Useful when the seed dimensions overlap in embedding space.
ZCA (zero-phase component analysis) whitening is a linear transform that
decorrelates the columns (makes the covariance the identity) while
staying as close as possible to the original axes. Unlike PCA whitening,
it does not rotate the data into a new basis, so after whitening the
column named integrity still measures something close to integrity
(not "principal component 1"). This matters when downstream analysis
interprets each dimension by name.
This is a post-scoring transform. Input columns retain their names;
output values are the whitened coordinates. Non-dimension columns
(Doc_ID, document_length) pass through unchanged.
The whitening transform is fit on scores[dims] itself. If you plan
to score new documents later and want them on the same whitened scale,
compute and persist the transform separately (see the "Notes" section
of the docs page on whitening).
Similar in spirit to the post-processing step in the Marketing Measures package: https://github.com/Marketing-Measures/marketing-measures.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scores | DataFrame | DataFrame from | required |
| dims | list[str] | List of dimension column names to whiten. | required |
| epsilon | float | Eigenvalue floor for numerical stability. Raise if the covariance is near-singular (small corpora, highly correlated dimensions). | 1e-06 |
Returns:
| Type | Description |
|---|---|
| DataFrame | A new DataFrame with the same shape. Dimension columns are whitened; every other column is copied through. |
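For readers who want to see the transform itself, here is a minimal NumPy sketch of ZCA whitening on a plain array. It is not the package's implementation: zca_whiten_array is a hypothetical name, centering the columns first is an assumption, and the eigenvalue floor is applied as a maximum with epsilon, which is one possible reading of "floor".

```python
import numpy as np


def zca_whiten_array(X, epsilon: float = 1e-6):
    """Return (whitened values, whitening matrix W, column means)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean  # assumption: columns are centered before whitening
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    # eigh is appropriate because a covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, epsilon)  # assumption: floor via maximum
    # Rotating back with eigvecs.T is what distinguishes ZCA from PCA whitening.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return Xc @ W, W, mean
```

Keeping W and the column means is one way to put later documents on the same whitened scale: subtract the stored means from the new scores and multiply by the stored W.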