Whiten the dimension scores
Problem
Your five culture dimensions are correlated. A firm that scores high on integrity tends to score high on teamwork, because both draw on similar vocabulary. When you enter the scores together as regressors, this shared variance inflates the apparent signal of each dimension. You want dimensions that are decorrelated, while keeping column names interpretable.
Solution
Apply ZCA (zero-phase component analysis) whitening as a post-scoring
step. ZCA is the whitening transform that makes the columns uncorrelated
with unit variance while staying as close as possible to the original
axes. Unlike PCA whitening, ZCA does not rotate the data into a new
basis: the column named integrity still measures something close to
integrity, not "principal component 1".
This is the same post-processing idea that the Marketing-Measures/marketing-measures package uses for its firm-level measures.
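To make the transform concrete, here is a minimal NumPy sketch of ZCA whitening on synthetic data. It is not the package's implementation: the function name `zca_whiten_sketch` is hypothetical, and treating `epsilon` as a lower bound on the eigenvalues is an assumption about what "eigenvalue floor" means.

```python
import numpy as np


def zca_whiten_sketch(X, epsilon=1e-6):
    """Illustrative ZCA whitening; hypothetical helper, not the package API."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each column
    cov = np.cov(Xc, rowvar=False)                # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # cov is symmetric
    # Assumption: epsilon acts as a floor on the eigenvalues.
    scale = 1.0 / np.sqrt(np.maximum(eigvals, epsilon))
    W = (eigvecs * scale) @ eigvecs.T             # U diag(scale) U^T, symmetric
    return Xc @ W, W


# Synthetic stand-in for five correlated dimension scores.
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 5))
mixing = np.eye(5) + 0.5                          # adds a shared component
X = independent @ mixing

Z, W = zca_whiten_sketch(X)

print(np.round(np.corrcoef(X, rowvar=False), 2))  # correlated before
print(np.round(np.cov(Z, rowvar=False), 2))       # close to identity after

# Each whitened column should still track its own original column.
own_axis_corr = [np.corrcoef(X[:, j], Z[:, j])[0, 1] for j in range(5)]
print(np.round(own_axis_corr, 2))
```

Because the whitening matrix is symmetric (rotate into the eigenbasis, rescale, rotate back), each output column stays positively correlated with the input column of the same name. PCA whitening would skip the final rotation back and lose that correspondence.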
Turn it on in Config
from lmsy_w2v_rfs import Pipeline, Config, load_example_seeds
seeds = load_example_seeds("culture_2021")
cfg = Config(seeds=seeds, zca_whiten=True)
p = Pipeline.from_csv("transcripts.csv", work_dir="runs/x", config=cfg)
p.run()
df = p.score_df("TFIDF") # columns are already whitened
Or whiten an existing scores DataFrame manually
from lmsy_w2v_rfs import zca_whiten
df_whitened = zca_whiten(
    df,
    dims=["integrity", "teamwork", "innovation", "respect", "quality"],
)
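After whitening, it can be worth confirming that the columns really are uncorrelated with unit variance. The helper below is a hypothetical sanity check, not part of the package; it takes a plain array, so with a DataFrame you would pass something like `df_whitened[dims].to_numpy()`. The demo uses synthetic data constructed to be exactly white.

```python
import numpy as np


def whitening_report(scores):
    """Summarize how far a score matrix is from identity covariance.

    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    cov = np.cov(scores, rowvar=False)
    off_diag = cov - np.diag(np.diag(cov))
    return {
        "max_abs_offdiag_cov": float(np.abs(off_diag).max()),
        "max_abs_var_minus_1": float(np.abs(np.diag(cov) - 1.0).max()),
    }


# Build exactly white synthetic data: orthonormal, centered columns
# rescaled so the sample covariance (ddof=1) is the identity.
rng = np.random.default_rng(1)
n_docs = 200
raw = rng.normal(size=(n_docs, 5))
centered = raw - raw.mean(axis=0)
q, _ = np.linalg.qr(centered)
white = q * np.sqrt(n_docs - 1)

report = whitening_report(white)
print(report)
```

Both numbers should be near zero for whitened scores. If an eigenvalue was small enough to be affected by `epsilon`, the variance figure can drift away from zero while the off-diagonal figure stays small.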
CLI
lmsy-w2v-rfs run --input docs.txt --out runs/x --zca-whiten
Notes
- In-sample fit: the transform is computed from scores[dims] itself, so the decorrelation is exact on the data you pass in. If you want new documents to land on the same whitened scale, compute the whitening matrix on a reference corpus and cache it; apply the cached matrix to future scores. The current implementation does not split fit and transform; for now, fit on the full corpus you plan to analyze.
- epsilon: eigenvalue floor for numerical stability. Default 1e-6. Raise to 1e-4 if the covariance is near-singular (small corpus, or dimensions that are nearly-degenerate).
- Order of operations: whitening runs after tf-idf weighting and after L2 normalization (if Config(tfidf_normalize=True) is set) but before firm_year aggregation. Firm-year means are over whitened document scores.
- Interaction with firm-year aggregation: the per-document divide-by-length and scale-to-per-100-tokens steps in aggregate_to_firm_year are applied to the whitened columns. If that is not what you want, skip firm_year and aggregate yourself.
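The reference-corpus workflow from the in-sample fit note can be sketched by hand. Since the package does not split fit and transform, the `fit_zca` and `apply_zca` helpers below are hypothetical, as is the cache file name, and `epsilon` is again assumed to be a lower bound on the eigenvalues. The data is synthetic.

```python
import tempfile
from pathlib import Path

import numpy as np


def fit_zca(reference, epsilon=1e-6):
    """Estimate column means and a ZCA matrix from reference scores."""
    reference = np.asarray(reference, dtype=float)
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Assumption: epsilon is a floor on the eigenvalues.
    scale = 1.0 / np.sqrt(np.maximum(eigvals, epsilon))
    W = (eigvecs * scale) @ eigvecs.T
    return mean, W


def apply_zca(scores, mean, W):
    """Whiten scores with previously fitted parameters."""
    return (np.asarray(scores, dtype=float) - mean) @ W


rng = np.random.default_rng(2)
mixing = np.eye(5) + 0.5
reference = rng.normal(size=(1000, 5)) @ mixing   # reference corpus scores
new_scores = rng.normal(size=(10, 5)) @ mixing    # documents scored later

mean, W = fit_zca(reference)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "zca_reference.npz"  # hypothetical cache file
    np.savez(cache_path, mean=mean, W=W)          # cache the fitted transform
    with np.load(cache_path) as cached:
        new_whitened = apply_zca(new_scores, cached["mean"], cached["W"])

print(new_whitened.shape)
```

The new documents are centered with the reference means and multiplied by the reference matrix, so they land on the reference scale. Their own sample covariance will generally not be exactly the identity; only the reference corpus is decorrelated exactly.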
Related
- Scoring: the tf-idf formulas that produce the pre-whitening columns.
- Reference: scoring: API docstring for
zca_whiten.