Aggregate document scores
Problem
The pipeline scores every document independently: one row per document, one column per dimension. Many analyses want grouped panels: for example, firm-year panels for corporate-finance research or product-year panels for market studies. You need to (a) map each document to its firm and fiscal year, (b) normalize by document length so long transcripts do not dominate, (c) average across all documents within a firm-year, and (d) merge the resulting panel with external firm covariates (total assets, industry code, returns, whatever your regression calls for).
Solution
Build an id_to_firm DataFrame with columns document_id, firm_id, and
time, then call p.firm_year(id_to_firm, method="TFIDF"). The method
normalizes scores to "per 100 tokens," averages within (firm_id, time)
groups, and returns a sorted panel.
End-to-end example
import pandas as pd
from lmsy_w2v_rfs import Pipeline, Config, load_example_seeds
seeds = load_example_seeds("culture_2021")
# 1. Run the pipeline as usual. `transcripts` and `transcript_ids` are assumed
#    to be loaded already, with one ID per text.
p = Pipeline(
    texts=transcripts,
    doc_ids=transcript_ids,  # e.g., ["AAPL_2021Q1", "AAPL_2021Q2", ...]
    work_dir="runs/firm_panel",
    config=Config(seeds=seeds, preprocessor="corenlp", n_cores=8),
)
p.run(methods=("TFIDF",))
# 2. Build the document-to-firm-year mapping.
id_to_firm = pd.DataFrame({
    "document_id": ["AAPL_2021Q1", "AAPL_2021Q2", "AAPL_2021Q3", "AAPL_2021Q4",
                    "MSFT_2021Q1", "MSFT_2021Q2"],
    "firm_id": ["AAPL", "AAPL", "AAPL", "AAPL",
                "MSFT", "MSFT"],
    "time": [2021, 2021, 2021, 2021,
             2021, 2021],
})
# 3. Aggregate.
panel = p.firm_year(id_to_firm, method="TFIDF")
print(panel)
Expected shape: one row per (firm_id, time) combination, with the five
culture dimensions as columns. The example above yields two rows: AAPL 2021,
the mean of its four length-normalized document scores, and MSFT 2021, the
mean of its two.
What per-100-tokens normalization means
Raw scores scale with document length: a 5,000-word transcript mentions
innovation more often than a 500-word one, all else equal. To put all
documents on a comparable footing, firm_year divides each dimension by
document_length and multiplies by 100:
score_per_100_tokens = 100 * raw_score / document_length
The mean over (firm_id, time) is then taken on the normalized scores. If you
want the raw document scores for your own aggregation, pull them directly
from p.score_df("TFIDF") and skip firm_year.
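To make the arithmetic concrete, here is a minimal pandas sketch of the same normalize-then-average steps on toy data. It is not the library's implementation: the column names Doc_ID, document_length, and innovation are assumptions for illustration, and every number is made up.

```python
import pandas as pd

# Toy document-level scores (hypothetical values and column names).
scores = pd.DataFrame({
    "Doc_ID": ["AAPL_2021Q1", "AAPL_2021Q2", "MSFT_2021Q1"],
    "document_length": [5000, 500, 2000],
    "innovation": [50.0, 10.0, 40.0],
})
id_to_firm = pd.DataFrame({
    "document_id": ["AAPL_2021Q1", "AAPL_2021Q2", "MSFT_2021Q1"],
    "firm_id": ["AAPL", "AAPL", "MSFT"],
    "time": [2021, 2021, 2021],
})

# Normalize each document to "per 100 tokens".
scores["innovation"] = 100 * scores["innovation"] / scores["document_length"]

# Attach firm and time labels, then average the normalized scores per group.
merged = scores.merge(
    id_to_firm, left_on="Doc_ID", right_on="document_id", how="left"
)
toy_panel = (
    merged.groupby(["firm_id", "time"], as_index=False)["innovation"]
    .mean()
    .sort_values(["firm_id", "time"])
    .reset_index(drop=True)
)
print(toy_panel)
```

In this toy data the 500-token document has the lower raw score (10 versus 50) but the higher normalized score (2.0 versus 1.0), so the AAPL 2021 cell averages to 1.5 rather than being dominated by the long transcript.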
Merging with external firm covariates
A realistic research flow: take panel, merge a Compustat extract, run a
panel regression.
import pandas as pd
covariates = pd.read_csv("compustat_firm_year.csv")
# covariates has columns: firm_id, time, at (total assets), roa, ind2
merged = panel.merge(covariates, on=["firm_id", "time"], how="inner")
# Standardize dimension scores within year for readability of coefficients.
dims = list(seeds.keys()) # or whatever column names your pipeline used
for dim in dims:
    merged[dim + "_z"] = merged.groupby("time")[dim].transform(
        lambda s: (s - s.mean()) / s.std()
    )
merged.to_parquet("panel_with_covariates.parquet")
From here you run your preferred fixed-effects regression in statsmodels, linearmodels, or R. The pipeline's job ends at the panel export.
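As one possible next step, the sketch below estimates a slope with year fixed effects through the within transformation, using only pandas and NumPy. The data are synthetic and the column names innovation_z and roa are placeholders. In practice a dedicated estimator from one of the packages above is the better choice, since it also reports standard errors.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged panel (all values are made up).
toy = pd.DataFrame({
    "firm_id": ["A", "B", "C", "A", "B", "C"],
    "time": [2020, 2020, 2020, 2021, 2021, 2021],
    "innovation_z": [-1.0, 0.0, 1.0, -0.5, 0.5, 1.5],
})
year_effect = toy["time"].map({2020: 0.1, 2021: 0.4})
toy["roa"] = 2.0 * toy["innovation_z"] + year_effect  # true slope is 2.0

# Within transformation: subtract the year mean from outcome and regressor,
# which absorbs the year fixed effect.
x = toy["innovation_z"] - toy.groupby("time")["innovation_z"].transform("mean")
y = toy["roa"] - toy.groupby("time")["roa"].transform("mean")

# OLS slope on the demeaned data (no intercept is needed after demeaning).
slope = float(np.dot(x, y) / np.dot(x, x))
print(round(slope, 6))
```

Because the synthetic outcome has no noise, the recovered slope matches the value used to generate it, which makes the sketch easy to check by hand.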
Using a different aggregation window
time is an opaque label. Pass fiscal quarters ("2021Q1"), fiscal years
(2021), or decades ("2020s") as strings or integers; firm_year groups on
whatever is in the column. The method name is historical: the aggregation is
really "group by firm and time, whatever time means to you."
For a firm-quarter panel:
id_to_firm = pd.DataFrame({
    "document_id": [...],
    "firm_id": [...],
    "time": ["2021Q1", "2021Q2", "2021Q3", ...],
})
panel = p.firm_year(id_to_firm, method="TFIDF")
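If the quarter labels are not already at hand, they can be derived from document dates. The sketch below assumes a hypothetical metadata table with a call_date column (neither the table nor the column name comes from the library) and builds calendar-quarter labels with pandas periods.

```python
import pandas as pd

# Hypothetical document metadata; call_date is an assumed column name.
meta = pd.DataFrame({
    "document_id": ["AAPL_a", "AAPL_b", "MSFT_a"],
    "firm_id": ["AAPL", "AAPL", "MSFT"],
    "call_date": pd.to_datetime(["2021-01-28", "2021-04-29", "2021-07-27"]),
})

# Calendar-quarter labels such as "2021Q1". Firms whose fiscal year does not
# follow the calendar year would need their own date-to-quarter mapping.
meta["time"] = meta["call_date"].dt.to_period("Q").astype(str)

id_to_firm_q = meta[["document_id", "firm_id", "time"]]
print(id_to_firm_q)
```

The resulting id_to_firm_q has the three canonical columns and can be passed in place of a hand-built mapping.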
Gotcha: document IDs that do not match
p.firm_year left-merges id_to_firm onto the document scores, so every scored
document is kept through the merge. Any doc_id that is not in
id_to_firm["document_id"] gets NaN for firm_id and time and drops out at
aggregation, because groupby discards NaN keys. Any document_id in
id_to_firm that has no matching score in scores_TFIDF.csv is silently
ignored. Sanity-check counts before running regressions:
scores = p.score_df("TFIDF")
expected = set(scores["Doc_ID"])
mapped = set(id_to_firm["document_id"])
print(f"scored but unmapped: {len(expected - mapped)}")
print(f"mapped but unscored: {len(mapped - expected)}")
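Counts say how many IDs are mismatched but not which ones. A sketch of one way to list them, using the indicator option of pandas merge on toy stand-ins for the two tables (the IDs are made up, and Doc_ID mirrors the score column used in the check above):

```python
import pandas as pd

# Toy stand-ins for the score table and the mapping table.
scores = pd.DataFrame({"Doc_ID": ["AAPL_2021Q1", "AAPL_2021Q2", "IBM_2021Q1"]})
id_to_firm = pd.DataFrame({
    "document_id": ["AAPL_2021Q1", "AAPL_2021Q2", "MSFT_2021Q1"],
    "firm_id": ["AAPL", "AAPL", "MSFT"],
    "time": [2021, 2021, 2021],
})

# An outer merge with indicator=True labels each row by where it was found.
check = scores.merge(
    id_to_firm, left_on="Doc_ID", right_on="document_id",
    how="outer", indicator=True,
)
unmapped = check.loc[check["_merge"] == "left_only", "Doc_ID"].tolist()
unscored = check.loc[check["_merge"] == "right_only", "document_id"].tolist()
print("scored but unmapped:", unmapped)
print("mapped but unscored:", unscored)
```

Listing the offending IDs usually reveals the cause quickly, for example a suffix or casing difference between the two sources.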
Gotcha: expected column names
The aggregator reads specific column names from id_to_firm. If your
DataFrame uses other names, such as transcript_id / gvkey / fyear, rename
before passing:
id_to_firm = firm_data.rename(columns={
    "transcript_id": "document_id",
    "gvkey": "firm_id",
    "fyear": "time",
})
The lower-level aggregate_to_firm_year function takes id_col=,
firm_col=, time_col= kwargs if renaming is inconvenient, but
Pipeline.firm_year does not expose those and always expects the canonical
three names.
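When the mapping comes from several sources with different column names, a small wrapper can do the rename and fail early if a column is missing. The helper below, to_canonical, is hypothetical and not part of the library; the sample identifiers are placeholders.

```python
import pandas as pd

REQUIRED = ["document_id", "firm_id", "time"]


def to_canonical(df, id_col, firm_col, time_col):
    """Return a copy of df reduced to the three canonical mapping columns.

    Hypothetical helper for illustration; not part of the library.
    """
    missing = [c for c in (id_col, firm_col, time_col) if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    renamed = df.rename(columns={
        id_col: "document_id",
        firm_col: "firm_id",
        time_col: "time",
    })
    return renamed[REQUIRED]


# Placeholder data; the identifiers are made up.
firm_data = pd.DataFrame({
    "transcript_id": ["t1", "t2"],
    "gvkey": ["000001", "000002"],
    "fyear": [2021, 2021],
    "extra": [1, 2],
})
id_to_firm = to_canonical(firm_data, "transcript_id", "gvkey", "fyear")
print(list(id_to_firm.columns))
```

The original firm_data is left untouched, since rename returns a new DataFrame, and extra columns are dropped so that only the three expected names reach firm_year.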