`lmsy_w2v_rfs`: Word2Vec dictionary expansion and document scoring

lmsy_w2v_rfs implements the Word2Vec seed-expansion method for document scoring introduced in Li, Mai, Shen, and Yan (2021). A researcher specifies a small set of seed words for each concept to be measured; the package trains Word2Vec on the target corpus, expands each concept's seeds into a corpus-specific dictionary of related words and multi-word phrases, and produces document-level scores by TF-IDF–weighted dictionary matching.

Citation

If you find the package useful, please cite the paper the method is based on:

Li, Kai, Feng Mai, Rui Shen, and Xinyan Yan (2021), "Measuring Corporate Culture Using Machine Learning," Review of Financial Studies 34(7):3265–3315, doi.org/10.1093/rfs/hhaa079.

BibTeX

@article{li2021measuring,
  title={Measuring Corporate Culture Using Machine Learning},
  author={Li, Kai and Mai, Feng and Shen, Rui and Yan, Xinyan},
  journal={The Review of Financial Studies},
  volume={34}, number={7}, pages={3265--3315}, year={2021},
  doi={10.1093/rfs/hhaa079}
}

This package is an updated, general-purpose implementation of the paper's method. The original code for the paper is at MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning.

Install

pip install -U lmsy_w2v_rfs

The base install runs out of the box with preprocessor="none" (whitespace tokenization). For richer Phase 1 parsing (lemmatization, named-entity masking, and dependency-based multi-word expressions) install an optional backend:

pip install -U "lmsy_w2v_rfs[spacy]" && python -m spacy download en_core_web_sm

To reproduce the 2021 paper, use the CoreNLP backend (slower; needs Java and a one-time ~1 GB download):

pip install -U "lmsy_w2v_rfs[corenlp]"
lmsy-w2v-rfs download-corenlp

Quickstart

Try it now with no install: the demo runs on a bundled 2,000-review corpus.

Researchers usually start from a table of documents. Point the pipeline at a CSV, declare a few seed words per concept, and run:

from lmsy_w2v_rfs import Pipeline, Config

seeds = {
    "innovation":   ["innovation", "innovative", "creativity", "creative"],
    "teamwork":     ["teamwork", "collaboration", "collaborate", "supportive"],
    "compensation": ["pay", "salary", "compensation", "benefits", "bonus"],
}

p = Pipeline.from_csv(
    "reviews.csv", text_col="text", id_col="review_id",
    work_dir="runs/quickstart",
    config=Config(seeds=seeds),     # preprocessor="none" by default
)
p.run()                     # phrase + train + expand + score
p.show_dictionary(top_k=10) # inspect the expanded dictionary
print(p.score_df("TFIDF"))  # per-document scores

The first two concepts come from the paper's culture construct; compensation is a concept outside the five culture dimensions, included to show the method generalizes. On a corpus of employee reviews, the expansion fills each concept with the corpus's own vocabulary — note that the seeds never mentioned 401k_match, dental, or mentorship:

=== innovation ===
  seeds:    innovation, innovative, creativity, creative
  expanded: creativity, entrepreneurial, passion, open_communication, fostering
=== teamwork ===
  seeds:    teamwork, collaboration, collaborate, supportive
  expanded: collaboration, inclusion, mutual_respect, caring, fosters
=== compensation ===
  seeds:    pay, salary, compensation, benefits, bonus
  expanded: salary, competitive, 401k_match, dental, medical, bonuses
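The quickstart expects a CSV with one document per row, an identifier column, and a text column. The following is a minimal sketch of building such a file with the standard library; the column names `review_id` and `text` match the quickstart call, while the two rows are made-up placeholder text and the helper names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Made-up placeholder documents; real input would be your own corpus.
rows = [
    {"review_id": "r1", "text": "Great pay and a supportive team."},
    {"review_id": "r2", "text": "Management encourages creative, innovative work."},
]

def write_reviews_csv(path, rows):
    """Write one document per row with the columns the quickstart names."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["review_id", "text"])
        writer.writeheader()
        writer.writerows(rows)

def read_reviews_csv(path):
    """Read the file back to confirm the layout round-trips."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "reviews.csv"
    write_reviews_csv(csv_path, rows)
    loaded = read_reviews_csv(csv_path)
```

A file laid out this way is what `text_col="text"` and `id_col="review_id"` in the quickstart refer to.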

Reproducing Li et al. (2021)

The package ships the paper's 47 seed words across five culture dimensions, and the CoreNLP backend reproduces the paper's Phase 1 parsing:

from lmsy_w2v_rfs import Pipeline, Config, load_example_seeds

seeds = load_example_seeds("culture_2021")    # 47 seeds, 5 dimensions
config = Config(seeds=seeds, preprocessor="corenlp")  # needs Java; see Install

The construction procedure

The package implements the four-step construction procedure of Li et al. (2021). Each step is a method on Pipeline; calling .run() executes them in order and saves intermediate artifacts under work_dir/ so any step can be redone without redoing the others.

Step 1: Two-step phrase construction

Phrases carry meaning that single words cannot. The package extracts them in two complementary steps targeting different kinds of phrases.

Step 1a, parser-based (general-English phrases). A dependency parser identifies fixed multiword expressions (with_respect_to, rather_than) and compound words (intellectual_property, healthcare_provider). The parser also lemmatizes (stocks → stock) and masks named entities with placeholders such as [NER:ORG] so proper nouns do not bias the vector space. Tokens on the 121-token SRAF generic stopword list are removed in the cleaning pass that follows.

`Config(preprocessor=...)`	Backend	Needs
`"none"` (default)	whitespace tokenize, lowercase only	base install
`"static"`	NLTK `MWETokenizer` over a curated list	base install
`"spacy"`	spaCy (lemmas, NER, dependency MWEs)	`[spacy]` extra + a model
`"corenlp"` (paper-faithful)	Stanford CoreNLP via `stanza.server`	`[corenlp]` extra + Java
`"stanza"`	stanza `Pipeline`	`[stanza]` extra
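To make the idea of joining curated multiword expressions concrete, here is a minimal pure-Python sketch of a greedy longest-match joiner. It illustrates the kind of rewriting a list-based backend such as `"static"` performs; it is not the package's code, and `join_mwes` and the three-entry curated list are hypothetical.

```python
def join_mwes(tokens, mwes, sep="_"):
    """Greedily join curated multiword expressions, longest match first."""
    # Index candidate expressions by their first token for fast lookup.
    by_first = {}
    for mwe in mwes:
        by_first.setdefault(mwe[0], []).append(tuple(mwe))
    for candidates in by_first.values():
        candidates.sort(key=len, reverse=True)

    out, i = [], 0
    while i < len(tokens):
        for cand in by_first.get(tokens[i], ()):
            if tuple(tokens[i:i + len(cand)]) == cand:
                out.append(sep.join(cand))  # emit the joined phrase
                i += len(cand)
                break
        else:
            out.append(tokens[i])           # no expression starts here
            i += 1
    return out

# Hypothetical curated list, using examples named in the text above.
curated = [("with", "respect", "to"), ("rather", "than"), ("intellectual", "property")]
tokens = "we protect intellectual property rather than license it".split()
joined = join_mwes(tokens, curated)
```

The parser-based backends find such expressions from the dependency parse instead of a fixed list, but the output convention (tokens joined with `_`) is the same.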

Step 1b, statistical (corpus-specific phrases). After Step 1a, gensim's Phrases scans the parsed corpus for statistically significant adjacent-token co-occurrences and joins them with _. A second pass over the bigram-joined corpus learns trigrams. This step identifies recurring collocations specific to the corpus: an earnings-call corpus surfaces forward_looking_statement and cost_of_capital; a product-review corpus surfaces customer_service and delivery_time; a Glassdoor corpus surfaces work_life_balance and growth_opportunity.

from lmsy_w2v_rfs import Config, load_example_seeds

seeds = load_example_seeds("culture_2021")  # or any dict[str, list[str]]
config = Config(
    seeds=seeds,
    use_gensim_phrases=True,
    phrase_passes=2,            # 1 = bigrams; 2 = bigrams + trigrams
    phrase_min_count=10,        # default; works on a ~270k-doc corpus
    phrase_threshold=10.0,      # default; for smaller corpora try phrase_min_count=3, phrase_threshold=5.0
)

The phrase-tagged corpus is written to work_dir/corpora/pass2.txt and can be opened directly to inspect the joined phrases.
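The two-pass idea can be sketched without gensim. The code below scores each adjacent pair with a count-based statistic in the spirit of gensim's default scorer, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, joins pairs above the threshold, and then repeats on the joined corpus so that bigrams can grow into trigrams. It is a simplified illustration under stated assumptions: here `vocab_size` is the number of distinct tokens, gensim's own bookkeeping differs in detail, and the helper names and toy corpus are hypothetical.

```python
from collections import Counter

def learn_bigrams(sentences, min_count, threshold):
    """Return the set of adjacent pairs that pass the score threshold."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: distinct tokens only
    keep = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if n_ab >= min_count and score > threshold:
            keep.add((a, b))
    return keep

def apply_bigrams(sentence, keep, sep="_"):
    """Join accepted pairs left to right, without overlapping matches."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in keep:
            out.append(sentence[i] + sep + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

# Toy corpus; real thresholds are far higher (see the Config values above).
corpus = [
    ["great", "work", "life", "balance"],
    ["poor", "work", "life", "balance"],
    ["work", "life", "balance", "matters"],
    ["life", "is", "great"],
]
rules1 = learn_bigrams(corpus, min_count=1, threshold=1.0)
pass1 = [apply_bigrams(s, rules1) for s in corpus]   # bigrams
rules2 = learn_bigrams(pass1, min_count=1, threshold=1.0)
pass2 = [apply_bigrams(s, rules2) for s in pass1]    # bigrams + trigrams
```

On this toy corpus the first pass joins `work_life`, and the second pass extends it to `work_life_balance`, which mirrors what `phrase_passes=2` is described as doing.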

Step 2: Word2Vec

Pipeline.train() fits a gensim.models.Word2Vec on the phrase-tagged corpus. Every word and phrase that occurs at least w2v_min_count times receives a 300-dimensional vector. Defaults match the 2021 paper:

from lmsy_w2v_rfs import Config, load_example_seeds

seeds = load_example_seeds("culture_2021")  # or any dict[str, list[str]]
config = Config(seeds=seeds, w2v_dim=300, w2v_window=5, w2v_min_count=5, w2v_epochs=20)

The model is saved at work_dir/models/w2v.mod and is available as p.w2v for ad-hoc queries.

Step 3: Seed expansion

Pipeline.expand_dictionary() builds the per-concept dictionary by:

1. Averaging the in-vocabulary seed vectors for the concept.
2. Taking the top n_words_dim (default 500) tokens by cosine similarity to that mean.
3. Resolving cross-loadings: a token close to multiple concepts is assigned to the one whose seed mean it is closest to.
4. Dropping [NER:*] placeholders so named entities never enter the dictionary.
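The four steps above can be sketched in pure Python on toy two-dimensional vectors. This is an illustration of the logic only, not the package's implementation; `expand_seeds`, `cosine`, and the vectors are hypothetical, and a real run would use the trained 300-dimensional Word2Vec vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand_seeds(vectors, seeds, top_n):
    # Step 1: mean of the seed vectors that are actually in the vocabulary.
    means = {}
    for concept, words in seeds.items():
        present = [vectors[w] for w in words if w in vectors]
        if present:
            means[concept] = [sum(col) / len(present) for col in zip(*present)]

    # Step 2: top_n tokens per concept by cosine similarity to the seed mean.
    # Step 3: a token claimed by several concepts stays with the closest one.
    best = {}
    for concept, mean in means.items():
        ranked = sorted(((cosine(vec, mean), tok) for tok, vec in vectors.items()),
                        reverse=True)
        for sim, tok in ranked[:top_n]:
            if tok not in best or sim > best[tok][0]:
                best[tok] = (sim, concept)

    # Step 4: drop named-entity placeholders, then sort by similarity.
    expanded = {concept: [] for concept in seeds}
    for tok, (sim, concept) in best.items():
        if not tok.startswith("[NER:"):
            expanded[concept].append((sim, tok))
    return {c: [tok for _, tok in sorted(items, reverse=True)]
            for c, items in expanded.items()}

# Toy vectors: axis 0 leans "innovation", axis 1 leans "teamwork".
vectors = {
    "innovation": [1.0, 0.0], "creative": [0.9, 0.1], "entrepreneurial": [0.8, 0.3],
    "teamwork": [0.0, 1.0], "collaborate": [0.1, 0.9], "supportive_culture": [0.3, 0.8],
    "[NER:ORG]": [0.7, 0.2],
}
seeds = {
    "innovation": ["innovation", "creative", "not_in_vocab"],
    "teamwork": ["teamwork", "collaborate"],
}
expanded = expand_seeds(vectors, seeds, top_n=4)
```

Here `entrepreneurial` falls in the top four for both concepts and is assigned to innovation, whose seed mean it is closer to, while the `[NER:ORG]` placeholder is dropped even though it ranks highly.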

The result is written to work_dir/outputs/expanded_dict.csv, one column per concept, sorted by descending similarity to the seed mean.

p.show_dictionary(top_k=10)         # prints per-concept seeds + top expansions
p.dictionary_preview(top_k=10)      # DataFrame for notebook display

Step 4: Manual dictionary inspection

Nearest-neighbor expansion surfaces noise: off-topic terms, industry-specific outliers, words too general to be informative. There are two ways to remove them; both update the in-memory dictionary and the on-disk CSV atomically:

# Programmatic, replicable in a notebook:
p.edit_dictionary(
    remove={"innovation": ["fantastic", "incredible"]},
    add={"innovation": ["patent"]},
)

# Spreadsheet-driven, faster on a big dictionary:
#   1. open p.dict_path in Excel or any text editor
#   2. edit, save
#   3. p.reload_dictionary()

Cached scores are dropped after curation. Call p.score() to rescore against the curated dictionary.

Scoring

A document's score on a concept is the sum of TF-IDF weights for every dictionary token present in the document, divided by total document length.

Method	Weight per dictionary hit	Source
`TFIDF`	`tf · log(N/df)`	2021 paper (the published measure)
`TF`	`tf`	alternative
`WFIDF`	`(1 + log tf) · log(N/df)`	alternative (sublinear `tf`)
`TFIDF+SIMWEIGHT`, `WFIDF+SIMWEIGHT`	× `1/ln(2 + rank)`	rank-weighted variant

The +SIMWEIGHT variants additionally weight each word by its rank in the similarity-ordered dictionary (1/ln(2 + rank)), so words nearer the seed centroid count more and peripheral expansion words count less. The weight depends on rank alone — the cosine similarities enter only by setting that ranking. This rank-based similarity weighting is the scheme several studies building on the method have adopted.

p.score(methods=("TFIDF",))
p.score_df("TFIDF")

Outputs land at work_dir/outputs/scores_<METHOD>.csv.
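The weighting schemes in the table can be sketched in a few lines of standard-library Python. This is an illustration of the formulas, not the package's scorer. Two points are assumptions and are marked in the code: `rank` is taken to start at 0 for the word most similar to the seed mean, and document length is the token count. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def score_documents(docs, dictionary, method="TFIDF", sim_weight=False):
    """Score tokenized docs against {concept: [words in similarity order]}."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        row = {}
        for concept, words in dictionary.items():
            total = 0.0
            for rank, word in enumerate(words):  # assumption: rank starts at 0
                if tf[word] == 0:
                    continue
                idf = math.log(n_docs / df[word])
                if method == "TF":
                    weight = tf[word]
                elif method == "TFIDF":
                    weight = tf[word] * idf
                elif method == "WFIDF":
                    weight = (1 + math.log(tf[word])) * idf
                else:
                    raise ValueError(f"unknown method: {method}")
                if sim_weight:
                    weight /= math.log(2 + rank)  # the 1/ln(2 + rank) factor
                total += weight
            # assumption: document length is the number of tokens
            row[concept] = total / len(doc) if doc else 0.0
        scores.append(row)
    return scores

docs = [
    ["great", "pay", "pay", "bonus"],
    ["great", "team", "office"],
    ["team", "office", "lunch"],
]
dictionary = {"compensation": ["pay", "bonus"]}
tfidf = score_documents(docs, dictionary, "TFIDF")
tfidf_sim = score_documents(docs, dictionary, "TFIDF", sim_weight=True)
```

In the first document, `pay` (tf = 2) and `bonus` (tf = 1) each appear in one of three documents, so the TFIDF score is `3 · ln(3) / 4`; with the rank weight, `pay` at rank 0 is scaled by `1/ln(2)` and `bonus` at rank 1 by `1/ln(3)`.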

Which words drive a dimension?

To validate dictionary quality, decompose each dimension's score into the contribution of each dictionary word across the corpus:

contrib = p.word_contributions("TFIDF")   # dimension, word, contribution, relative, cumulative

This writes work_dir/outputs/word_contributions_<METHOD>.csv and shows, per dimension, each word's share and the running cumulative share — the standard way to check that (say) innovation is driven by genuine innovation terms rather than a few high-IDF artifacts.
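The share and running cumulative share can be sketched as follows. The input is assumed to be each dictionary word's weight already summed across the corpus; how the package aggregates that sum is not shown here, and the function name and numbers are hypothetical. The output columns mirror the ones named above (dimension, word, contribution, relative, cumulative).

```python
def contribution_table(per_word_totals):
    """per_word_totals: {dimension: {word: weight summed over the corpus}}."""
    rows = []
    for dimension, totals in per_word_totals.items():
        grand_total = sum(totals.values())
        cumulative = 0.0
        # Largest contributors first, so the cumulative column is meaningful.
        for word, contribution in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
            relative = contribution / grand_total if grand_total else 0.0
            cumulative += relative
            rows.append((dimension, word, contribution, relative, cumulative))
    return rows

# Made-up totals for illustration only.
totals = {"innovation": {"creativity": 3.0, "innovative": 4.0, "fantastic": 1.0}}
table = contribution_table(totals)
for row in table:
    print(row)
```

A dimension whose cumulative share reaches, say, 0.9 after only a handful of generic words is a signal to revisit the dictionary in Step 4.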

Large corpora

Once parsing finishes, downstream stages stream from disk: clean reads parsed sentences line by line; phrase and train use gensim's PathLineSentences so the training corpus is never fully materialized in memory. The bottleneck is the input stage: the document loader holds the corpus in a Python list before parsing begins.
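The streaming pattern can be illustrated with a small restartable iterable. This is a minimal sketch of the idea behind line-by-line corpus readers, not gensim's `PathLineSentences` itself; the `LineSentences` class and the temporary file are hypothetical.

```python
import tempfile
from pathlib import Path

class LineSentences:
    """Yield one whitespace-tokenized sentence per line, reading lazily.

    Defining __iter__ (rather than using a one-shot generator) lets a
    trainer make several passes over the file without loading it whole.
    """

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                tokens = line.split()
                if tokens:          # skip blank lines
                    yield tokens

with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "corpus.txt"
    corpus_path.write_text(
        "great work_life_balance\n\npoor customer_service here\n",
        encoding="utf-8",
    )
    stream = LineSentences(corpus_path)
    first_pass = list(stream)   # first epoch
    second_pass = list(stream)  # the same object can be iterated again
```

Only one line is held in memory at a time, which is why the stages after parsing scale to corpora larger than RAM.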

For corpora beyond a few hundred thousand documents, or when running on a cluster, see the Run on HPC how-to for the multi-shard workflow, SLURM and SGE templates, and BLAS thread-cap instructions.

Configuration parameters

Config(
    seeds=...,                         # required: dict[str, list[str]]

    # Step 1a
    preprocessor="none",               # "none" | "static" | "spacy" | "corenlp" | "stanza"
    mwe_list=None,                     # None | "finance" | path to a curated list
    spacy_model="en_core_web_sm",
    parse_chunk_size=0,                # >0 processes docs in batches (caps memory on big corpora)
    n_cores=4,
    corenlp_memory="6G",
    corenlp_port=9002,
    corenlp_timeout_ms=120_000,        # per-request CoreNLP timeout (ms)
    corenlp_max_char_length=1_000_000, # raise for very long transcripts
    corenlp_properties={},             # extra CoreNLP server properties (override/add)

    # Step 1b
    use_gensim_phrases=True,
    phrase_passes=2,
    phrase_threshold=10.0,
    phrase_min_count=10,
    phrase_extra={},                   # extra kwargs -> gensim Phrases (e.g. {"scoring": "npmi"})

    # Step 2
    w2v_dim=300,
    w2v_window=5,
    w2v_min_count=5,
    w2v_epochs=20,
    w2v_sg=0,                          # 0 = CBOW (matches the original); 1 = skip-gram
    w2v_extra={},                      # extra kwargs -> gensim Word2Vec (e.g. {"negative": 10, "hs": 0})

    # Step 3
    n_words_dim=500,                   # package default: top-k expanded words per dimension
    dict_restrict_vocab=None,
    min_similarity=0.0,

    # Scoring (extensions beyond the 2021 paper)
    tfidf_normalize=False,
    zca_whiten=False,                  # ZCA-decorrelate the concept columns; see docs/how-to/whiten-scores.md
    zca_epsilon=1e-6,

    random_state=42,
)

Resources

Documentation (concepts, how-to guides, API reference): maifeng.github.io/lmsy_w2v_rfs
Original implementation: MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning
License: MIT. The citation and BibTeX are at the top.

Next in the documentation

Concepts goes deeper on two-phase preprocessing, the Word2Vec dictionary, and scoring.
How-to has task-oriented recipes: load documents, use your own seeds, switch preprocessor, install CoreNLP, whiten scores, resume after a crash, aggregate document scores, run on HPC, run from CLI, troubleshooting.
Reference is the full API: Pipeline, Config, preprocessors, dictionary, scoring, Word2Vec, seeds, CLI.
Explanation covers the "why": design decisions, preprocessor comparison.
