
Dictionary expansion

The seed-expansion kernel grows each dimension's seed word list into a ranked dictionary using the trained Word2Vec model. The logic mirrors the RFS 2021 replication repo's culture/culture_dictionary.py step by step, ported to the gensim 4 API (model.wv.key_to_index). Pipeline.expand_dictionary wires these functions together; the same primitives are exported for researchers who want to run the expansion against their own Word2Vec model.

expand_words_dimension_mean

Averages the in-vocab seed vectors for each dimension, takes the top-n nearest neighbors of that mean, and filters out NER tokens and cross-dimension seeds.

Expand each dimension's seed list by mean-vector nearest neighbors.

For each dimension: average the in-vocab seed vectors, find the top-n words by cosine similarity, filter out NER tokens, cross-dimension seeds, and any user-supplied stop set, then combine with the seeds.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Word2Vec | Trained Word2Vec model. | required |
| seeds | dict[str, list[str]] | Mapping of dimension name to seed words. | required |
| n | int | Number of nearest neighbors to retrieve per dimension (top-n). | 500 |
| restrict_vocab | float \| None | Restrict the search to the top fraction of the vocab by frequency, or None to use the full vocab. | None |
| min_similarity | float | Discard candidates below this cosine similarity. | 0.0 |
| filter_words | set[str] \| None | Additional words to drop from expansion results. | None |

Returns:

| Type | Description |
| --- | --- |
| dict[str, set[str]] | Mapping of dimension name to expanded word set. |
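The steps described above can be sketched without gensim. This is a minimal illustration, not the library's implementation: `expand_sketch` is a hypothetical name, `vectors` is a plain dict standing in for `model.wv`, the `"[ner:"` prefix test is an assumed NER-token convention, and `restrict_vocab` is omitted. The sketch averages raw vectors, whereas a real model may normalize them first. Whether out-of-vocab seeds are kept in the result is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vecs):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def expand_sketch(vectors, seeds, n=500, min_similarity=0.0, filter_words=None):
    """Hypothetical stand-in for the expansion step, using a dict of vectors."""
    filter_words = filter_words or set()
    all_seeds = {w for words in seeds.values() for w in words}
    expanded = {}
    for dim, words in seeds.items():
        in_vocab = [w for w in words if w in vectors]
        if not in_vocab:
            expanded[dim] = set(words)
            continue
        center = mean_vector([vectors[w] for w in in_vocab])
        # Score every non-seed word of this dimension against the seed mean.
        scored = sorted(
            ((cosine(center, vec), w) for w, vec in vectors.items() if w not in in_vocab),
            reverse=True,
        )
        neighbors = [w for sim, w in scored[:n] if sim >= min_similarity]
        # Filters run after the top-n cut, so fewer than n words may survive.
        kept = {
            w
            for w in neighbors
            if w not in all_seeds
            and w not in filter_words
            and not w.startswith("[ner:")  # assumed NER-token marker
        }
        expanded[dim] = kept | set(words)
    return expanded


# Toy vocabulary (made-up 2-d vectors, for illustration only).
toy_vectors = {
    "honest": [1.0, 0.0],
    "integrity": [0.9, 0.1],
    "team": [0.0, 1.0],
    "collaborate": [0.1, 0.9],
    "[ner:org]": [1.0, 0.05],
}
toy_seeds = {"integrity": ["honest"], "teamwork": ["team"]}
expanded = expand_sketch(toy_vectors, toy_seeds, n=2, min_similarity=0.5)
```

With `min_similarity=0.0` the word `integrity` would land in both dimensions in this toy example, which is the cross-loading case that `deduplicate_keywords` resolves.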

deduplicate_keywords

Assigns words that loaded onto multiple dimensions to their single most similar dimension.

Assign cross-loading words to their most similar dimension.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Word2Vec | Trained Word2Vec model. | required |
| expanded | dict[str, set[str]] | Output of expand_words_dimension_mean. | required |
| seeds | dict[str, list[str]] | Original seed lists (only in-vocab entries are used). | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, set[str]] | Deduplicated expansion mapping. |
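One way to picture the deduplication is the sketch below. It is an illustration under assumptions rather than the library's code: `deduplicate_sketch` is a hypothetical name, `vectors` is a plain dict standing in for `model.wv`, "most similar dimension" is taken to mean the highest cosine similarity to that dimension's seed mean, and every duplicated word is assumed to be in the vocabulary.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vecs):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def deduplicate_sketch(vectors, expanded, seeds):
    """Hypothetical stand-in: keep each cross-loading word in one dimension only."""
    counts = Counter(w for words in expanded.values() for w in words)
    duplicates = {w for w, c in counts.items() if c > 1}
    # Seed mean per dimension, built from in-vocab seeds only.
    centers = {
        dim: mean_vector([vectors[w] for w in words if w in vectors])
        for dim, words in seeds.items()
    }
    # Remove duplicates everywhere, then re-add each to its best dimension.
    result = {dim: set(words) - duplicates for dim, words in expanded.items()}
    for word in duplicates:
        best = max(result, key=lambda dim: cosine(vectors[word], centers[dim]))
        result[best].add(word)
    return result


# Toy data (made-up vectors): "integrity" loads onto both dimensions.
dedup_vectors = {
    "honest": [1.0, 0.0],
    "integrity": [0.9, 0.1],
    "team": [0.0, 1.0],
    "collaborate": [0.1, 0.9],
}
dedup_seeds = {"integrity": ["honest"], "teamwork": ["team"]}
cross_loaded = {
    "integrity": {"honest", "integrity"},
    "teamwork": {"team", "collaborate", "integrity"},
}
deduplicated = deduplicate_sketch(dedup_vectors, cross_loaded, dedup_seeds)
```

In this toy example `integrity` sits much closer to the `honest` seed than to `team`, so it stays in the first dimension and is removed from the second.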

rank_by_similarity

Sorts each dimension's words by cosine similarity to its seed mean.

Sort each dimension's words by similarity to the seed mean.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| expanded | dict[str, set[str]] | Deduplicated expansion. | required |
| seeds | dict[str, list[str]] | Original seed lists. | required |
| model | Word2Vec | Trained Word2Vec model. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, list[str]] | Mapping of dimension to sorted word list. |
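The ranking step can be illustrated as follows. This is a sketch under assumptions: `rank_sketch` is a hypothetical name, `vectors` is a plain dict standing in for `model.wv`, the order is assumed to be most similar first, and words missing from the vocabulary are simply skipped here (the library may treat them differently).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vecs):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def rank_sketch(vectors, expanded, seeds):
    """Hypothetical stand-in: sort each dimension's words by similarity to its seed mean."""
    ranked = {}
    for dim, words in expanded.items():
        center = mean_vector([vectors[w] for w in seeds[dim] if w in vectors])
        ranked[dim] = sorted(
            (w for w in words if w in vectors),  # assumption: skip out-of-vocab words
            key=lambda w: cosine(vectors[w], center),
            reverse=True,  # assumption: most similar first
        )
    return ranked


# Toy data (made-up vectors, for illustration only).
rank_vectors = {
    "honest": [1.0, 0.0],
    "integrity": [0.9, 0.1],
    "ethical": [0.7, 0.3],
}
ranked = rank_sketch(
    rank_vectors,
    {"integrity": {"ethical", "honest", "integrity"}},
    {"integrity": ["honest"]},
)
```

Because the input is a set, the output order depends only on the similarity scores, not on insertion order.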

similarity_weights

Computes the 1 / ln(2 + rank) per-word weights used by the TFIDF+SIMWEIGHT and WFIDF+SIMWEIGHT scoring methods. See Scoring.

Compute the 1 / ln(2 + rank) word weights used for SIMWEIGHT scoring.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| culture_dict | dict[str, list[str]] | Mapping of dimension to rank-sorted word list. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, float] | Mapping of word to weight. |
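The weighting formula is simple enough to show directly. The sketch below assumes that rank is zero-based, so the top-ranked word in each dimension gets 1 / ln(2); `similarity_weights_sketch` is a hypothetical name used only for illustration.

```python
import math


def similarity_weights_sketch(culture_dict):
    """Hypothetical stand-in: weight each word by 1 / ln(2 + rank)."""
    weights = {}
    for words in culture_dict.values():
        # Assumption: rank starts at 0 for the most similar word.
        for rank, word in enumerate(words):
            weights[word] = 1 / math.log(2 + rank)
    return weights


weights = similarity_weights_sketch({"integrity": ["honest", "ethical", "fair"]})
```

Under that assumption the three words receive 1 / ln 2 ≈ 1.4427, 1 / ln 3 ≈ 0.9102, and 1 / ln 4 ≈ 0.7213, so weights decay slowly as rank increases.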

read_dict_csv

Reads a dictionary CSV produced by write_dict_csv back into a (dimension_to_words, all_words) tuple.

Read an expanded dictionary CSV.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path \| str | CSV produced by write_dict_csv. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[dict[str, list[str]], set[str]] | (dimension_to_words, all_words). |

write_dict_csv

Writes an expanded dictionary to CSV with one column per dimension.

Write an expanded dictionary to CSV (one column per dimension).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| culture_dict | dict[str, list[str]] | Mapping of dimension to word list. | required |
| path | Path \| str | Destination CSV. | required |

Returns:

| Type | Description |
| --- | --- |
| Path | The destination path. |
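A round trip through a one-column-per-dimension CSV might look like the sketch below. The exact file layout is an assumption: a header row of dimension names, with shorter columns padded by empty cells. `write_sketch` and `read_sketch` are hypothetical names built on the standard library, not the functions documented above.

```python
import csv
import os
import tempfile
from itertools import zip_longest
from pathlib import Path


def write_sketch(culture_dict, path):
    """Hypothetical stand-in: write one column per dimension, padding short columns."""
    path = Path(path)
    dims = list(culture_dict)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(dims)  # header row: dimension names
        writer.writerows(zip_longest(*(culture_dict[d] for d in dims), fillvalue=""))
    return path


def read_sketch(path):
    """Hypothetical stand-in: read the CSV back into (dimension_to_words, all_words)."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        dims = next(reader)
        dimension_to_words = {d: [] for d in dims}
        for row in reader:
            for dim, cell in zip(dims, row):
                if cell:  # skip the padding in shorter columns
                    dimension_to_words[dim].append(cell)
    all_words = {w for words in dimension_to_words.values() for w in words}
    return dimension_to_words, all_words


original = {"integrity": ["honest", "ethical"], "teamwork": ["team"]}
with tempfile.TemporaryDirectory() as tmp:
    written = write_sketch(original, os.path.join(tmp, "expanded_dict.csv"))
    round_trip, all_words = read_sketch(written)
```

Row order within each column is preserved, which matters because the word order encodes the similarity rank used by `similarity_weights`.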