Dictionary expansion

The seed-expansion kernel grows each dimension's seed word list into a ranked dictionary using the trained Word2Vec model. The logic mirrors the RFS 2021 replication repo's `culture/culture_dictionary.py` step by step, ported to the gensim 4 API (`model.wv.key_to_index`). `Pipeline.expand_dictionary` wires these functions together; the same primitives are exported for researchers who want to run the expansion against their own Word2Vec model.

expand_words_dimension_mean

Expands each dimension's seed list by mean-vector nearest neighbors.

For each dimension, the function averages the in-vocab seed vectors, finds the top-`n` words by cosine similarity to that mean, and filters out NER tokens, cross-dimension seeds, and any user-supplied stop set (`filter_words`). The surviving candidates are then combined with the seeds.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Word2Vec` | Trained Word2Vec model. | required |
| `seeds` | `dict[str, list[str]]` | Mapping of dimension name to seed words. | required |
| `n` | `int` | Number of nearest neighbors to retrieve per dimension (the top-n cutoff). | `500` |
| `restrict_vocab` | `float \| None` | Restrict the search to this top fraction of the vocab by frequency, or `None` to use the full vocab. | `None` |
| `min_similarity` | `float` | Discard candidates whose cosine similarity falls below this value. | `0.0` |
| `filter_words` | `set[str] \| None` | Additional words to drop from expansion results. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, set[str]]` | Mapping of dimension name to expanded word set. |
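
The steps described for `expand_words_dimension_mean` can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: a plain dict of toy vectors stands in for the Word2Vec model, the helper names `cosine` and `expand_by_seed_mean` are hypothetical, and the `[ner:` prefix used to recognize NER tokens is an assumption about the token format.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_by_seed_mean(vectors, seeds, n=2, min_similarity=0.0, filter_words=None):
    """Hypothetical stand-in: `vectors` is a plain dict, not a Word2Vec model."""
    filter_words = filter_words or set()
    all_seeds = {w for words in seeds.values() for w in words}
    expanded = {}
    for dim, words in seeds.items():
        in_vocab = [w for w in words if w in vectors]
        if not in_vocab:
            expanded[dim] = set(words)
            continue
        # Component-wise mean of the in-vocab seed vectors.
        mean = [sum(col) / len(in_vocab) for col in zip(*(vectors[w] for w in in_vocab))]
        # Rank every word other than this dimension's seeds by cosine to the mean.
        candidates = [w for w in vectors if w not in in_vocab]
        candidates.sort(key=lambda w: cosine(mean, vectors[w]), reverse=True)
        top = [w for w in candidates[:n] if cosine(mean, vectors[w]) >= min_similarity]
        kept = {
            w for w in top
            if not w.startswith("[ner:")  # assumed NER token format
            and w not in all_seeds        # cross-dimension seeds
            and w not in filter_words     # user-supplied stop set
        }
        # Combine the surviving neighbors with the dimension's own seeds.
        expanded[dim] = kept | set(words)
    return expanded


vectors = {
    "innovation": [1.0, 0.0],
    "creativity": [0.9, 0.1],
    "invent": [0.95, 0.05],
    "honesty": [0.0, 1.0],
    "ethics": [0.1, 0.9],
    "trust": [0.05, 0.95],
    "[ner:org]": [0.92, 0.08],
}
seeds = {
    "innovation": ["innovation", "creativity"],
    "integrity": ["honesty", "ethics"],
}
expanded = expand_by_seed_mean(vectors, seeds, n=2)
```

In this toy run the NER token and the cross-dimension seed `creativity` both land in a top-2 list and are dropped, which leaves one new word per dimension.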

deduplicate_keywords

Assigns each word that loaded onto multiple dimensions (a cross-loading word) to the single dimension it is most similar to.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Word2Vec` | Trained Word2Vec model. | required |
| `expanded` | `dict[str, set[str]]` | Output of `expand_words_dimension_mean`. | required |
| `seeds` | `dict[str, list[str]]` | Original seed lists (only in-vocab entries are used). | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, set[str]]` | Deduplicated expansion mapping. |
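
As a rough illustration of the deduplication idea, the sketch below removes every word that appears under more than one dimension and re-adds it to the dimension whose seed mean it is closest to. Measuring "most similar" as cosine similarity to the seed mean is an assumption here, as are the toy vectors and the hypothetical helper names `seed_mean` and `assign_cross_loading`.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_mean(vectors, words):
    """Component-wise mean of the in-vocab seed vectors."""
    in_vocab = [vectors[w] for w in words if w in vectors]
    return [sum(col) / len(in_vocab) for col in zip(*in_vocab)]


def assign_cross_loading(vectors, expanded, seeds):
    """Hypothetical sketch: keep each duplicated word in one dimension only."""
    counts = Counter(w for words in expanded.values() for w in words)
    duplicates = {w for w, c in counts.items() if c > 1}
    # Start from the expansion with all cross-loading words removed.
    result = {dim: set(words) - duplicates for dim, words in expanded.items()}
    means = {dim: seed_mean(vectors, seeds[dim]) for dim in expanded}
    for word in duplicates:
        # Assumption: similarity is cosine between the word and the seed mean.
        best = max(means, key=lambda dim: cosine(means[dim], vectors[word]))
        result[best].add(word)
    return result


vectors = {
    "innovation": [1.0, 0.0],
    "teamwork": [0.0, 1.0],
    "agile": [0.8, 0.2],
}
seeds = {"innovation": ["innovation"], "teamwork": ["teamwork"]}
expanded = {
    "innovation": {"innovation", "agile"},
    "teamwork": {"teamwork", "agile"},
}
deduplicated = assign_cross_loading(vectors, expanded, seeds)
```

Here `agile` points mostly along the `innovation` axis, so it stays in that dimension and is removed from `teamwork`.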

rank_by_similarity

Sorts each dimension's words by cosine similarity to that dimension's seed mean.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `expanded` | `dict[str, set[str]]` | Deduplicated expansion. | required |
| `seeds` | `dict[str, list[str]]` | Original seed lists. | required |
| `model` | `Word2Vec` | Trained Word2Vec model. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, list[str]]` | Mapping of dimension to sorted word list. |
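
A small sketch of the ranking step, under stated assumptions: toy vectors replace the Word2Vec model, every word is assumed to be in the vocab, the sort order is assumed to be most similar first, and `sort_by_seed_similarity` is a hypothetical name rather than the library function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_mean(vectors, words):
    """Component-wise mean of the in-vocab seed vectors."""
    in_vocab = [vectors[w] for w in words if w in vectors]
    return [sum(col) / len(in_vocab) for col in zip(*in_vocab)]


def sort_by_seed_similarity(vectors, expanded, seeds):
    """Hypothetical sketch: order each word set by cosine to the seed mean."""
    ranked = {}
    for dim, words in expanded.items():
        mean = seed_mean(vectors, seeds[dim])
        # Assumption: most similar words come first.
        ranked[dim] = sorted(words, key=lambda w: cosine(mean, vectors[w]), reverse=True)
    return ranked


vectors = {
    "innovation": [1.0, 0.0],
    "creativity": [0.9, 0.1],
    "novel": [0.8, 0.2],
    "invent": [0.6, 0.4],
}
seeds = {"innovation": ["innovation", "creativity"]}
expanded = {"innovation": {"invent", "novel", "creativity", "innovation"}}
ranked = sort_by_seed_similarity(vectors, expanded, seeds)
```

The output is a list rather than a set because the position of each word is what the downstream weighting step consumes.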

similarity_weights

Computes the `1 / ln(2 + rank)` per-word weights used by the TFIDF+SIMWEIGHT and WFIDF+SIMWEIGHT scoring methods. See Scoring.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `culture_dict` | `dict[str, list[str]]` | Mapping of dimension to rank-sorted word list. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, float]` | Mapping of word to weight. |
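
The weighting formula is easy to check by hand. The sketch below applies `1 / ln(2 + rank)` to a rank-sorted dictionary, assuming `rank` is the zero-based position of a word within its dimension's list (the text above does not state the indexing). The function name `rank_weights` is hypothetical.

```python
import math


def rank_weights(culture_dict):
    """Hypothetical sketch of the 1 / ln(2 + rank) weighting."""
    weights = {}
    for words in culture_dict.values():
        # Assumption: rank is the zero-based position in the sorted list.
        for rank, word in enumerate(words):
            weights[word] = 1 / math.log(2 + rank)
    return weights


culture_dict = {
    "innovation": ["innovation", "creativity", "novel"],
    "integrity": ["honesty", "ethics"],
}
weights = rank_weights(culture_dict)
```

Under this assumption the top-ranked word in each dimension gets weight `1 / ln(2)`, and the weight decays slowly as rank grows, so lower-ranked words still contribute to the score.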

read_dict_csv

Reads an expanded dictionary CSV produced by `write_dict_csv` back into a `(dimension_to_words, all_words)` tuple.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `Path \| str` | CSV produced by `write_dict_csv`. | required |

Returns:

| Type | Description |
| --- | --- |
| `tuple[dict[str, list[str]], set[str]]` | `(dimension_to_words, all_words)`. |
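
To make the return shape concrete, here is a minimal sketch of reading a one-column-per-dimension CSV with the standard library. It assumes the first row holds the dimension names and that shorter columns are padded with empty cells; the actual file layout may differ. `read_columns` is a hypothetical helper that takes an open file handle, not the library function.

```python
import csv
import io


def read_columns(handle):
    """Hypothetical sketch: parse a one-column-per-dimension CSV."""
    reader = csv.reader(handle)
    header = next(reader)  # assumed: first row holds the dimension names
    dimension_to_words = {dim: [] for dim in header}
    for row in reader:
        for dim, cell in zip(header, row):
            if cell:  # skip the padding in shorter columns
                dimension_to_words[dim].append(cell)
    all_words = {w for words in dimension_to_words.values() for w in words}
    return dimension_to_words, all_words


sample = "innovation,integrity\ninvent,honesty\ncreative,\n"
dimension_to_words, all_words = read_columns(io.StringIO(sample))
```

Row order is preserved within each column, so a rank-sorted dictionary stays rank-sorted after the round trip.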

write_dict_csv

Writes an expanded dictionary to CSV, with one column per dimension.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `culture_dict` | `dict[str, list[str]]` | Mapping of dimension to word list. | required |
| `path` | `Path \| str` | Destination CSV. | required |

Returns:

| Type | Description |
| --- | --- |
| `Path` | The destination path. |
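
The one-column-per-dimension layout can be sketched with the standard library as follows. This is an illustration, not the library's writer: it assumes a header row of dimension names and empty-cell padding for shorter columns, and `write_columns` is a hypothetical name.

```python
import csv
import tempfile
from itertools import zip_longest
from pathlib import Path


def write_columns(culture_dict, path):
    """Hypothetical sketch: write one CSV column per dimension."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(culture_dict.keys())  # header: dimension names
        # Transpose the word lists into rows, padding short columns with "".
        writer.writerows(zip_longest(*culture_dict.values(), fillvalue=""))
    return path


culture_dict = {
    "innovation": ["invent", "creative"],
    "integrity": ["honesty"],
}
out_dir = tempfile.mkdtemp()
out_path = write_columns(culture_dict, Path(out_dir) / "expanded_dict.csv")
lines = out_path.read_text(encoding="utf-8").splitlines()
```

Because dimensions rarely have the same number of words, some padding convention is needed; empty cells keep the file rectangular and are easy to skip when reading it back.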