Dictionary expansion
The seed-expansion kernel grows each dimension's seed word list into a ranked
dictionary using the trained Word2Vec model. The logic mirrors the RFS 2021
replication repo's `culture/culture_dictionary.py` step by step, ported to the
gensim 4 API (`model.wv.key_to_index`). `Pipeline.expand_dictionary` wires
these functions together; the same primitives are exported for researchers
who want to run the expansion against their own Word2Vec model.
expand_words_dimension_mean
Averages the in-vocab seed vectors for each dimension, takes the top-n nearest neighbors, and filters out NER tokens and cross-dimension seeds.
Expand each dimension's seed list by mean-vector nearest neighbors.
For each dimension: average the in-vocab seed vectors, find the
top-n words by cosine similarity, filter out NER tokens, cross-
dimension seeds, and any user-supplied stop set, then combine with
the seeds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Word2Vec` | Trained Word2Vec model. | required |
| `seeds` | `dict[str, list[str]]` | Mapping of dimension name to seed words. | required |
| `n` | `int` | Top-k expansion per dimension. | `500` |
| `restrict_vocab` | `float \| None` | Restrict to the top fraction of vocab by frequency, or | `None` |
| `min_similarity` | `float` | Discard candidates below this cosine. | `0.0` |
| `filter_words` | `set[str] \| None` | Additional words to drop from expansion results. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, set[str]]` | Mapping of dimension name to expanded word set. |
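
The mean-vector expansion described above can be sketched without gensim, using a plain dict of toy vectors in place of a trained model. This is a minimal illustration under stated assumptions, not the library's implementation: `expand_by_seed_mean`, `cosine`, `mean_vector`, and the toy data are hypothetical names, NER tokens are assumed to start with `[ner:`, the result is combined with the in-vocab seeds only, and `restrict_vocab` is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def expand_by_seed_mean(vectors, seeds, n=500, min_similarity=0.0, filter_words=None):
    """Hypothetical stand-in for the expansion step, on a dict of vectors."""
    filter_words = filter_words or set()
    all_seeds = {w for words in seeds.values() for w in words}
    expanded = {}
    for dim, words in seeds.items():
        in_vocab = [w for w in words if w in vectors]
        if not in_vocab:
            expanded[dim] = set()
            continue
        centre = mean_vector([vectors[w] for w in in_vocab])
        # Score every non-seed word of this dimension against the seed mean.
        scored = sorted(
            ((cosine(centre, vec), w) for w, vec in vectors.items() if w not in in_vocab),
            reverse=True,
        )
        neighbours = [w for sim, w in scored[:n] if sim >= min_similarity]
        kept = {
            w
            for w in neighbours
            if not w.startswith("[ner:")  # assumed NER token marker
            and w not in all_seeds        # drop seeds of any dimension
            and w not in filter_words     # drop the user-supplied stop set
        }
        expanded[dim] = kept | set(in_vocab)
    return expanded


toy_vectors = {
    "innovation": [1.0, 0.0],
    "creative": [0.9, 0.1],
    "novel": [0.8, 0.3],
    "honest": [0.0, 1.0],
    "ethical": [0.1, 0.9],
    "trust": [0.2, 0.8],
    "[ner:org]": [0.95, 0.05],
}
toy_seeds = {
    "innovation": ["innovation", "creative"],
    "integrity": ["honest", "ethical", "missing"],  # "missing" is out of vocab
}
expanded = expand_by_seed_mean(toy_vectors, toy_seeds, n=2)
```

With `n=2`, the word `novel` ends up in both dimensions of the toy data, which is the cross-loading case that the deduplication step is meant to resolve.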
deduplicate_keywords
Assigns words that loaded onto multiple dimensions to their single most similar dimension.
Assign cross-loading words to their most similar dimension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Word2Vec` | Trained Word2Vec model. | required |
| `expanded` | `dict[str, set[str]]` | Output of | required |
| `seeds` | `dict[str, list[str]]` | Original seed lists (only in-vocab entries are used). | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, set[str]]` | Deduplicated expansion mapping. |
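
One way to picture the deduplication is the sketch below. It assumes that "most similar dimension" means the highest cosine similarity to a dimension's seed mean; the actual function may measure similarity differently. `dedupe_by_best_dimension` and the toy data are hypothetical, and every word is assumed to be in the vocabulary.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dedupe_by_best_dimension(vectors, expanded, seeds):
    """Keep each cross-loading word only in its closest dimension (assumed rule)."""
    counts = Counter(w for words in expanded.values() for w in words)
    duplicates = {w for w, c in counts.items() if c > 1}
    # Seed mean per dimension, using in-vocab seeds only.
    centres = {
        dim: mean_vector([vectors[w] for w in words if w in vectors])
        for dim, words in seeds.items()
    }
    result = {dim: set(words) - duplicates for dim, words in expanded.items()}
    for word in duplicates:
        candidates = [dim for dim, words in expanded.items() if word in words]
        best = max(candidates, key=lambda dim: cosine(vectors[word], centres[dim]))
        result[best].add(word)
    return result


toy_vectors = {
    "innovation": [1.0, 0.0],
    "creative": [0.9, 0.1],
    "novel": [0.8, 0.3],
    "honest": [0.0, 1.0],
    "ethical": [0.1, 0.9],
    "trust": [0.2, 0.8],
}
toy_seeds = {
    "innovation": ["innovation", "creative"],
    "integrity": ["honest", "ethical"],
}
toy_expanded = {
    "innovation": {"innovation", "creative", "novel"},
    "integrity": {"honest", "ethical", "trust", "novel"},  # "novel" cross-loads
}
deduplicated = dedupe_by_best_dimension(toy_vectors, toy_expanded, toy_seeds)
```

In the toy data, `novel` points mostly along the first axis, so it stays with `innovation` and is removed from `integrity`.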
rank_by_similarity
Sorts each dimension's words by cosine similarity to its seed mean.
Sort each dimension's words by similarity to the seed mean.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `expanded` | `dict[str, set[str]]` | Deduplicated expansion. | required |
| `seeds` | `dict[str, list[str]]` | Original seed lists. | required |
| `model` | `Word2Vec` | Trained Word2Vec model. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, list[str]]` | Mapping of dimension to sorted word list. |
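
The ranking step can be illustrated with the same toy setup. The sketch below sorts each dimension's words by descending cosine similarity to the mean of that dimension's in-vocab seeds. `rank_words` and the toy data are hypothetical, and skipping out-of-vocab words is an assumption of this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def rank_words(vectors, expanded, seeds):
    """Sort each dimension's words from most to least similar to the seed mean."""
    ranked = {}
    for dim, words in expanded.items():
        centre = mean_vector([vectors[w] for w in seeds[dim] if w in vectors])
        in_vocab = [w for w in words if w in vectors]  # assumption: skip unknown words
        ranked[dim] = sorted(
            in_vocab, key=lambda w: cosine(vectors[w], centre), reverse=True
        )
    return ranked


toy_vectors = {
    "innovation": [1.0, 0.0],
    "creative": [0.9, 0.1],
    "novel": [0.8, 0.3],
    "honest": [0.0, 1.0],
    "ethical": [0.1, 0.9],
    "trust": [0.2, 0.8],
}
toy_seeds = {
    "innovation": ["innovation", "creative"],
    "integrity": ["honest", "ethical"],
}
toy_deduplicated = {
    "innovation": {"innovation", "creative", "novel"},
    "integrity": {"honest", "ethical", "trust"},
}
ranked = rank_words(toy_vectors, toy_deduplicated, toy_seeds)
```

Seed words tend to rank near the top because they contribute to the mean, while expansion words such as `novel` and `trust` fall further down their lists.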
similarity_weights
Computes the 1 / ln(2 + rank) per-word weights used by the TFIDF+SIMWEIGHT
and WFIDF+SIMWEIGHT scoring methods. See Scoring.
Compute the 1 / ln(2 + rank) word weights used for SIMWEIGHT scoring.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `culture_dict` | `dict[str, list[str]]` | Mapping of dimension to rank-sorted word list. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, float]` | Mapping of word to weight. |
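
The weighting formula is small enough to write out directly. The sketch below assumes that `rank` is the zero-based position of a word in its dimension's sorted list, so the top word gets 1 / ln(2); `simweight_sketch` is a hypothetical name, not the exported function.

```python
import math


def simweight_sketch(culture_dict):
    """Map each word to 1 / ln(2 + rank), with rank assumed to be zero-based."""
    weights = {}
    for words in culture_dict.values():
        for rank, word in enumerate(words):
            weights[word] = 1.0 / math.log(2 + rank)
    return weights


toy_ranked = {
    "innovation": ["innovation", "creative", "novel"],
    "integrity": ["honest", "ethical", "trust"],
}
weights = simweight_sketch(toy_ranked)
```

Under this assumption the weights fall off slowly with rank: the first word in each list gets about 1.44, the second about 0.91, and the third about 0.72.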
read_dict_csv
Reads a dictionary CSV produced by write_dict_csv back into a
(dimension_to_words, all_words) tuple.
Read an expanded dictionary CSV.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | CSV produced by | required |
Returns:
| Type | Description |
|---|---|
| `tuple[dict[str, list[str]], set[str]]` | |
write_dict_csv
Writes an expanded dictionary to CSV with one column per dimension.
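
A round trip through a one-column-per-dimension CSV might look like the sketch below, written with the standard library only. The exact file layout is an assumption: a header row of dimension names, one word per cell, and shorter columns padded with empty cells. `write_columns` and `read_columns` are hypothetical stand-ins, not the exported functions.

```python
import csv
import tempfile
from itertools import zip_longest
from pathlib import Path


def write_columns(culture_dict, path):
    """Write one column per dimension; pad shorter columns with empty cells."""
    dims = list(culture_dict)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(dims)  # assumed header: dimension names
        for row in zip_longest(*(culture_dict[d] for d in dims), fillvalue=""):
            writer.writerow(row)


def read_columns(path):
    """Read the CSV back into (dimension_to_words, all_words)."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        dims = next(reader)
        dimension_to_words = {d: [] for d in dims}
        for row in reader:
            for dim, cell in zip(dims, row):
                if cell:  # skip padding cells
                    dimension_to_words[dim].append(cell)
    all_words = {w for words in dimension_to_words.values() for w in words}
    return dimension_to_words, all_words


toy_ranked = {
    "innovation": ["innovation", "creative", "novel"],
    "integrity": ["honest", "ethical"],
}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "expanded_dict.csv"
    write_columns(toy_ranked, csv_path)
    loaded, all_words = read_columns(csv_path)
```

Because each column keeps its row order, the rank order of every dimension survives the round trip, which matters for any weighting that depends on rank.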