Seeds and vocabulary utilities
This page covers the functions and constants used to load seed dictionaries,
access built-in example seeds, manage the SRAF stopword list, and install the
optional CoreNLP dependency. These utilities are all importable directly from
lmsy_w2v_rfs.
load_seeds
Load a seed dictionary from a plain Python dict, a .json file, or a text
file with one dimension per line.
The package is domain-agnostic: the seed dictionary is the only place where the user declares what concepts to measure. This helper accepts the three formats researchers typically have:
Python dict (pass through)::

    {"integrity": ["integrity", "ethic", "honest"],
     "teamwork": ["teamwork", "collaborate", "cooperate"]}

JSON file, flat (path ending in .json)::

    {
      "integrity": ["integrity", "ethic", "honest"],
      "teamwork": ["teamwork", "collaborate", "cooperate"]
    }

JSON file, wrapped (has a "seeds" key whose value is the dict).
Used by the bundled example files (seeds_culture.json), which carry
extra metadata alongside the seeds::

    {
      "_paper": "Li et al. 2021",
      "seeds": {
        "integrity": ["integrity", "ethic", "honest"],
        "teamwork": ["teamwork", "collaborate", "cooperate"]
      }
    }

Text file (anything else). One dimension per line, name and words
separated by a colon, words by whitespace or commas. Blank lines and
# comments are skipped::

    # my dimensions
    integrity: integrity ethic ethical honest
    teamwork: teamwork, collaboration, cooperate
    innovation: innovation innovate creative
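The text-file rules above (colon after the name, whitespace or commas between words, blank lines and # comments skipped) can be illustrated with a small standalone parser. This is a minimal sketch, not the package's implementation: the function name parse_seed_text and its error message are hypothetical, and treating only whole-line # comments as comments is an assumption.

```python
from __future__ import annotations


def parse_seed_text(text: str) -> dict[str, list[str]]:
    """Parse 'name: word word, word' lines into a seed dictionary (sketch)."""
    seeds: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments starting with '#'.
        if not line or line.startswith("#"):
            continue
        name, sep, words = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'name: words', got {raw!r}")
        # Commas and whitespace are interchangeable word separators.
        seeds[name.strip()] = words.replace(",", " ").split()
    return seeds


example = """\
# my dimensions
integrity: integrity ethic ethical honest
teamwork: teamwork, collaboration, cooperate

innovation: innovation innovate creative
"""
seeds = parse_seed_text(example)
print(seeds["teamwork"])  # ['teamwork', 'collaboration', 'cooperate']
```

Replacing commas with spaces before splitting lets both separator styles share a single code path.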
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| Path \| dict[str, list[str]] | A dict, or a path to a JSON or text file. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, list[str]] | Mapping of dimension name to seed word list. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If source is a path that does not exist. |
| ValueError | If the JSON file is not a valid seed mapping. |
| TypeError | If source is not a dict, str, or Path. |
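To make the flat versus wrapped JSON distinction and the ValueError case concrete, here is a minimal sketch of how a parsed JSON object could be normalised into a seed mapping. The helper name unwrap_seed_json and the exact validation rules are assumptions for illustration; the package's own checks may differ.

```python
from __future__ import annotations

import json


def unwrap_seed_json(obj: object) -> dict[str, list[str]]:
    """Return the seed mapping from a flat or wrapped JSON object (sketch)."""
    # Wrapped form: the mapping lives under "seeds"; metadata keys are ignored.
    if isinstance(obj, dict) and isinstance(obj.get("seeds"), dict):
        obj = obj["seeds"]
    if not isinstance(obj, dict):
        raise ValueError("seed JSON must be an object of name -> word list")
    for name, words in obj.items():
        if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
            raise ValueError(f"dimension {name!r} must map to a list of strings")
    # Copy the lists so callers cannot mutate the parsed input.
    return {name: list(words) for name, words in obj.items()}


flat = json.loads('{"integrity": ["integrity", "ethic", "honest"]}')
wrapped = json.loads(
    '{"_paper": "Li et al. 2021", "seeds": {"integrity": ["integrity", "ethic", "honest"]}}'
)
# Both layouts normalise to the same mapping.
assert unwrap_seed_json(flat) == unwrap_seed_json(wrapped)
```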
load_example_seeds
Load a named seed dictionary that is bundled with the package. One example
currently ships: "culture_2021", the five-dimension, 47-word dictionary
from Li, Mai, Shen, and Yan (2021).
The package itself is seed-agnostic. This helper is provided so users who want to reproduce a specific paper can opt in by name. The 2021 RFS paper's five-dimension culture dictionary is currently the only example.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Short identifier of the example. Currently: "culture_2021". | required |
Returns:
| Type | Description |
|---|---|
| dict[str, list[str]] | A fresh dict mapping dimension name to seed word list. |
Raises:
| Type | Description |
|---|---|
| KeyError | If name is not a known example. |
Example::

    from lmsy_w2v_rfs import Config, load_example_seeds

    seeds = load_example_seeds("culture_2021")
    cfg = Config(seeds=seeds, preprocessor="none")
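The return value is described as fresh. Assuming this means each call hands back an independent copy, callers can edit the result without affecting later calls. The sketch below illustrates that pattern with a hypothetical in-memory registry (_EXAMPLES and load_example_sketch are made-up names, and the "demo" entry is not the bundled dictionary); it is not the package's code.

```python
from __future__ import annotations

import copy

# Hypothetical stand-in for the bundled example files.
_EXAMPLES: dict[str, dict[str, list[str]]] = {
    "demo": {"integrity": ["integrity", "ethic", "honest"]},
}


def load_example_sketch(name: str) -> dict[str, list[str]]:
    """Return an independent copy of a registered example (sketch)."""
    if name not in _EXAMPLES:
        raise KeyError(f"unknown example {name!r}; available: {sorted(_EXAMPLES)}")
    # Deep copy so that edits to the returned lists never leak back.
    return copy.deepcopy(_EXAMPLES[name])


seeds = load_example_sketch("demo")
seeds["integrity"].append("transparent")  # local customisation
# A second call is unaffected by the edit above.
assert load_example_sketch("demo")["integrity"] == ["integrity", "ethic", "honest"]
```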
STOPWORDS_SRAF
A set of 121 generic stopwords drawn from the Loughran-McDonald
Software Repository for Accounting and Finance (SRAF) list. Passed to
Config as stopwords=STOPWORDS_SRAF (the default).
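As a rough illustration of what a stopword set is used for, the sketch below drops stopwords from a token list. How the package applies the list internally is not described here, so the helper drop_stopwords, the case-insensitive match, and the tiny STOPWORDS_DEMO set (a stand-in, not the actual SRAF contents) are all assumptions.

```python
from __future__ import annotations

# Hypothetical stand-in; the real STOPWORDS_SRAF has 121 entries.
STOPWORDS_DEMO = frozenset({"the", "of", "and", "to", "a"})


def drop_stopwords(tokens: list[str], stopwords: frozenset[str]) -> list[str]:
    """Keep tokens whose lowercase form is not in the stopword set (sketch)."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "The culture of integrity and teamwork".split()
print(drop_stopwords(tokens, STOPWORDS_DEMO))  # ['culture', 'integrity', 'teamwork']
```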
download_corenlp
One-call helper that installs Stanford CoreNLP into the local cache
directory. Requires the [corenlp] optional extra
(pip install "lmsy_w2v_rfs[corenlp]").