Seeds and vocabulary utilities

This page covers the functions and constants used to load seed dictionaries, access built-in example seeds, manage the SRAF stopword list, and install the optional CoreNLP dependency. These utilities are all importable directly from lmsy_w2v_rfs.

load_seeds

Read a seed dictionary from a plain Python dict, a .json file, or a text file with one dimension per line.

Load a seed dictionary from a dict, a JSON file, or a text file.

The package is domain-agnostic: the seed dictionary is the only place where the user declares what concepts to measure. This helper accepts the three formats researchers typically have:

Python dict (pass through)::

{"integrity": ["integrity", "ethic", "honest"],
 "teamwork":  ["teamwork", "collaborate", "cooperate"]}

JSON file, flat (path ending in .json)::

{
  "integrity": ["integrity", "ethic", "honest"],
  "teamwork":  ["teamwork", "collaborate", "cooperate"]
}

JSON file, wrapped (has a "seeds" key whose value is the dict). Used by the bundled example files (seeds_culture.json) which carry extra metadata alongside the seeds::

{
  "_paper": "Li et al. 2021",
  "seeds": {
    "integrity": ["integrity", "ethic", "honest"],
    "teamwork":  ["teamwork", "collaborate", "cooperate"]
  }
}

Text file (any path that does not end in .json). One dimension per line: the dimension name, a colon, then the seed words separated by whitespace or commas. Blank lines and # comments are skipped::

# my dimensions
integrity: integrity ethic ethical honest
teamwork:  teamwork, collaboration, cooperate
innovation: innovation innovate creative
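
The text format above can be illustrated with a small stand-alone parser. This is only a sketch of the rules as described, not the package's implementation: parse_seed_text is a hypothetical name, and raising ValueError on a line without a colon is an assumption, since the behavior for malformed lines is not documented here.

```python
import re


def parse_seed_text(text: str) -> dict[str, list[str]]:
    """Hypothetical stand-in for the text-file branch of load_seeds."""
    seeds: dict[str, list[str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon: name on the left, words on the right.
        name, sep, rest = line.partition(":")
        if not sep:
            # Assumption: a line without a colon is treated as an error.
            raise ValueError(f"expected 'name: words', got {raw_line!r}")
        # Words may be separated by whitespace, commas, or both.
        words = [w for w in re.split(r"[\s,]+", rest.strip()) if w]
        seeds[name.strip()] = words
    return seeds


example = """\
# my dimensions
integrity: integrity ethic ethical honest
teamwork:  teamwork, collaboration, cooperate
"""
parsed = parse_seed_text(example)
print(parsed["teamwork"])  # ['teamwork', 'collaboration', 'cooperate']
```

Splitting on the first colon only keeps the sketch tolerant of colons inside the word list, although whether the real loader allows that is not stated.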

Parameters:

source (str | Path | dict[str, list[str]], required): A dict, or a path to a seed file. A path ending in .json is read as JSON; any other path is read as a text file.

Returns:

dict[str, list[str]]: Mapping of dimension name to seed word list.

Raises:

FileNotFoundError: If source is a path that does not exist.

ValueError: If the JSON file is not a valid seed mapping.

TypeError: If source is None.
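
To make "a valid seed mapping" concrete, the following sketch accepts both the flat and the wrapped JSON layouts and rejects anything that is not a mapping of names to lists of strings. extract_seed_mapping is a hypothetical helper written for illustration; the checks the package actually performs, and its error messages, may differ.

```python
import json


def extract_seed_mapping(payload: object) -> dict[str, list[str]]:
    """Hypothetical helper: normalise flat or wrapped seed JSON."""
    # Wrapped layout: the mapping lives under the "seeds" key.
    if isinstance(payload, dict) and isinstance(payload.get("seeds"), dict):
        payload = payload["seeds"]
    if not isinstance(payload, dict):
        raise ValueError("seed JSON must map dimension names to word lists")
    for name, words in payload.items():
        ok = isinstance(words, list) and all(isinstance(w, str) for w in words)
        if not ok:
            raise ValueError(f"dimension {name!r} must map to a list of strings")
    # Return new lists so the caller does not share state with the payload.
    return {name: list(words) for name, words in payload.items()}


flat = json.loads('{"integrity": ["integrity", "ethic"]}')
wrapped = json.loads(
    '{"_paper": "Li et al. 2021", "seeds": {"integrity": ["integrity", "ethic"]}}'
)
# Both layouts normalise to the same mapping.
print(extract_seed_mapping(flat) == extract_seed_mapping(wrapped))  # True
```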

load_example_seeds

Retrieve a named seed dictionary that is bundled with the package. The package currently ships one example: "culture_2021", the 5-dimension, 47-word dictionary from Li, Mai, Shen, and Yan (2021).

Load a named example seed dictionary shipped with the package.

The package itself is seed-agnostic. This helper is provided so users who want to reproduce a specific paper can opt in by name. The 2021 RFS paper's five-dimension culture dictionary is currently the only example.

Parameters:

name (str, required): Short identifier of the example. Currently: "culture_2021" (Li, Mai, Shen, Yan 2021, RFS).

Returns:

dict[str, list[str]]: A fresh copy of the example seeds.

Raises:

KeyError: If name is not a known example.

Example::

from lmsy_w2v_rfs import Config, load_example_seeds

seeds = load_example_seeds("culture_2021")
cfg = Config(seeds=seeds, preprocessor="none")
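
Because the function returns a fresh copy, a caller can edit the returned lists without affecting later calls. The sketch below shows one way such a guarantee could be provided. The registry _EXAMPLES, the "demo" entry, and load_example_sketch are all hypothetical; how the package stores and copies its bundled seeds is not described on this page.

```python
import copy

# Hypothetical in-memory registry, used only for this illustration.
_EXAMPLES = {"demo": {"integrity": ["integrity", "ethic", "honest"]}}


def load_example_sketch(name: str) -> dict[str, list[str]]:
    if name not in _EXAMPLES:
        raise KeyError(f"unknown example {name!r}; available: {sorted(_EXAMPLES)}")
    # Deep copy so edits to the returned lists never reach the registry.
    return copy.deepcopy(_EXAMPLES[name])


mine = load_example_sketch("demo")
mine["integrity"].append("transparent")  # customise the local copy

again = load_example_sketch("demo")
print(again["integrity"])  # ['integrity', 'ethic', 'honest']
```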

STOPWORDS_SRAF

A set of 121 generic stopwords drawn from the Loughran-McDonald list published by Notre Dame's Software Repository for Accounting and Finance (SRAF). Pass it to Config as stopwords=STOPWORDS_SRAF; this is the default.

121-token generic stopword list from Notre Dame SRAF.
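
As a rough picture of what a stopword list does during preprocessing, the sketch below filters a token list against a small stand-in set. DEMO_STOPWORDS and drop_stopwords are hypothetical and are not the package's list or API; the lowercase comparison is also an assumption, since this page does not say how the package matches tokens.

```python
# Tiny stand-in set for illustration; the real STOPWORDS_SRAF has 121 entries.
DEMO_STOPWORDS = {"the", "and", "of", "to", "a"}


def drop_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    # Assumption: tokens are compared case-insensitively.
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "The board is committed to integrity and teamwork".split()
kept = drop_stopwords(tokens, DEMO_STOPWORDS)
print(kept)  # ['board', 'is', 'committed', 'integrity', 'teamwork']
```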

download_corenlp

One-call helper that installs Stanford CoreNLP into the local cache directory. Requires the [corenlp] optional extra (pip install "lmsy_w2v_rfs[corenlp]").

Install Stanford CoreNLP into the local cache directory.

Imported lazily so the base install does not require stanza. Requires the [corenlp] extra.
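
The lazy-import pattern mentioned above can be sketched as follows. require_optional and download_sketch are hypothetical names used only for illustration, and the error message wording is an assumption; the point is that the optional module is imported inside the function body, so importing the package itself never fails when the extra is missing.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency at call time, with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f'install it with: pip install "lmsy_w2v_rfs[{extra}]"'
        ) from exc


def download_sketch():
    # The import runs only when this function is called, so a base
    # install without stanza can still import the enclosing module.
    stanza = require_optional("stanza", "corenlp")
    return stanza
```

A caller without the extra installed would see the install hint only at the moment the helper is invoked.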