Seeds and vocabulary utilities

This page covers the functions and constants used to load seed dictionaries, access built-in example seeds, manage the SRAF stopword list, and install the optional CoreNLP dependency. These utilities are all importable directly from lmsy_w2v_rfs.

load_seeds

Load a seed dictionary from a plain Python dict, a .json file, or a .txt file with one dimension per line.

The package is domain-agnostic: the seed dictionary is the only place where the user declares what concepts to measure. This helper accepts the three formats researchers typically have:

Python dict (pass through)::

    {"integrity": ["integrity", "ethic", "honest"],
     "teamwork":  ["teamwork", "collaborate", "cooperate"]}

JSON file, flat (path ending in .json)::

    {
      "integrity": ["integrity", "ethic", "honest"],
      "teamwork":  ["teamwork", "collaborate", "cooperate"]
    }

JSON file, wrapped (has a "seeds" key whose value is the dict). Used by the bundled example files (seeds_culture.json) which carry extra metadata alongside the seeds::

    {
      "_paper": "Li et al. 2021",
      "seeds": {
        "integrity": ["integrity", "ethic", "honest"],
        "teamwork":  ["teamwork", "collaborate", "cooperate"]
      }
    }

Text file (anything else). One dimension per line, name and words separated by a colon, words by whitespace or commas. Blank lines and # comments are skipped::

    # my dimensions
    integrity: integrity ethic ethical honest
    teamwork:  teamwork, collaboration, cooperate
    innovation: innovation innovate creative
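The text format above can be illustrated with a small standalone parser. This is a minimal sketch and not the package's own implementation: `parse_seed_text` is a hypothetical name, and the sketch assumes that `#` comments occupy a whole line and that a line without a colon is an error.

```python
from __future__ import annotations

import re


def parse_seed_text(text: str) -> dict[str, list[str]]:
    """Hypothetical helper: parse 'name: word word, word' lines into a mapping."""
    seeds: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments (an assumption of this sketch).
        if not line or line.startswith("#"):
            continue
        name, sep, rest = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {raw!r}")
        # Words may be separated by whitespace, commas, or both.
        words = [w for w in re.split(r"[,\s]+", rest) if w]
        seeds[name.strip()] = words
    return seeds


SAMPLE = """\
# my dimensions
integrity: integrity ethic ethical honest
teamwork:  teamwork, collaboration, cooperate

innovation: innovation innovate creative
"""

parsed = parse_seed_text(SAMPLE)
```

The result has the same shape as the `dict[str, list[str]]` that `load_seeds` is documented to return, so the same content could equally be passed in as a dict.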

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `source` | `str \| Path \| dict[str, list[str]]` | A dict, or a path to a `.json` or `.txt` file. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, list[str]]` | Mapping of dimension name to seed word list. |

Raises:

| Type | Description |
| --- | --- |
| `FileNotFoundError` | If `source` is a path that does not exist. |
| `ValueError` | If the JSON file is not a valid seed mapping. |
| `TypeError` | If `source` is `None`. |
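The flat and wrapped JSON layouts, and the kind of check that could lead to a `ValueError`, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `unwrap_seed_json` is a hypothetical name, and treating the file as wrapped only when `"seeds"` maps to an object is a choice made for this sketch.

```python
from __future__ import annotations

import json


def unwrap_seed_json(text: str) -> dict[str, list[str]]:
    """Hypothetical helper: accept flat or wrapped seed JSON and validate it."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("seed JSON must be an object at the top level")
    # Wrapped layout: the mapping lives under "seeds"; other keys are metadata.
    # Assumption of this sketch: only an object-valued "seeds" counts as a wrapper.
    mapping = data["seeds"] if isinstance(data.get("seeds"), dict) else data
    for name, words in mapping.items():
        if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
            raise ValueError(f"dimension {name!r} must map to a list of strings")
    # Return fresh lists so callers cannot alias the parsed structure.
    return {name: list(words) for name, words in mapping.items()}


FLAT = '{"integrity": ["integrity", "ethic", "honest"]}'
WRAPPED = '{"_paper": "Li et al. 2021", "seeds": {"integrity": ["integrity", "ethic", "honest"]}}'

flat_seeds = unwrap_seed_json(FLAT)
wrapped_seeds = unwrap_seed_json(WRAPPED)
```

Both inputs yield the same mapping, which is the property that lets metadata such as `"_paper"` travel alongside the seeds without affecting them.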

load_example_seeds

Load a named example seed dictionary shipped with the package. The package itself is seed-agnostic; this helper lets users who want to reproduce a specific paper opt in by name. The only example currently bundled is "culture_2021", the five-dimension, 47-word culture dictionary from Li, Mai, Shen, and Yan (2021, RFS).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | Short identifier of the example. Currently: `"culture_2021"` (Li, Mai, Shen, Yan 2021, RFS). | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, list[str]]` | A fresh copy of the example seeds. |

Raises:

| Type | Description |
| --- | --- |
| `KeyError` | If `name` is not a known example. |

Example::

    from lmsy_w2v_rfs import Config, load_example_seeds

    seeds = load_example_seeds("culture_2021")
    cfg = Config(seeds=seeds, preprocessor="none")

STOPWORDS_SRAF

A generic list of 121 stopwords drawn from the Loughran-McDonald lists at Notre Dame's Software Repository for Accounting and Finance (SRAF). It is the default value of the stopwords argument of Config (stopwords=STOPWORDS_SRAF).
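As a rough illustration of what a generic stopword list is used for, the sketch below filters tokens against a small stand-in set. `DEMO_STOPWORDS` and `drop_stopwords` are hypothetical and contain none of the actual SRAF tokens; case-insensitive matching is an assumption of this sketch, not a documented behaviour of the package.

```python
from __future__ import annotations

from collections.abc import Collection, Iterable

# Stand-in set for illustration only; not the 121-token SRAF list.
DEMO_STOPWORDS = frozenset({"the", "and", "of", "a"})


def drop_stopwords(tokens: Iterable[str], stopwords: Collection[str]) -> list[str]:
    """Hypothetical helper: keep tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]


kept = drop_stopwords(["The", "culture", "of", "integrity"], DEMO_STOPWORDS)
```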

download_corenlp

One-call helper that installs Stanford CoreNLP into the local cache directory. It requires the [corenlp] optional extra (pip install "lmsy_w2v_rfs[corenlp]") and is imported lazily, so the base install does not require stanza.
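The lazy-import idea can be sketched generically: the optional dependency is imported only when the feature is called, and a missing module produces an error that names the extra to install. `require_optional` is a hypothetical helper written for this illustration; it is not how the package is known to implement `download_corenlp`.

```python
from __future__ import annotations

import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Hypothetical helper: import a module on demand or explain how to get it."""
    try:
        # Deferred import: nothing is loaded until the feature is actually used.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is not installed; "
            f'try: pip install "lmsy_w2v_rfs[{extra}]"'
        ) from exc


# A standard-library module is used here so the sketch runs anywhere.
json_module = require_optional("json", "corenlp")
```

Deferring the import this way keeps the base install light while still giving a clear message to users who call the optional feature without the extra.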