
Config

Config is a frozen dataclass that holds every hyperparameter the pipeline needs. It configures both construction phases: Phase 1 (parser-based MWE joining and NER masking, with an optional static MWE post-pass) and Phase 2 (gensim Phrases bigram and trigram learning). Field defaults mirror the RFS 2021 replication repo's global_options.py where applicable. Use Config.with_(...) to copy a config and override individual fields without rebuilding the whole object.

For the rationale behind the two-phase split see Two-phase preprocessing. For the per-backend trade-offs see Preprocessors.

Hyperparameters for the seed-expansion pipeline.

Two construction phases compose to build the training corpus:

  • Phase 1 (preprocessor + optional static MWE post-pass). Lemmatize, mask named entities as [NER:TYPE], and join multi-word expressions. Select the backend with the preprocessor field; optionally apply a curated static MWE list as a second pass with the mwe_list field.
  • Phase 2 (gensim Phrases). Learns corpus-specific bigrams and trigrams via co-occurrence statistics. On by default.
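
To make the Phase 2 bullet concrete, here is a minimal pure-Python sketch of co-occurrence-based phrase joining. It is not gensim's code and not this package's implementation: the score follows the formula documented for gensim's default scorer, (count(a, b) - min_count) / count(a) / count(b) * vocab_size, and the helper names phrase_scores, join_phrases, and learn_and_join are hypothetical, as is the toy corpus.

```python
from collections import Counter


def phrase_scores(sentences, min_count):
    """Score adjacent token pairs; higher means the pair co-occurs more than chance."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n in bigrams.items()
        if n >= min_count
    }


def join_phrases(sentence, scores, threshold):
    """Greedily join left-to-right pairs whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


def learn_and_join(sentences, passes=2, threshold=0.5, min_count=1):
    """Each pass re-learns scores on the output of the previous pass."""
    for _ in range(passes):
        scores = phrase_scores(sentences, min_count)
        sentences = [join_phrases(s, scores, threshold) for s in sentences]
    return sentences


# Toy corpus (illustrative only).
corpus = [
    ["new", "york", "city"],
    ["new", "york", "city"],
    ["city", "hall"],
    ["new", "idea"],
]
one_pass = learn_and_join(corpus, passes=1)   # joins "new_york" only
two_pass = learn_and_join(corpus, passes=2)   # second pass can extend it to a trigram
```

The second pass sees new_york as a single token, which is why running two passes can produce trigrams while one pass stops at bigrams.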

Seeds are required and have no default. Pass any mapping of dimension name to seed words; the package is theory-agnostic.
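
As an illustration of what "required, no default" means for a dataclass field, the sketch below uses a hypothetical stand-in class, SeedConfigSketch, rather than the package's Config. The seed words and the w2v_dim default are made up for the example; only the shape of the seeds mapping is taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedConfigSketch:  # hypothetical stand-in, not the package's Config
    seeds: dict[str, list[str]]  # no default, so it must be passed explicitly
    w2v_dim: int = 300           # illustrative default, not the package's value

    @property
    def dims(self) -> list[str]:
        # dicts preserve insertion order, so dims follow the order seeds were given in
        return list(self.seeds)


# Any mapping of dimension name to seed words works; these are illustrative.
seeds = {
    "innovation": ["innovation", "creativity"],
    "integrity": ["integrity", "ethics"],
}
cfg = SeedConfigSketch(seeds=seeds)

try:
    SeedConfigSketch()  # omitting the required field
    missing_seeds_error = None
except TypeError as exc:
    missing_seeds_error = exc
```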

Attributes:

Name Type Description
seeds dict[str, list[str]]

Mapping of dimension name to seed word list. Required.

stopwords set[str]

Lowercased stopwords removed during cleaning.

preprocessor PreprocessorName

Backend to use for Phase 1. One of:

  • corenlp (default): CoreNLP server via stanza.server; needs Java and the [corenlp] extra; paper-exact.
  • spacy: faster on modern hardware; needs pip install spacy.
  • stanza: stanza.Pipeline; Python-native.
  • static: NLTK MWETokenizer with the packaged list; Java-free.
  • none: whitespace split only; Java-free; intended for already-tokenized input.

mwe_list str | Path | None

Curated MWE list applied AFTER the main preprocessor as a second pass. Pass "finance" for the packaged list, a path for your own list, or None (default) to skip this pass.

spacy_model str

Name of the spaCy model when preprocessor="spacy".

corenlp_memory str

JVM heap for the CoreNLP server.

corenlp_port int

TCP port the CoreNLP server listens on.

corenlp_timeout_ms int

Per-request CoreNLP timeout, in milliseconds.

n_cores int

Parallel workers for parsing and training.

use_gensim_phrases bool

Whether to run gensim Phrases.

phrase_passes int

Number of gensim Phrases passes: 1 learns bigrams only, 2 learns bigrams and then trigrams.

phrase_threshold float

gensim Phrases score threshold.

phrase_min_count int

Minimum count a bigram needs before gensim Phrases considers joining it.

w2v_dim int

Word2Vec vector dimension.

w2v_window int

Word2Vec context window.

w2v_min_count int

Word2Vec minimum token count.

w2v_epochs int

Word2Vec training epochs.

n_words_dim int

Top-k expanded words per dimension.

dict_restrict_vocab float | None

Restrict expansion to this top fraction of the vocabulary; None applies no restriction.

min_similarity float

Discard expansion candidates whose cosine similarity falls below this value.

tfidf_normalize bool

L2-normalize the tf-idf vector per document.

zca_whiten bool

Apply ZCA whitening to the dimension columns.

zca_epsilon float

Numerical stabilizer for ZCA.

random_state int

Random seed for Word2Vec training.
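
For readers unfamiliar with zca_whiten and zca_epsilon, the textbook form of ZCA whitening is written out below. This is the standard definition, not taken from the package source; the symbols X (a documents-by-dimensions score matrix with n rows), the column-mean vector, and the 1/n covariance convention are assumptions made for illustration.

```latex
\begin{aligned}
X_c &= X - \mathbf{1}\,\bar{x}^{\top}, \\
\Sigma &= \tfrac{1}{n}\, X_c^{\top} X_c = U \Lambda U^{\top}, \\
W_{\mathrm{ZCA}} &= U \,(\Lambda + \varepsilon I)^{-1/2}\, U^{\top}, \\
X_{\mathrm{white}} &= X_c\, W_{\mathrm{ZCA}} .
\end{aligned}
```

Here the stabilizer plays the role of zca_epsilon: it keeps the inverse square root finite when an eigenvalue in the diagonal matrix is close to zero, which happens when dimension columns are nearly collinear.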

dims: list[str] property

Dimension names in insertion order.

with_(**kwargs: Any) -> Config

Return a copy with the given fields overridden.

Parameters:

Name Type Description Default
**kwargs Any

Fields to override.

{}

Returns:

Type Description
Config

A new Config.
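
A copy-and-override method on a frozen dataclass can be sketched with dataclasses.replace from the standard library. ConfigSketch below is a hypothetical stand-in with illustrative defaults; whether the package implements with_ this way is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Any


@dataclass(frozen=True)
class ConfigSketch:  # hypothetical stand-in, not the package's Config
    phrase_passes: int = 2       # illustrative defaults
    min_similarity: float = 0.0

    def with_(self, **kwargs: Any) -> "ConfigSketch":
        # replace() builds a new instance, so the original stays untouched
        return replace(self, **kwargs)


base = ConfigSketch()
tuned = base.with_(phrase_passes=1)

try:
    base.phrase_passes = 1  # direct mutation is rejected on a frozen dataclass
    mutation_error = None
except FrozenInstanceError as exc:
    mutation_error = exc
```

With this approach, passing a name that is not a field raises TypeError, so a misspelled override fails loudly instead of being ignored.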

PreprocessorName

The string literal type accepted by Config.preprocessor. Valid values are "none", "static", "stanza", "corenlp", and "spacy".
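
A string literal type of this kind can be declared and checked at runtime as sketched below. PreprocessorNameSketch and check_preprocessor are hypothetical names; only the five values come from the description above.

```python
from typing import Literal, get_args

# Hypothetical re-declaration; the values are the five listed above.
PreprocessorNameSketch = Literal["none", "static", "stanza", "corenlp", "spacy"]


def check_preprocessor(name: str) -> str:
    """Return name unchanged if it is a valid backend, else raise ValueError."""
    valid = get_args(PreprocessorNameSketch)
    if name not in valid:
        raise ValueError(f"preprocessor must be one of {valid}, got {name!r}")
    return name
```

A static type checker flags an invalid literal before the code runs; the runtime check covers values that arrive from a config file or the command line.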


default_cache_dir

Returns the on-disk cache root used by download_corenlp and by the CoreNLP preprocessor's auto-install fallback. Honours the LMSY_W2V_RFS_HOME environment variable.

Return the default on-disk cache root for CoreNLP and phrase models.

Returns:

Type Description
Path

$LMSY_W2V_RFS_HOME if set, otherwise ~/.cache/lmsy_w2v_rfs.
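
The lookup described above can be sketched with the standard library alone. cache_root_sketch is a hypothetical re-implementation, not the package's function; treating an empty environment variable as unset is an assumption.

```python
import os
from pathlib import Path


def cache_root_sketch() -> Path:
    """Return $LMSY_W2V_RFS_HOME if set, otherwise ~/.cache/lmsy_w2v_rfs."""
    override = os.environ.get("LMSY_W2V_RFS_HOME")
    if override:  # assumption: an empty string counts as unset
        return Path(override)
    return Path.home() / ".cache" / "lmsy_w2v_rfs"
```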