Config

Config is a frozen dataclass that holds every hyperparameter the pipeline needs. It composes the two construction phases: Phase 1 (parser-based MWE joining and NER masking, with an optional static MWE post-pass) and Phase 2 (gensim Phrases bigram / trigram learning). Field defaults mirror the RFS 2021 replication repo's global_options.py where applicable. Use Config.with_(...) to copy-and-override individual fields without rebuilding the whole object.

For the rationale behind the two-phase split see Two-phase preprocessing. For the per-backend trade-offs see Preprocessors.

Hyperparameters for the seed-expansion pipeline.

Two construction phases compose to build the training corpus:

Phase 1 (preprocessor + optional static MWE post-pass). Lemmatizes text, masks named entities as [NER:TYPE], and joins multi-word expressions (MWEs). Select the backend with the preprocessor field; optionally set mwe_list to apply a curated static MWE list as a second pass.
Phase 2 (gensim Phrases). Learns corpus-specific bigrams and trigrams from co-occurrence statistics. On by default.

Seeds are required and have no default. Pass any mapping of dimension name to seed words; the package is theory-agnostic.

Attributes:

Name	Type	Description
`seeds`	`dict[str, list[str]]`	Mapping of dimension name to seed word list. Required.
`stopwords`	`set[str]`	Lowercased stopwords removed during cleaning.
`preprocessor`	`PreprocessorName`	Backend to use for Phase 1. One of: `none` (default; whitespace split and lowercasing only, with no extra dependencies, so a bare `pip install` runs out of the box); `spacy` (recommended for richer parsing: lemmatization, NER masking, dependency MWEs; needs the `[spacy]` extra and a model); `corenlp` (paper-exact reproduction; CoreNLP server via stanza.server; needs Java and the `[corenlp]` extra); `stanza` (stanza.Pipeline; Python-native, slowest); `static` (NLTK MWETokenizer with a curated list; Java-free).
`mwe_list`	`str \| Path \| None`	Curated MWE list applied AFTER the main preprocessor as a second pass. `"finance"` for the packaged list, a path for your own, or `None` (default) to skip.
`spacy_model`	`str`	Name of the spaCy model when `preprocessor="spacy"`.
`corenlp_memory`	`str`	JVM heap for the CoreNLP server.
`corenlp_port`	`int`	TCP port the CoreNLP server listens on.
`corenlp_timeout_ms`	`int`	Per-request CoreNLP timeout.
`corenlp_max_char_length`	`int`	Max characters per document the CoreNLP server accepts. Raise for very long transcripts.
`corenlp_properties`	`dict[str, Any]`	Extra CoreNLP server properties merged on top of (and able to override) the pinned reproducibility defaults.
`parse_chunk_size`	`int`	If > 0, documents are preprocessed in batches of this size to cap peak memory on large corpora; `0` processes all documents in a single batch.
`n_cores`	`int`	Parallel workers for parsing and training.
`use_gensim_phrases`	`bool`	Whether to run gensim Phrases.
`phrase_passes`	`int`	Number of phrase-learning passes: `1` learns bigrams only, `2` learns bigrams and trigrams.
`phrase_threshold`	`float`	gensim Phrases score threshold.
`phrase_min_count`	`int`	Minimum bigram count.
`w2v_dim`	`int`	Word2Vec vector dimension.
`w2v_window`	`int`	Word2Vec context window.
`w2v_min_count`	`int`	Word2Vec minimum token count.
`w2v_epochs`	`int`	Word2Vec training epochs.
`w2v_sg`	`int`	Word2Vec training algorithm: `0` for CBOW (the default, matching the original Li et al. 2021 implementation, which used gensim's default; the paper does not specify the architecture), `1` for skip-gram.
`w2v_extra`	`dict[str, Any]`	Extra keyword arguments forwarded verbatim to `gensim.models.Word2Vec` (e.g. `negative`, `hs`, `sample`, `ns_exponent`). Keys here override the named fields.
`phrase_extra`	`dict[str, Any]`	Extra keyword arguments forwarded to `gensim.models.Phrases` (e.g. `scoring="npmi"`).
`n_words_dim`	`int`	Top-k expanded words per dimension.
`dict_restrict_vocab`	`float \| None`	If set, restrict expansion candidates to this top fraction of the vocabulary; `None` applies no restriction.
`min_similarity`	`float`	Discard expansion candidates whose cosine similarity falls below this value.
`tfidf_normalize`	`bool`	L2-normalize the tf-idf vector per document.
`zca_whiten`	`bool`	Apply ZCA whitening to the dimension columns.
`zca_epsilon`	`float`	Numerical stabilizer for ZCA.
`random_state`	`int`	Random seed for Word2Vec training.
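
The override rule for `w2v_extra` ("keys here override the named fields") can be pictured as a plain dictionary merge. The sketch below is only an illustration under stated assumptions: `build_w2v_kwargs` is a hypothetical helper, not part of the package, and the mapping of each field to a `gensim.models.Word2Vec` keyword (`vector_size`, `window`, `min_count`, `epochs`, `sg`, `seed`, `workers`) is assumed rather than taken from the source. The numeric values are illustrative, not the package defaults.

```python
from typing import Any


def build_w2v_kwargs(
    w2v_dim: int,
    w2v_window: int,
    w2v_min_count: int,
    w2v_epochs: int,
    w2v_sg: int,
    random_state: int,
    n_cores: int,
    w2v_extra: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper: merge named fields with extra keyword arguments."""
    # Assumed mapping from Config field names to gensim Word2Vec keyword names.
    kwargs: dict[str, Any] = {
        "vector_size": w2v_dim,
        "window": w2v_window,
        "min_count": w2v_min_count,
        "epochs": w2v_epochs,
        "sg": w2v_sg,
        "seed": random_state,
        "workers": n_cores,
    }
    # Extra keys are applied last, so they win over the named fields.
    kwargs.update(w2v_extra)
    return kwargs


# Illustrative values only. "sg" appears both as a named field and as an extra key.
merged = build_w2v_kwargs(300, 5, 5, 20, 0, 42, 4, {"negative": 10, "sg": 1})
print(merged["sg"], merged["negative"])
```

In this sketch the extra `sg=1` replaces the named `w2v_sg=0`, while `negative` passes through untouched because no named field covers it.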

`dims: list[str]` `property`

Dimension names in insertion order.
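
A minimal sketch of what "insertion order" means here, assuming `dims` simply lists the keys of `seeds`. `SeedsOnly` is a hypothetical stand-in written for this example, not the package's `Config`, and the seed words are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedsOnly:
    """Hypothetical stand-in that holds only the seeds mapping."""

    seeds: dict[str, list[str]]

    @property
    def dims(self) -> list[str]:
        # Python dicts preserve insertion order, so the keys come back
        # in the order the dimensions were written in the mapping.
        return list(self.seeds)


cfg = SeedsOnly(
    seeds={
        "innovation": ["innovate", "creative"],
        "quality": ["reliable", "durable"],
    }
)
names = cfg.dims
print(names)
```

Under this assumption, reordering the keys of `seeds` reorders `dims` in the same way.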

`with_(**kwargs: Any) -> Config`

Return a copy with the given fields overridden.

Parameters:

Name	Type	Description	Default
`**kwargs`	`Any`	Fields to override.	`{}`

Returns:

Type	Description
`Config`	A new `Config`.
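
The copy-and-override pattern on a frozen dataclass can be sketched with the standard library alone. `MiniConfig` below is a hypothetical stand-in with three fields, and implementing `with_` through `dataclasses.replace` is an assumption about one reasonable approach, not a statement of how the package does it. The defaults shown are illustrative.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Any


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in for a frozen config object."""

    seeds: dict[str, list[str]]
    w2v_dim: int = 100  # illustrative default, not the package's
    w2v_window: int = 5  # illustrative default, not the package's

    def with_(self, **kwargs: Any) -> "MiniConfig":
        # replace() builds a new instance and leaves the original untouched.
        return replace(self, **kwargs)


base = MiniConfig(seeds={"quality": ["reliable"]})
wide = base.with_(w2v_window=10)

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.w2v_window = 10  # type: ignore[misc]
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

print(base.w2v_window, wide.w2v_window, mutation_blocked)
```

In this sketch an unknown field name passed to `with_` raises `TypeError`, because `replace` forwards the overrides to the class constructor.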

PreprocessorName

The string literal type accepted by Config.preprocessor. Valid values are "none", "static", "stanza", "corenlp", and "spacy".

Valid values for Config.preprocessor.
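
A string literal type like this is commonly written with `typing.Literal`. The sketch below assumes that form and adds a hypothetical runtime check, `check_preprocessor`, which is not part of the package. It shows how the five listed values could be validated before a backend is selected.

```python
from typing import Literal, get_args

# Assumed definition, reconstructed from the five values listed above.
PreprocessorName = Literal["none", "static", "stanza", "corenlp", "spacy"]


def check_preprocessor(name: str) -> str:
    """Hypothetical validator: reject names outside the literal type."""
    allowed = get_args(PreprocessorName)
    if name not in allowed:
        raise ValueError(
            f"unknown preprocessor {name!r}; expected one of {allowed}"
        )
    return name


print(check_preprocessor("spacy"))
```

Static type checkers flag an invalid literal at call sites annotated with `PreprocessorName`; the runtime check covers values that arrive as plain strings, for example from a configuration file.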

default_cache_dir

Returns the on-disk cache root used by download_corenlp and by the CoreNLP preprocessor's auto-install fallback. Honours the LMSY_W2V_RFS_HOME environment variable.

Return the default on-disk cache root for CoreNLP and phrase models.
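
The environment-variable override can be sketched as follows. `cache_root_sketch` is a hypothetical function written for this example, and the fallback directory `~/.cache/lmsy_w2v_rfs` is an assumption: this page states only that `LMSY_W2V_RFS_HOME` is honoured, not where the cache lives when the variable is unset.

```python
import os
from pathlib import Path


def cache_root_sketch() -> Path:
    """Hypothetical illustration of an env-var-aware cache root."""
    override = os.environ.get("LMSY_W2V_RFS_HOME")
    if override:
        # An explicit override wins over any built-in default.
        return Path(override).expanduser()
    # Assumed fallback location; the package's real default is not stated here.
    return Path.home() / ".cache" / "lmsy_w2v_rfs"


os.environ["LMSY_W2V_RFS_HOME"] = "/tmp/lmsy-cache"
print(cache_root_sketch())
```

Setting the variable before the first call is the safest way to redirect the cache, for example to a scratch disk on a shared machine.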