Config
Config is a frozen dataclass that holds every hyperparameter the pipeline
needs. It composes the two construction phases: Phase 1 (parser-based MWE
joining and NER masking, with an optional static MWE post-pass) and Phase 2
(gensim Phrases bigram / trigram learning). Field defaults mirror the RFS
2021 replication repo's global_options.py where applicable. Use Config.with_(...)
to copy-and-override individual fields without rebuilding the whole object.
For the rationale behind the two-phase split, see Two-phase preprocessing. For the per-backend trade-offs, see Preprocessors.
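The copy-and-override pattern can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the package's Config: `MiniConfig`, its three fields, and the default of 300 for `w2v_dim` are assumptions chosen for illustration, and the real `with_` may be implemented differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in for Config; only three fields are modelled."""

    seeds: dict                      # required, no default
    w2v_dim: int = 300               # placeholder default, not the package's value
    use_gensim_phrases: bool = True  # Phase 2 is described as on by default

    def with_(self, **overrides):
        # dataclasses.replace builds a new instance and leaves self untouched,
        # which is what makes copy-and-override safe on a frozen dataclass.
        return replace(self, **overrides)


base = MiniConfig(seeds={"innovation": ["innovation", "creativity"]})
small = base.with_(w2v_dim=100)

print(base.w2v_dim, small.w2v_dim)
```

Because the dataclass is frozen, assigning `base.w2v_dim = 50` raises an error; deriving a modified copy is the only way to change a field.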
Hyperparameters for the seed-expansion pipeline.
Two construction phases compose to build the training corpus:
- Phase 1 (preprocessor + optional static MWE post-pass). Lemmatize, mask named entities as `[NER:TYPE]`, join multi-word expressions. Pick a preprocessor with `preprocessor`; optionally apply a curated static MWE list as a second pass with `mwe_list`.
- Phase 2 (gensim Phrases). Learns corpus-specific bigrams and trigrams via co-occurrence statistics. On by default.
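To make "co-occurrence statistics" concrete, here is a minimal pure-Python sketch of the kind of score gensim's default Phrases scorer computes. It is an illustration under assumptions, not the pipeline's code: the toy corpus, the `min_count` and `threshold` values, and the use of the unigram count as the vocabulary size are all simplifications (gensim's own vocabulary also counts bigram entries).

```python
from collections import Counter


def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score in the style of gensim's default scorer for a bigram (a, b)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Toy corpus, already tokenized (assumption: Phase 1 output looks like this).
sentences = [
    ["machine", "learning", "improves", "forecasts"],
    ["machine", "learning", "needs", "data"],
    ["firms", "use", "machine", "learning"],
    ["firms", "need", "data"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

min_count, threshold = 1, 1.0  # illustrative values, not the package defaults
scores = {
    pair: default_phrase_score(
        unigrams[pair[0]], unigrams[pair[1]], n, len(unigrams), min_count
    )
    for pair, n in bigrams.items()
}

# A bigram is joined only when its score exceeds the threshold.
joined = sorted("_".join(pair) for pair, s in scores.items() if s > threshold)
print(joined)
```

A second pass over the joined corpus applies the same scoring to pairs such as (`machine_learning`, `model`), which is how trigrams arise.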
Seeds are required and have no default. Pass any mapping of dimension name to seed words; the package is theory-agnostic.
Attributes:
| Name | Type | Description |
|---|---|---|
| `seeds` | `dict[str, list[str]]` | Mapping of dimension name to seed word list. Required. |
| `stopwords` | `set[str]` | Lowercased stopwords removed during cleaning. |
| `preprocessor` | `PreprocessorName` | Backend to use for Phase 1. One of `"none"`, `"static"`, `"stanza"`, `"corenlp"`, or `"spacy"`. |
| `mwe_list` | `str \| Path \| None` | Curated MWE list applied AFTER the main preprocessor as a second pass. |
| `spacy_model` | `str` | Name of the spaCy model when `preprocessor` is `"spacy"`. |
| `corenlp_memory` | `str` | JVM heap for the CoreNLP server. |
| `corenlp_port` | `int` | TCP port the CoreNLP server listens on. |
| `corenlp_timeout_ms` | `int` | Per-request CoreNLP timeout, in milliseconds. |
| `n_cores` | `int` | Parallel workers for parsing and training. |
| `use_gensim_phrases` | `bool` | Whether to run gensim Phrases (Phase 2). |
| `phrase_passes` | `int` | Number of phrase passes (1 = bigrams only, 2 = bigrams and trigrams). |
| `phrase_threshold` | `float` | gensim Phrases score threshold. |
| `phrase_min_count` | `int` | Minimum bigram count. |
| `w2v_dim` | `int` | Word2Vec vector dimension. |
| `w2v_window` | `int` | Word2Vec context window. |
| `w2v_min_count` | `int` | Word2Vec minimum token count. |
| `w2v_epochs` | `int` | Word2Vec training epochs. |
| `n_words_dim` | `int` | Top-k expanded words per dimension. |
| `dict_restrict_vocab` | `float \| None` | Restrict expansion to the top fraction of vocab. |
| `min_similarity` | `float` | Discard expansion candidates below this cosine similarity. |
| `tfidf_normalize` | `bool` | L2-normalize the tf-idf vector per document. |
| `zca_whiten` | `bool` | Apply ZCA whitening to the dimension columns. |
| `zca_epsilon` | `float` | Numerical stabilizer for ZCA. |
| `random_state` | `int` | Random seed for Word2Vec. |
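The interaction between `n_words_dim` and `min_similarity` can be sketched as follows. This is a toy illustration under stated assumptions, not the package's expansion routine: the `expand` function, the averaging of seed vectors into a centroid, the exclusion of seeds from the candidates, and the 2-D vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seed_words, vectors, n_words_dim, min_similarity):
    """Rank non-seed words by cosine similarity to the mean seed vector."""
    present = [w for w in seed_words if w in vectors]
    if not present:
        return []
    dim = len(vectors[present[0]])
    centroid = [sum(vectors[w][i] for w in present) / len(present) for i in range(dim)]
    scored = [(w, cosine(vec, centroid)) for w, vec in vectors.items() if w not in present]
    # Drop candidates below the cosine floor, then keep the top k.
    kept = [(w, s) for w, s in scored if s >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:n_words_dim]


# Made-up 2-D embeddings; real Word2Vec vectors have w2v_dim components.
vectors = {
    "innovation": [1.0, 0.0],
    "creativity": [0.9, 0.1],
    "invent": [0.8, 0.2],
    "patent": [0.7, 0.6],
    "dividend": [0.0, 1.0],
    "audit": [-0.5, 0.5],
}

expanded = expand(["innovation", "creativity"], vectors, n_words_dim=5, min_similarity=0.5)
print([word for word, _ in expanded])
```

In this sketch the similarity floor is applied before the top-k cut, so a dimension can end up with fewer than `n_words_dim` words when few candidates clear the floor.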
PreprocessorName
The string literal type accepted by Config.preprocessor. Valid values are
"none", "static", "stanza", "corenlp", and "spacy".
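A string literal type of this kind is commonly written with `typing.Literal`. The sketch below assumes that is how PreprocessorName is defined; the `check_preprocessor` helper is hypothetical and only shows how the allowed values could be validated at runtime, since `Literal` by itself is enforced only by static type checkers.

```python
from typing import Literal, get_args

# Assumed definition; the five values are the ones listed above.
PreprocessorName = Literal["none", "static", "stanza", "corenlp", "spacy"]


def check_preprocessor(name: str) -> str:
    """Hypothetical runtime guard for a preprocessor name."""
    allowed = get_args(PreprocessorName)
    if name not in allowed:
        raise ValueError(f"unknown preprocessor {name!r}; expected one of {allowed}")
    return name


print(check_preprocessor("spacy"))
```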
default_cache_dir
Returns the on-disk cache root used by download_corenlp and by the CoreNLP
preprocessor's auto-install fallback. Honours the LMSY_W2V_RFS_HOME
environment variable.
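An environment-variable override of this kind usually follows the pattern below. This is a sketch only: `cache_root` is a hypothetical helper, and the fallback directory is a placeholder, because the page does not state which default location default_cache_dir uses when LMSY_W2V_RFS_HOME is unset.

```python
import os
from pathlib import Path


def cache_root(env_var="LMSY_W2V_RFS_HOME", fallback=None):
    """Return the override directory if the variable is set, else a fallback."""
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()
    if fallback is not None:
        return Path(fallback)
    # Placeholder default; the real location is not documented on this page.
    return Path.home() / ".cache" / "example-pkg"


os.environ["LMSY_W2V_RFS_HOME"] = "/tmp/w2v-cache"
print(cache_root())
```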