Skip to content

Two-phase preprocessing

lmsy_w2v_rfs constructs multi-word expressions (MWEs) in two phases before Word2Vec sees a single token. Each phase catches a different class of MWE, and dropping either one measurably degrades the trained vocabulary.


The flow

flowchart LR
    RAW[Raw documents] --> P1A["Phase 1a: parser-based\nlemmatize, NER mask,\nUD MWE join"]
    P1A --> P1B["Phase 1b (optional):\nstatic MWE list\npost-pass"]
    P1B --> CLEAN["Clean:\nlowercase, drop punctuation,\ndrop SRAF stopwords"]
    CLEAN --> P2["Phase 2: gensim Phrases\nbigram pass, trigram pass\n(statistical)"]
    P2 --> W2V[Word2Vec input]

    style P1A fill:#e8f4f8,stroke:#2c7a96
    style P2 fill:#fef3e8,stroke:#c16d19

Phase 1a: parser-based, syntactic

The configured parser tokenizes, lemmatizes, tags named entities, and joins tokens linked by Universal Dependencies v2 labels fixed, flat, compound, and compound:prt. Five backends are available through Config.preprocessor:

value Needs Strength
"corenlp" (default) [corenlp] extra and Java 8+ Paper-exact; 76% syntactic MWE recall; best JVM thread scaling
"spacy" [spacy] extra and a model Fastest parser; best NER; 0% fixed or compound:prt recall
"stanza" [stanza] extra Python-native; 57% syntactic MWE recall; slowest on CPU
"static" nltk only Deterministic curated-list pass; no parser
"none" nothing Whitespace tokenize only

Lemmatization is not optional for the 2021 seed-matching logic: the seed integrity needs to match surface forms integrities and integrated, which only a lemmatizer resolves. NER masking replaces proper nouns with [NER:TYPE] placeholders so firm names like Apple cannot be promoted into a culture dictionary. preprocessor="none" skips both and should only be used when the input is already lemmatized.

Phase 1b: optional static MWE list

After the main preprocessor runs, a curated MWE list (Config.mwe_list) can join anything the parser missed. The packaged "finance" list is a hand-curated 200-entry file mixing UD fixed prepositional phrases, earnings-call jargon, and the RFS 2021 dictionary appendix. It is an example, not a default.

Phase 2: statistical, gensim Phrases

After cleaning, gensim's Phrases runs one bigram pass and (by default) one trigram pass on the corpus itself. It learns high-frequency co-occurrences that no parser will flag, because they are collocations rather than grammatical units.


What each phase catches

The two phases are complementary because they rely on different signals:

MWE Caught by Why
customer_commitment Phase 1a UD compound between two nouns
with_respect_to Phase 1a UD fixed prepositional phrase
roll_out Phase 1a UD compound:prt phrasal verb
forward_looking_statement Phase 2 High-frequency collocation, no UD label
fourth_quarter Phase 2 Domain collocation, no UD label

Phase 1a is grammar-driven and catches syntactic patterns that appear once or twice in the corpus. Phase 2 is frequency-driven and catches idiomatic phrasings that occur often enough to dominate their constituent words' co-occurrence statistics. Both run by default. The full benchmark behind this design, including NER quality and throughput numbers, lives in Preprocessor comparison.