Command-line interface

The package ships an lmsy-w2v-rfs console script with two subcommands:

run: end-to-end pipeline from documents on disk to scores on disk.
download-corenlp: one-time Stanford CoreNLP install.

For an end-to-end walkthrough, see Run from the command line.

`run`

Reads documents from --input, then runs the full pipeline: parse, clean, train Word2Vec, expand the seed dictionary, and score every document on every dimension. All artifacts are written under --out.

Exits with code 0 on success and 2 on argparse usage errors (argparse's standard exit status). Uncaught Python exceptions propagate with their native traceback and a non-zero exit code.
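
When the CLI is driven from another script, the exit code is the simplest signal to check. The sketch below is a minimal illustration, not part of the package: `run_cli` is a hypothetical helper, and the two stand-in commands use the current Python interpreter so the example runs without lmsy-w2v-rfs installed.

```python
import subprocess
import sys
from typing import Sequence


def run_cli(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(list(cmd), check=False)
    if completed.returncode != 0:
        # Any non-zero status (usage error or uncaught exception) is a failure.
        print(f"command failed with exit code {completed.returncode}", file=sys.stderr)
    return completed.returncode == 0


# Stand-in commands (assumption: used here only so the sketch is runnable).
ok = run_cli([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_cli([sys.executable, "-c", "raise SystemExit(3)"])
```

In real use the argument list would start with `"lmsy-w2v-rfs", "run"` followed by the flags documented below.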

Input flags

Flag	Type	Default	Notes
`--input`, `-i`	path	required	Input path; interpretation set by `--input-format`.
`--input-format`	choice	`text`	One of `text`, `csv`, `jsonl`, `directory`.
`--ids`	path	`None`	[text] Optional matching IDs file, one ID per line. If omitted, line numbers are used.
`--text-col`	str	`text`	[csv] Column holding document text.
`--id-col`	str	`id`	[csv] Column holding document IDs. Set to `""` to use the DataFrame row index.
`--text-key`	str	`text`	[jsonl] JSON key holding document text.
`--id-key`	str	`id`	[jsonl] JSON key holding document IDs. Set to `""` to use 1-based line numbers.
`--glob-pattern`	str	`*.txt`	[directory] Glob for files under `--input`. Document IDs come from file stems.
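
As an illustration of the `jsonl` input format, the sketch below writes one JSON object per line using the default `text` and `id` keys from the table above. `write_jsonl` is a hypothetical helper and the two documents are made up; neither comes from the package.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(docs: dict[str, str], path: Path) -> None:
    """Write one JSON object per line with `id` and `text` keys."""
    with path.open("w", encoding="utf-8") as fh:
        for doc_id, text in docs.items():
            fh.write(json.dumps({"id": doc_id, "text": text}) + "\n")


# Made-up documents, for illustration only.
docs = {
    "call_001": "We value integrity.",
    "call_002": "Innovation drives growth.",
}
out_path = Path(tempfile.mkdtemp()) / "dump.jsonl"
write_jsonl(docs, out_path)
lines = out_path.read_text(encoding="utf-8").splitlines()
```

A file written this way should be readable with `--input-format jsonl` and no `--text-key` or `--id-key` overrides, since it uses the default key names.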

Output flags

Flag	Type	Default	Notes
`--out`, `-o`	path	`runs/out`	Output directory. Stages are idempotent: rerunning skips stages whose outputs already exist.
`--force`	flag	off	Rerun every stage regardless of existing outputs.
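
The skip-or-rerun behavior can be pictured as a simple existence check. This is a sketch of the idea only, under the assumption that a stage counts as done when all of its expected output files exist; `should_run` is a hypothetical helper, not the package's implementation.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def should_run(outputs: Iterable[Path], force: bool = False) -> bool:
    """Return True if a stage needs to run.

    A stage is skipped only when every expected output exists
    and `force` is off.
    """
    if force:
        return True
    return not all(p.exists() for p in outputs)


stage_dir = Path(tempfile.mkdtemp())
outputs = [stage_dir / "sentences.txt", stage_dir / "sentence_ids.txt"]

first_run = should_run(outputs)               # nothing on disk yet
for p in outputs:
    p.write_text("placeholder\n", encoding="utf-8")
second_run = should_run(outputs)              # outputs exist, so skip
forced_run = should_run(outputs, force=True)  # mirrors --force
```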

Seeds flag

Flag	Type	Default	Notes
`--seeds`	path	required	Seed dictionary file (`.json` or `.txt`). The package is theory-agnostic and has no built-in default. To reproduce the 2021 paper, dump `load_example_seeds("culture_2021")` to a JSON file. See Use your own seed dictionary.
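
Dumping a seed dictionary to JSON could look like the sketch below. It rests on two assumptions not confirmed here: that the seed dictionary is a mapping from dimension name to a list of seed words, and that the `.json` seeds file uses that same shape. `dump_seeds` is a hypothetical helper and the dictionary contents are placeholders.

```python
import json
import tempfile
from pathlib import Path


def dump_seeds(seeds: dict[str, list[str]], path: Path) -> None:
    """Write a {dimension: [seed words]} mapping as JSON (assumed layout)."""
    with path.open("w", encoding="utf-8") as fh:
        json.dump(seeds, fh, indent=2)


# Placeholder dictionary. With the package installed, the value returned by
# load_example_seeds("culture_2021") would take its place (assumption: it
# returns a mapping of this shape).
seeds = {
    "dimension_a": ["seed_one", "seed_two"],
    "dimension_b": ["seed_three"],
}
seeds_path = Path(tempfile.mkdtemp()) / "seeds.json"
dump_seeds(seeds, seeds_path)
round_trip = json.loads(seeds_path.read_text(encoding="utf-8"))
```

The resulting file path is what would be passed to `--seeds`.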

Phase 1 (preprocessor) flags

Flag	Type	Default	Notes
`--preprocessor`	choice	`corenlp`	`corenlp` (paper-faithful, needs Java) / `spacy` (fastest) / `stanza` (Python-native) / `static` (list-only) / `none`.
`--mwe-list`	str	`none`	Optional static MWE list post-pass. `none` skips it. `finance` uses the packaged earnings-call example. A file path loads a custom list.
`--spacy-model`	str	`en_core_web_sm`	spaCy model name when `--preprocessor=spacy`. `en_core_web_trf` gives the best NER but is slower.
`--n-cores`	int	`4`	JVM threads for CoreNLP, `n_process` for spaCy / stanza. 4 is safe on an 8-core laptop; 8 on a workstation.

Phase 2 (gensim Phrases) flags

Flag	Type	Default	Notes
`--no-phrases`	flag	off	Skip the gensim `Phrases` pass entirely.
`--phrase-passes`	int	`2`	1 = bigram only. 2 = bigram then trigram.
`--phrase-min-count`	int	`10`	Minimum bigram count. Lower on small corpora.
`--phrase-threshold`	float	`10.0`	gensim `Phrases` score threshold.
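
To see how `--phrase-min-count` and `--phrase-threshold` interact, it helps to look at gensim's default `Phrases` scorer, which computes `(bigram_count - min_count) / count_a / count_b * vocab_size` and joins the pair when the score exceeds the threshold. The sketch below reimplements that formula in plain Python, assuming the package leaves gensim's default scorer in place; the function names and counts are illustrative only.

```python
def phrase_score(count_a: int, count_b: int, count_ab: int,
                 vocab_size: int, min_count: int = 10) -> float:
    """Default gensim Phrases score for the bigram (a, b)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


def is_phrase(count_a: int, count_b: int, count_ab: int, vocab_size: int,
              min_count: int = 10, threshold: float = 10.0) -> bool:
    """A bigram is joined when its score exceeds the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold


# Illustrative counts: a frequent pair versus a rare one.
frequent = is_phrase(count_a=200, count_b=150, count_ab=60, vocab_size=10_000)
rare = is_phrase(count_a=200, count_b=150, count_ab=12, vocab_size=10_000)
score = phrase_score(200, 150, 60, 10_000)
```

Because `min_count` is subtracted from the bigram count, a pair seen no more often than `min_count` can never clear a positive threshold, which is why lowering `--phrase-min-count` matters on small corpora.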

Word2Vec flags

Flag	Type	Default
`--w2v-dim`	int	`300`
`--w2v-window`	int	`5`
`--w2v-min-count`	int	`5`
`--w2v-epochs`	int	`20`
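
The table above carries no notes, so the sketch below spells out what each flag would mean if it is passed straight through to the gensim 4.x `Word2Vec` constructor. That one-to-one mapping is an assumption, and `w2v_kwargs` is a hypothetical helper; only the gensim argument names (`vector_size`, `window`, `min_count`, `epochs`) are taken from gensim itself.

```python
def w2v_kwargs(dim: int = 300, window: int = 5,
               min_count: int = 5, epochs: int = 20) -> dict[str, int]:
    """Translate the CLI defaults into gensim 4.x Word2Vec keyword arguments."""
    return {
        "vector_size": dim,      # --w2v-dim: embedding dimensionality
        "window": window,        # --w2v-window: max distance between target and context word
        "min_count": min_count,  # --w2v-min-count: drop words rarer than this
        "epochs": epochs,        # --w2v-epochs: training passes over the corpus
    }


defaults = w2v_kwargs()
small_corpus = w2v_kwargs(dim=100, min_count=2)
```

With gensim installed, such a dictionary could be unpacked as `Word2Vec(sentences=..., **defaults)`.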

Dictionary and scoring flags

Flag	Type	Default	Notes
`--n-words-dim`	int	`500`	Top-k expanded words per dimension.
`--methods`	list	`TF TFIDF WFIDF`	Any subset of `TF`, `TFIDF`, `WFIDF`, `TFIDF+SIMWEIGHT`, `WFIDF+SIMWEIGHT`.
`--zca-whiten`	flag	off	Decorrelate dimension columns. See Whiten the dimension scores.
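
The three default methods differ only in how each dictionary-word count is weighted. The sketch below uses the textbook forms: TF is the raw count, TFIDF is `tf * log(N / df)`, and WFIDF is `(1 + log tf) * log(N / df)`. It has not been checked against the package's code, applies no length normalization, and omits the `+SIMWEIGHT` variants; `score_dimension` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def score_dimension(docs: list[list[str]], dim_words: set[str],
                    method: str = "TF") -> list[float]:
    """Score each tokenized document on one dimension's word list."""
    if method not in {"TF", "TFIDF", "WFIDF"}:
        raise ValueError(f"unsupported method: {method}")
    n_docs = len(docs)
    # Document frequency: number of documents containing each dictionary word.
    df: Counter = Counter()
    for tokens in docs:
        df.update(dim_words.intersection(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in dim_words)
        total = 0.0
        for word, tf in counts.items():
            if method == "TF":
                total += tf
                continue
            idf = math.log(n_docs / df[word])
            weight = tf if method == "TFIDF" else 1.0 + math.log(tf)
            total += weight * idf
        scores.append(total)
    return scores


# Toy corpus, for illustration only.
docs = [["integrity", "trust", "integrity"], ["growth", "trust"], ["growth"]]
dim_words = {"integrity", "trust"}
tf = score_dimension(docs, dim_words, "TF")
tfidf = score_dimension(docs, dim_words, "TFIDF")
wfidf = score_dimension(docs, dim_words, "WFIDF")
```

WFIDF dampens repeated mentions, so a document that repeats one dictionary word many times gains less than it would under TFIDF.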

Output layout

After a successful run, --out contains:

runs/out/
├── config.json              # frozen snapshot of the Config used
├── parsed/
│   ├── sentences.txt        # preprocessor output, one sentence per line
│   └── sentence_ids.txt     # parallel IDs, one per sentence
├── cleaned/
│   └── sentences.txt        # stopwords and punctuation dropped
├── corpora/                 # only if gensim Phrases is enabled
│   ├── pass1.txt            # bigram-joined sentences
│   └── pass2.txt            # trigram-joined sentences
├── models/
│   ├── phrases_pass1.mod    # saved gensim Phrases models
│   ├── phrases_pass2.mod
│   └── w2v.mod              # saved Word2Vec model
└── outputs/
    ├── expanded_dict.csv    # one column per dimension, ranked words
    ├── scores_TF.csv        # document-level scores, one file per method
    ├── scores_TFIDF.csv
    └── scores_WFIDF.csv

Every scores_*.csv has columns: Doc_ID, one column per dimension (alphabetical), document_length.
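
Given that column layout, a scores file can be read with the standard library alone. The sketch below is a minimal example; `load_scores` is a hypothetical helper, and the sample file with its dimension names and numbers is made up so the example runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def load_scores(path: Path):
    """Return (dimension names, {doc_id: {dimension: score}}, {doc_id: length})."""
    meta = {"Doc_ID", "document_length"}
    scores, lengths = {}, {}
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        # Every column that is not metadata is treated as a dimension.
        dims = [c for c in (reader.fieldnames or []) if c not in meta]
        for row in reader:
            scores[row["Doc_ID"]] = {d: float(row[d]) for d in dims}
            lengths[row["Doc_ID"]] = int(float(row["document_length"]))
    return dims, scores, lengths


# Made-up sample file mimicking the documented column layout.
sample = Path(tempfile.mkdtemp()) / "scores_TF.csv"
sample.write_text(
    "Doc_ID,dimension_a,dimension_b,document_length\n"
    "call_001,3,1,120\n"
    "call_002,0,2,95\n",
    encoding="utf-8",
)
dims, scores, lengths = load_scores(sample)
```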

Common invocations

Quickest: use a CSV with paper defaults.

lmsy-w2v-rfs run --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds seeds.json --out runs/rfs2021

Paper-faithful CoreNLP, 8 threads, all three scoring methods:

lmsy-w2v-rfs download-corenlp             # one time
lmsy-w2v-rfs run \
  --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds seeds.json \
  --out runs/corenlp8 \
  --preprocessor corenlp --n-cores 8

Fast spaCy path with custom seeds:

lmsy-w2v-rfs run \
  --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds my_seeds.txt \
  --preprocessor spacy --spacy-model en_core_web_sm --n-cores 8 \
  --out runs/spacy_custom

Directory of SEC filings, static MWE list post-pass, ZCA whitening:

lmsy-w2v-rfs run \
  --input ./10k_filings/ --input-format directory --glob-pattern "*.txt" \
  --seeds seeds.json \
  --mwe-list finance --zca-whiten \
  --out runs/10k

JSON Lines input, no parser, rely on gensim Phrases only:

lmsy-w2v-rfs run \
  --input dump.jsonl --input-format jsonl \
  --text-key body --id-key id \
  --seeds seeds.json \
  --preprocessor none --phrase-min-count 5 \
  --out runs/jsonl_demo

`download-corenlp`

Installs Stanford CoreNLP into $LMSY_W2V_RFS_HOME/corenlp (or ~/.cache/lmsy_w2v_rfs/corenlp if the env var is unset). Delegates to stanza.install_corenlp. Prints the install path on completion.

Required once before the first --preprocessor corenlp run. Takes no flags. Requires network access and ~1 GB free disk.

lmsy-w2v-rfs download-corenlp
# CoreNLP installed at: /Users/you/.cache/lmsy_w2v_rfs/corenlp
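
The install location rule described above can be sketched as follows. `corenlp_dir` is a hypothetical helper written for illustration, not the package's own function, and it assumes that an empty `LMSY_W2V_RFS_HOME` is treated the same as an unset one.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def corenlp_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the CoreNLP install directory from the environment."""
    env = os.environ if env is None else env
    home = env.get("LMSY_W2V_RFS_HOME")
    # Fall back to the per-user cache directory when the variable is unset.
    base = Path(home) if home else Path.home() / ".cache" / "lmsy_w2v_rfs"
    return base / "corenlp"


custom = corenlp_dir({"LMSY_W2V_RFS_HOME": "/data/lmsy"})
default = corenlp_dir({})
```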

Getting help

Every subcommand prints its own help with --help:

lmsy-w2v-rfs --help
lmsy-w2v-rfs run --help
lmsy-w2v-rfs download-corenlp --help

The help output for run groups its flags the same way as the sections above: input, output, seeds, phase 1, phase 2, Word2Vec, and scoring.