Command-line interface

The package ships a lmsy-w2v-rfs console script with two subcommands.

  • run: end-to-end pipeline from documents on disk to scores on disk.
  • download-corenlp: one-time Stanford CoreNLP install.

For an end-to-end walkthrough, see Run from the command line.


run

Reads documents from --input, then parses and cleans them, trains Word2Vec, expands the seed dictionary, and scores every document on every dimension. All artifacts are written under --out.

The command exits with code 0 on success and 1 on argparse usage errors. Python exceptions propagate with their native traceback and a non-zero exit code.
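
A batch script can act on these exit codes. The following is a minimal sketch using Python's subprocess module; the helper name run_cli is hypothetical and not part of the package, and it relies only on the exit-code convention described above.

```python
import subprocess


def run_cli(cmd):
    """Run a command given as a list of arguments and return its exit code."""
    # check=False reports a non-zero exit as a return value instead of raising.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode
```

A caller could pass the full argument list, for example ["lmsy-w2v-rfs", "run", "--help"], and stop the batch whenever the returned code is not 0.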

Input flags

Flag | Type | Default | Notes
--input, -i | path | required | Input path; interpretation is set by --input-format.
--input-format | choice | text | One of text, csv, jsonl, directory.
--ids | path | None | [text] Optional matching IDs file, one ID per line. If omitted, line numbers are used.
--text-col | str | text | [csv] Column holding document text.
--id-col | str | id | [csv] Column holding document IDs. Set to "" to use the DataFrame row index.
--text-key | str | text | [jsonl] JSON key holding document text.
--id-key | str | id | [jsonl] JSON key holding document IDs. Set to "" to use 1-based line numbers.
--glob-pattern | str | *.txt | [directory] Glob for files under --input. Document IDs come from file stems.
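
To make the jsonl format concrete, here is a minimal sketch that writes documents as one JSON object per line, with the text and ID stored under configurable keys. The helper name write_jsonl is hypothetical; its default key names simply mirror the --text-key and --id-key defaults listed above.

```python
import json
from pathlib import Path


def write_jsonl(docs, path, text_key="text", id_key="id"):
    """Write (doc_id, text) pairs as one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for doc_id, text in docs:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            fh.write(json.dumps({id_key: doc_id, text_key: text}) + "\n")
    return path
```

A file written with text_key="body" would then be read with --input-format jsonl --text-key body.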

Output flags

Flag | Type | Default | Notes
--out, -o | path | runs/out | Output directory. Stages are idempotent: rerunning skips stages whose outputs already exist.
--force | flag | off | Rerun every stage regardless of existing outputs.
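
The skip-if-present behaviour can be pictured with a small sketch. This is an illustration of the idea only, not the package's actual implementation; the function name should_run is hypothetical.

```python
from pathlib import Path


def should_run(outputs, force=False):
    """Return True if a stage must run: it is forced, or an expected output is missing."""
    return force or not all(Path(p).exists() for p in outputs)
```

Under this reading, deleting a stage's output files makes only that stage run again, while --force corresponds to force=True for every stage.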

Seeds flag

Flag | Type | Default | Notes
--seeds | path | required | Seed dictionary file (.json or .txt). The package is theory-agnostic and has no built-in default. To reproduce the 2021 paper, dump load_example_seeds("culture_2021") to a JSON file. See Use your own seed dictionary.
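
Dumping the example seeds to JSON might look like the sketch below. It rests on two assumptions that this page does not confirm: that load_example_seeds returns a mapping from dimension name to a collection of seed words, and that the function is importable from the package's top level. The helper dump_seeds is hypothetical.

```python
import json
from pathlib import Path


def dump_seeds(seeds, path):
    """Write a seed mapping to a JSON file and return the path."""
    path = Path(path)
    # Convert each word collection to a list so sets or tuples serialize cleanly.
    payload = {dimension: list(words) for dimension, words in seeds.items()}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


# Hypothetical usage (the import path is an assumption, not confirmed here):
# from lmsy_w2v_rfs import load_example_seeds
# dump_seeds(load_example_seeds("culture_2021"), "culture_2021.json")
```

The resulting file would then be passed as --seeds culture_2021.json.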

Phase 1 (preprocessor) flags

Flag | Type | Default | Notes
--preprocessor | choice | corenlp | corenlp (paper-faithful, needs Java) / spacy (fastest) / stanza (Python-native) / static (list-only) / none.
--mwe-list | str | none | Optional static MWE list applied as a post-pass. none skips it, finance uses the packaged earnings-call example, and a file path loads a custom list.
--spacy-model | str | en_core_web_sm | spaCy model name when --preprocessor=spacy. en_core_web_trf gives the best NER but is slower.
--n-cores | int | 4 | JVM threads for CoreNLP, n_process for spaCy / stanza. 4 is safe on an 8-core laptop; use 8 on a workstation.

Phase 2 (gensim Phrases) flags

Flag | Type | Default | Notes
--no-phrases | flag | off | Skip the gensim Phrases pass entirely.
--phrase-passes | int | 2 | 1 = bigram only. 2 = bigram then trigram.
--phrase-min-count | int | 10 | Minimum bigram count. Lower on small corpora.
--phrase-threshold | float | 10.0 | gensim Phrases score threshold.
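
To show how --phrase-min-count and --phrase-threshold interact, the sketch below reimplements gensim's default Phrases scoring formula in plain Python. It assumes the package leaves gensim's default scorer in place, which this page does not state; the function names are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=10):
    """Score a candidate bigram with gensim's default (Mikolov-style) formula."""
    if count_a == 0 or count_b == 0:
        return float("-inf")
    # Bigrams seen no more than min_count times score at or below zero.
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


def is_phrase(count_a, count_b, count_ab, vocab_size, min_count=10, threshold=10.0):
    """A bigram is joined only when its score exceeds the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold
```

Because min_count is subtracted from the bigram count, lowering --phrase-min-count on a small corpus raises every score and lets more candidates through at the same threshold.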

Word2Vec flags

Flag | Type | Default
--w2v-dim | int | 300
--w2v-window | int | 5
--w2v-min-count | int | 5
--w2v-epochs | int | 20

Dictionary and scoring flags

Flag | Type | Default | Notes
--n-words-dim | int | 500 | Top-k expanded words per dimension.
--methods | list | TF TFIDF WFIDF | Any subset of TF, TFIDF, WFIDF, TFIDF+SIMWEIGHT, WFIDF+SIMWEIGHT.
--zca-whiten | flag | off | Decorrelate dimension columns. See Whiten the dimension scores.

Output layout

After a successful run, --out contains:

runs/out/
├── config.json              # frozen snapshot of the Config used
├── parsed/
│   ├── sentences.txt        # preprocessor output, one sentence per line
│   └── sentence_ids.txt     # parallel IDs, one per sentence
├── cleaned/
│   └── sentences.txt        # stopwords and punctuation dropped
├── corpora/                 # only if gensim Phrases is enabled
│   ├── pass1.txt            # bigram-joined sentences
│   └── pass2.txt            # trigram-joined sentences
├── models/
│   ├── phrases_pass1.mod    # saved gensim Phrases models
│   ├── phrases_pass2.mod
│   └── w2v.mod              # saved Word2Vec model
└── outputs/
    ├── expanded_dict.csv    # one column per dimension, ranked words
    ├── scores_TF.csv        # document-level scores, one file per method
    ├── scores_TFIDF.csv
    └── scores_WFIDF.csv

Every scores_*.csv has columns: Doc_ID, one column per dimension (alphabetical), document_length.
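
Given that column layout, a scores file can be inspected with the standard library alone. The sketch below ranks documents on one dimension; the function name top_documents is hypothetical, and the dimension name you pass must match a column in your own file.

```python
import csv


def top_documents(path, dimension, k=5):
    """Return the k (Doc_ID, score) pairs with the highest score on one dimension."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = [(row["Doc_ID"], float(row[dimension])) for row in csv.DictReader(fh)]
    # Sort by score, highest first; ties keep file order because sorted() is stable.
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:k]
```

The same reader works for every scores_*.csv because the files share one layout and differ only in the scoring method.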

Common invocations

Quickest: use a CSV with paper defaults.

lmsy-w2v-rfs run --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds culture_2021.json --out runs/rfs2021

Paper-faithful CoreNLP, 8 threads, all three scoring methods:

lmsy-w2v-rfs download-corenlp             # one time
lmsy-w2v-rfs run \
  --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds culture_2021.json \
  --out runs/corenlp8 \
  --preprocessor corenlp --n-cores 8

Fast spaCy path with custom seeds:

lmsy-w2v-rfs run \
  --input transcripts.csv --input-format csv \
  --text-col transcript --id-col call_id \
  --seeds my_seeds.txt \
  --preprocessor spacy --spacy-model en_core_web_sm --n-cores 8 \
  --out runs/spacy_custom

Directory of SEC filings, static MWE list post-pass, ZCA whitening:

lmsy-w2v-rfs run \
  --input ./10k_filings/ --input-format directory --glob-pattern "*.txt" \
  --seeds my_seeds.txt \
  --mwe-list finance --zca-whiten \
  --out runs/10k

JSON Lines input, no parser, rely on gensim Phrases only:

lmsy-w2v-rfs run \
  --input dump.jsonl --input-format jsonl \
  --text-key body --id-key id \
  --seeds my_seeds.txt \
  --preprocessor none --phrase-min-count 5 \
  --out runs/jsonl_demo

download-corenlp

Installs Stanford CoreNLP into $LMSY_W2V_RFS_HOME/corenlp (or ~/.cache/lmsy_w2v_rfs/corenlp if the env var is unset). Delegates to stanza.install_corenlp. Prints the install path on completion.

Required once before the first --preprocessor corenlp run. Takes no flags. Requires network access and ~1 GB free disk.

lmsy-w2v-rfs download-corenlp
# CoreNLP installed at: /Users/you/.cache/lmsy_w2v_rfs/corenlp
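
The install location rule can be expressed as a short sketch. This mirrors the description above rather than the package's own code; the function name corenlp_dir is hypothetical, and treating an empty LMSY_W2V_RFS_HOME the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def corenlp_dir(env=None):
    """Resolve the CoreNLP install directory from the environment."""
    env = os.environ if env is None else env
    home = env.get("LMSY_W2V_RFS_HOME")
    # Assumption: an empty value falls back to the default cache directory.
    base = Path(home) if home else Path.home() / ".cache" / "lmsy_w2v_rfs"
    return base / "corenlp"
```

Checking whether this directory exists is a quick way to tell if download-corenlp still needs to be run on a given machine.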

Getting help

Every subcommand prints its own help with --help:

lmsy-w2v-rfs --help
lmsy-w2v-rfs run --help
lmsy-w2v-rfs download-corenlp --help

The help output for run is grouped (input / output / seeds / phase 1 / phase 2 / word2vec / scoring) in the same order as the flag tables above.