
Word2Vec

Thin wrappers around gensim 4's Word2Vec. train_word2vec streams the corpus from disk via PathLineSentences so the full text never has to fit in RAM. All hyperparameters (vector size, window, min count, epochs, worker count, random seed) come from the Config dataclass. Pipeline.train calls train_word2vec when no saved model exists and load_word2vec otherwise.
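To illustrate why streaming keeps memory use bounded, here is a minimal sketch of a restartable line iterator. It is a stand-in written for this page, not gensim's PathLineSentences implementation: the class name LineSentences and the demo function are hypothetical, and whitespace tokenization is an assumption.

```python
from pathlib import Path
from typing import Iterator, List
import tempfile


class LineSentences:
    """Hypothetical stand-in: yields one whitespace-tokenized sentence per line."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on every pass so each training epoch can re-read it;
        # only the current line is held in memory.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


def demo() -> List[List[str]]:
    """Write a tiny corpus to a temp file and stream it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "pass2.txt"
        corpus.write_text(
            "we value innovation\n\nteamwork drives results\n", encoding="utf-8"
        )
        sentences = LineSentences(corpus)
        first_pass = list(sentences)
        second_pass = list(sentences)  # a second epoch sees the same data
        assert first_pass == second_pass
        return first_pass
```

The important property is that the object is an iterable rather than a one-shot generator: Word2Vec makes one pass to build the vocabulary and further passes for each epoch, so the corpus must be re-readable.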


train_word2vec

Fits a gensim.models.Word2Vec on a sentence file and saves the result to disk. The corpus is streamed line by line, so the full training set never has to fit in RAM.

Example:

from pathlib import Path
from lmsy_w2v_rfs import Config, load_example_seeds
from lmsy_w2v_rfs.w2v import train_word2vec

seeds = load_example_seeds("culture_2021")
cfg = Config(seeds=seeds, preprocessor="none", w2v_dim=100, w2v_epochs=5)
model = train_word2vec(
    sentences_path=Path("runs/demo/corpora/pass2.txt"),
    model_path=Path("runs/demo/models/w2v.mod"),
    config=cfg,
)
print(model.wv.most_similar("innovation", topn=5))

Train a Word2Vec model and save it to disk.

Parameters:

Name            Type        Description                           Default
sentences_path  Path | str  Input corpus, one sentence per line.  required
model_path      Path | str  Destination .mod path.                required
config          Config      Pipeline config.                      required

Returns:

Type      Description
Word2Vec  The trained Word2Vec model.
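As a rough sketch of how the hyperparameters held in Config might be handed to gensim, the snippet below maps config fields to the keyword names that gensim 4's Word2Vec constructor accepts (vector_size, window, min_count, epochs, workers, seed). SketchConfig is a hypothetical stand-in for Config: only w2v_dim and w2v_epochs appear in the example above, and the other field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Config, not the package's real dataclass."""

    w2v_dim: int = 100      # field name taken from the example above
    w2v_epochs: int = 5     # field name taken from the example above
    w2v_window: int = 5     # assumed field name
    w2v_min_count: int = 5  # assumed field name
    n_workers: int = 3      # assumed field name
    seed: int = 1           # assumed field name


def word2vec_kwargs(cfg: SketchConfig) -> Dict[str, Any]:
    """Translate config fields into gensim 4 Word2Vec keyword arguments."""
    return {
        "vector_size": cfg.w2v_dim,
        "window": cfg.w2v_window,
        "min_count": cfg.w2v_min_count,
        "epochs": cfg.w2v_epochs,
        "workers": cfg.n_workers,
        "seed": cfg.seed,
    }


kwargs = word2vec_kwargs(SketchConfig(w2v_dim=100, w2v_epochs=5))
```

With gensim installed, these keywords could be passed as Word2Vec(sentences=PathLineSentences(str(sentences_path)), **kwargs). Note that in gensim a fixed seed alone does not make runs fully deterministic; that also requires a single worker thread.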


load_word2vec

Loads a model saved by train_word2vec. Use this to restore a model from a previous run without retraining.

Example:

from pathlib import Path
from lmsy_w2v_rfs.w2v import load_word2vec

model = load_word2vec(Path("runs/demo/models/w2v.mod"))
print(model.wv["innovation"])   # 100-dim vector if the model was trained with w2v_dim=100

Load a saved Word2Vec model.

Parameters:

Name        Type        Description                       Default
model_path  Path | str  Path produced by train_word2vec.  required

Returns:

Type      Description
Word2Vec  The loaded model.
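The train-or-load behaviour attributed to Pipeline.train at the top of this page can be sketched as a small helper. This is an illustration under assumptions, not the package's code: load_or_train and demo are hypothetical names, and the train and load callables stand in for train_word2vec and load_word2vec so the sketch runs without gensim.

```python
from pathlib import Path
from typing import Callable, List, TypeVar
import tempfile

M = TypeVar("M")


def load_or_train(
    model_path: Path,
    train: Callable[[], M],
    load: Callable[[Path], M],
) -> M:
    """Return load(model_path) if a saved model exists, otherwise train()."""
    model_path = Path(model_path)
    if model_path.exists():
        return load(model_path)
    return train()


def demo() -> List[str]:
    """Call the helper twice and record which branch ran each time."""
    calls: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "w2v.mod"

        def fake_train() -> str:
            path.write_text("model", encoding="utf-8")  # simulate saving to disk
            calls.append("train")
            return "trained"

        def fake_load(p: Path) -> str:
            calls.append("load")
            return "loaded"

        load_or_train(path, fake_train, fake_load)  # no file yet: trains
        load_or_train(path, fake_train, fake_load)  # file exists: loads
    return calls
```

One consequence of this pattern is that changing hyperparameters has no effect while an old model file is present; the file would need to be removed or a different model_path used before retraining.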