Preprocessors

Phase 1 is pluggable. Each backend turns a raw document into a list of sentences, where each sentence is a list of lemmatized tokens with multi-word expressions joined by _ and named-entity spans replaced with [NER:TYPE] placeholders. The Pipeline.parse stage selects a backend through Config.preprocessor and instantiates it via build_preprocessor. An optional curated MWE list (Config.mwe_list) is applied as a second pass via apply_mwe_list.

Design background and benchmark data live in Two-phase preprocessing.

The Preprocessor protocol

Any backend that implements process(text) -> list[list[str]] is a valid preprocessor. Backends that benefit from concurrency can also override process_documents.

Bases: Protocol

Phase-1 preprocessor contract.

Implementations turn a raw document into a list of preprocessed sentences. Each sentence is a list of tokens. A token may contain underscores, which join the parts of a multi-word expression, or may be an entity placeholder of the form [NER:TYPE], where TYPE is the entity label. The pipeline's cleaner drops punctuation-only and stopword tokens downstream, so the preprocessor does not need to filter them.

Implementations supply process (one doc) and optionally override process_documents (batch) for concurrent processing. The default batch implementation just loops over process.

process(text: str) -> list[list[str]]

Parse one document.

Parameters:

Name Type Description Default
text str

A raw document string. May contain newlines.

required

Returns:

Type Description
list[list[str]]

A list of sentences, each a list of tokens.

process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]

Parse a stream of documents, possibly concurrently.

The default implementation loops serially. Backends that benefit from concurrency (CoreNLP with a JVM thread pool, spaCy with nlp.pipe(n_process=N)) override this to process documents in parallel.

Parameters:

Name Type Description Default
texts Iterable[str]

Iterable of raw document strings.

required

Yields:

Type Description
list[list[str]]

One preprocessed document at a time, in input order.
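As a rough illustration of the contract above, the following sketch defines a protocol with a serial default for process_documents and a toy backend that satisfies it. LineSplitPreprocessor is a hypothetical name invented for this example, and the Protocol class here is a simplified stand-in, not the package's actual definition.

```python
from typing import Iterable, Iterator, Protocol


class Preprocessor(Protocol):
    """Simplified stand-in for the Phase-1 contract."""

    def process(self, text: str) -> list[list[str]]:
        ...

    def process_documents(self, texts: Iterable[str]) -> Iterator[list[list[str]]]:
        # Default batch behaviour: loop serially over process().
        for text in texts:
            yield self.process(text)


class LineSplitPreprocessor(Preprocessor):
    """Hypothetical backend: one sentence per line, whitespace tokens."""

    def process(self, text: str) -> list[list[str]]:
        return [line.split() for line in text.splitlines() if line.strip()]


pp = LineSplitPreprocessor()
docs = list(pp.process_documents(["a b\nc d", "e f"]))
# docs == [[["a", "b"], ["c", "d"]], [["e", "f"]]]
```

Because the toy backend only supplies process, it inherits the serial batch loop unchanged.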

build_preprocessor

Instantiate the preprocessor named in config.preprocessor.

Parameters:

Name Type Description Default
config 'Config'

Pipeline config.

required

Returns:

Type Description
Preprocessor

A preprocessor instance.

Raises:

Type Description
ImportError

If the chosen backend's optional extra is not installed.

ValueError

If config.preprocessor is not a known name.
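One plausible way to get the ValueError and ImportError behaviour described above is a name-to-factory registry with lazy imports. This is a sketch under stated assumptions, not the package's implementation: the Config dataclass, the backend names "noop" and "optional", and the module name hypothetical_optional_backend are all invented for illustration.

```python
import importlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Config:
    """Hypothetical stand-in for the real Config; only the field used here."""
    preprocessor: str = "noop"


class NoOp:
    """Placeholder backend so the sketch runs without optional extras."""

    def __init__(self, config: Config) -> None:
        self.config = config


def _load_optional_backend(config: Config):
    # Importing lazily means a missing optional extra surfaces as ImportError
    # only when this backend is actually requested.
    module = importlib.import_module("hypothetical_optional_backend")
    return module.Backend(config)


# Backend names here are assumptions made for the example.
_BACKENDS: dict[str, Callable[[Config], object]] = {
    "noop": NoOp,
    "optional": _load_optional_backend,
}


def build_preprocessor(config: Config):
    try:
        factory = _BACKENDS[config.preprocessor]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(
            f"unknown preprocessor {config.preprocessor!r}; expected one of: {known}"
        ) from None
    return factory(config)
```

Keeping the import inside the factory function means users who never select that backend never need its dependencies installed.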

NoOpPreprocessor

Splits on sentence-ending punctuation, tokenizes on whitespace, and lowercases; no parsing and no NER. This is the fastest path, useful for tests and for users who only want the gensim Phrases + Word2Vec half of the pipeline.

Trivial preprocessor: split on sentence-ending punctuation, lowercase.

Fastest possible path. Useful for quick iteration, for tests, and for users who only want the gensim Phrases + Word2Vec half of the pipeline with a curated static MWE list applied afterwards.

process(text: str) -> list[list[str]]

Split a document into whitespace-tokenized, lowercased sentences.

Parameters:

Name Type Description Default
text str

Raw document.

required

Returns:

Type Description
list[list[str]]

List of sentences, each a list of tokens.
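The behaviour described for this backend can be approximated in a few lines. The sketch below assumes that a sentence ends at ".", "!" or "?" followed by whitespace; the actual splitting rule is not stated in this page, and noop_process is a hypothetical function name.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def noop_process(text: str) -> list[list[str]]:
    """Split into sentences, lowercase, and tokenize on whitespace."""
    sentences = []
    for chunk in _SENTENCE_END.split(text.strip()):
        tokens = chunk.lower().split()
        if tokens:  # skip empty chunks, e.g. for an empty document
            sentences.append(tokens)
    return sentences


result = noop_process("Revenue rose 5%. Costs fell!\nMargins improved")
# result == [["revenue", "rose", "5%."], ["costs", "fell!"], ["margins", "improved"]]
```

Punctuation stays attached to the neighbouring word in this sketch, since nothing here attempts real tokenization.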

StaticMWEPreprocessor

No parser, no NER. Splits the text into sentences and applies NLTK's MWETokenizer with a curated MWE list (the packaged "finance" list or a list loaded from a user-supplied path). Deterministic, Java-free, and free of ML models.

Whitespace tokenize + NLTK MWE concatenation.

Attributes:

Name Type Description
mwe_list

Loaded MWE tuples used for tokenization.

lowercase

Whether to lowercase tokens before matching.

__init__(mwe_source: str | Path = 'finance', lowercase: bool = True) -> None

Load the MWE list from mwe_source and store the lowercase setting.

Parameters:

Name Type Description Default
mwe_source str | Path

"finance" for the packaged list, or a filesystem path to a newline-delimited list.

'finance'
lowercase bool

Whether to lowercase tokens before matching.

True

process(text: str) -> list[list[str]]

Tokenize and apply the MWE list.

Parameters:

Name Type Description Default
text str

Raw document.

required

Returns:

Type Description
list[list[str]]

List of sentences, each a list of tokens (MWEs joined by _).

StanzaPreprocessor

Neural stanza pipeline (tokenize, pos, lemma, depparse, ner) without Java. Slowest of the three parser-based backends on CPU. Produces the largest vocabulary because stanza's NER model uses a finer-grained set of entity types than CoreNLP's.

stanza.Pipeline-based preprocessor.

Attributes:

Name Type Description
nlp

Loaded stanza Pipeline.

__init__(config: 'Config') -> None

Load the stanza pipeline.

Parameters:

Name Type Description Default
config 'Config'

Pipeline config.

required

Raises:

Type Description
ImportError

If stanza is not installed.

process(text: str) -> list[list[str]]

Parse one document.

Parameters:

Name Type Description Default
text str

Raw document.

required

Returns:

Type Description
list[list[str]]

List of sentences, each a list of tokens.

CoreNLPPreprocessor

Paper-exact reproduction path. Holds a warm Stanford CoreNLP JVM open via stanza.server.CoreNLPClient and fans requests across its thread pool. Use as a context manager to guarantee server shutdown. Requires Java 8+ and the [corenlp] extra.

CoreNLP-server-based preprocessor.

The client is started when the first document is processed and kept warm until close() is called or the instance is garbage collected. Use as a context manager (with CoreNLPPreprocessor(cfg) as pp:) to guarantee clean shutdown.

__init__(config: 'Config') -> None

Stand up the CoreNLP client.

Parameters:

Name Type Description Default
config 'Config'

Pipeline config.

required

Raises:

Type Description
ImportError

If the corenlp extra is not installed.

RuntimeError

If Java is not on PATH.

close() -> None

Shut down the CoreNLP server.
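The context-manager usage recommended above follows the usual __enter__ / __exit__ pattern. The sketch below shows that pattern with a fake, in-process "server" flag; WarmClientPreprocessor is a hypothetical class written for this example and does not talk to CoreNLP.

```python
class WarmClientPreprocessor:
    """Hypothetical stand-in that only demonstrates the lifecycle pattern."""

    def __init__(self) -> None:
        self.running = False

    def _ensure_started(self) -> None:
        # Lazily "start" the backing server on first use.
        if not self.running:
            self.running = True

    def process(self, text: str) -> list[list[str]]:
        self._ensure_started()
        return [text.split()]

    def close(self) -> None:
        # Safe to call more than once.
        self.running = False

    def __enter__(self) -> "WarmClientPreprocessor":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the body of the with-block raises.
        self.close()


with WarmClientPreprocessor() as pp:
    sentences = pp.process("hello world")
    was_running = pp.running
# After the block, pp.running is False again.
```

Relying on garbage collection alone would leave the shutdown time unpredictable, which is why the with-block form is the safer default for a long-lived server process.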

process(text: str) -> list[list[str]]

Parse one document.

Parameters:

Name Type Description Default
text str

Raw document.

required

Returns:

Type Description
list[list[str]]

List of sentences, each a list of tokens.

process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]

Fan out annotate requests across the JVM thread pool.

The CoreNLP server has a thread pool of size config.n_cores. Sending requests serially leaves those threads idle. This override keeps up to n_cores requests in flight through a Python ThreadPoolExecutor so that the JVM can process them concurrently.

Parameters:

Name Type Description Default
texts Iterable[str]

Iterable of raw documents.

required

Yields:

Type Description
list[list[str]]

Preprocessed documents in input order.
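A bounded fan-out that still yields results in input order could look like the sketch below. It is an illustration of the general technique, not the package's code: process_in_order and the annotate callable are hypothetical, and annotate stands in for whatever call actually sends one document to the server.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def process_in_order(
    texts: Iterable[str],
    annotate: Callable[[str], list[list[str]]],
    n_cores: int,
) -> Iterator[list[list[str]]]:
    """Keep at most n_cores requests in flight; yield results in input order."""
    pending: deque[Future] = deque()
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        for text in texts:
            pending.append(pool.submit(annotate, text))
            if len(pending) >= n_cores:
                # Block on the oldest request so output order matches input order.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


def fake_annotate(text: str) -> list[list[str]]:
    # Stand-in for a server round trip.
    return [text.lower().split()]


ordered = list(process_in_order(["A b", "C d", "E"], fake_annotate, n_cores=2))
# ordered == [[["a", "b"]], [["c", "d"]], [["e"]]]
```

The FIFO queue of futures bounds memory on long input streams, at the cost of occasionally waiting on a slow document while faster ones behind it have already finished.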

SpacyPreprocessor

Parses, lemmatizes, masks entities, and joins fixed / flat / compound dependency pairs. On the 150-document bakeoff it was 9x faster than stanza and 2x faster than CoreNLP, with the cleanest NER output and the smallest Word2Vec-ready vocabulary. Use when Java is not available and parse speed matters; "corenlp" is the package default for paper-faithful results.

spaCy-based preprocessor.

Attributes:

Name Type Description
nlp

Loaded spaCy Language.

model_name

Name of the loaded model.

__init__(config: 'Config') -> None

Load the spaCy model named in config.spacy_model.

Parameters:

Name Type Description Default
config 'Config'

Pipeline config.

required

Raises:

Type Description
ImportError

If spaCy is not installed.

OSError

If the requested model is not downloaded.

process(text: str) -> list[list[str]]

Parse one document.

Parameters:

Name Type Description Default
text str

Raw document.

required

Returns:

Type Description
list[list[str]]

List of sentences, each a list of tokens.
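To make the entity-masking step concrete, here is a small sketch that works on plain tuples instead of spaCy objects. It assumes each token carries a lemma, an IOB tag ("B", "I" or "O") and an entity type, and that each entity span collapses into a single placeholder; whether the real backend emits one placeholder per span or per token is not stated in this page. mask_entities is a hypothetical helper.

```python
def mask_entities(tokens: list[tuple[str, str, str]]) -> list[str]:
    """Replace each entity span with one [NER:TYPE] placeholder.

    Each input token is (lemma, iob, ent_type), where iob is "B" for the
    first token of an entity, "I" for a continuation, and "O" otherwise.
    """
    out = []
    for lemma, iob, ent_type in tokens:
        if iob == "B":
            out.append(f"[NER:{ent_type}]")
        elif iob == "I":
            continue  # already covered by the placeholder emitted at "B"
        else:
            out.append(lemma)
    return out


sentence = [
    ("Apple", "B", "ORG"),
    ("Inc.", "I", "ORG"),
    ("buy", "O", ""),
    ("a", "O", ""),
    ("startup", "O", ""),
    ("in", "O", ""),
    ("New", "B", "GPE"),
    ("York", "I", "GPE"),
]
masked = mask_entities(sentence)
# masked == ["[NER:ORG]", "buy", "a", "startup", "in", "[NER:GPE]"]
```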

process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]

Fan out via nlp.pipe(n_process=n_cores).

nlp.pipe is spaCy's native way to process many documents: it batches them internally and, with n_process > 1, distributes the batches across worker processes. __init__ calls torch.set_num_threads(1) so that PyTorch in each worker does not oversubscribe the available cores.

Parameters:

Name Type Description Default
texts Iterable[str]

Iterable of raw documents.

required

Yields:

Type Description
list[list[str]]

Preprocessed documents in input order.

load_mwe_list

Loads a curated MWE list, either the packaged "finance" list or a newline-delimited file at a user-supplied path.

Load a curated MWE list from the packaged data/ or a file path.

Parameters:

Name Type Description Default
source str | Path

Either "finance" for the packaged finance list, or a filesystem path to a newline-delimited list.

required

Returns:

Type Description
list[tuple[str, ...]]

List of tuples, each a token sequence for NLTK MWETokenizer.

Raises:

Type Description
FileNotFoundError

If the source cannot be resolved.
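For the file-path case, a loader might look like the sketch below. It assumes each line holds one expression with its tokens separated by whitespace, which this page does not confirm, and it leaves out the packaged "finance" lookup entirely. load_mwe_file is a hypothetical name.

```python
import tempfile
from pathlib import Path


def load_mwe_file(path) -> list[tuple[str, ...]]:
    """Read a newline-delimited MWE file into token tuples."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"MWE list not found: {path}")
    mwes = []
    for line in path.read_text(encoding="utf-8").splitlines():
        tokens = tuple(line.split())  # assumption: whitespace-separated tokens
        if tokens:  # skip blank lines
            mwes.append(tokens)
    return mwes


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "mwes.txt"
    demo.write_text("customer commitment\n\nfree cash flow\n", encoding="utf-8")
    loaded = load_mwe_file(demo)
# loaded == [("customer", "commitment"), ("free", "cash", "flow")]
```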

apply_mwe_list

Applies NLTK MWETokenizer to each sentence as a post-pass. Splits around [NER:*] placeholders so MWE patterns do not match across entity boundaries.

Apply NLTK MWETokenizer to each sentence as a post-pass.

Designed to run AFTER the main preprocessor, so MWE patterns the parser missed (for example, customer_commitment) still get joined.

Parameters:

Name Type Description Default
sentences Iterable[list[str]]

Sentences of tokens.

required
mwe_list list[tuple[str, ...]] | None

Loaded MWE tuples. If None or empty, the input is returned unchanged.

required

Returns:

Type Description
list[list[str]]

Sentences with matching MWEs joined by _.
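The post-pass can be pictured as follows. This sketch swaps in a simple pure-Python longest-first matcher in place of NLTK's MWETokenizer so that it runs without dependencies, and it splits each sentence at tokens starting with "[NER:" so that no match spans a placeholder. apply_mwes and _join_mwes are hypothetical names, not the package's functions.

```python
def _join_mwes(tokens: list[str], mwes: list[tuple[str, ...]], sep: str = "_") -> list[str]:
    """Greedy left-to-right join; mwes must be sorted longest first."""
    out, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            n = len(mwe)
            if tuple(tokens[i:i + n]) == mwe:
                out.append(sep.join(mwe))
                i += n
                break
        else:
            # No pattern starts at position i; keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


def apply_mwes(sentences, mwe_list):
    if not mwe_list:
        return [list(s) for s in sentences]
    # Drop empty patterns and prefer longer matches.
    mwes = sorted((m for m in mwe_list if m), key=len, reverse=True)
    result = []
    for sentence in sentences:
        joined, segment = [], []
        for token in sentence:
            if token.startswith("[NER:"):
                # Flush the segment before the placeholder, then keep the placeholder.
                joined.extend(_join_mwes(segment, mwes))
                joined.append(token)
                segment = []
            else:
                segment.append(token)
        joined.extend(_join_mwes(segment, mwes))
        result.append(joined)
    return result


out = apply_mwes(
    [["strong", "customer", "commitment", "from", "[NER:ORG]"]],
    [("customer", "commitment")],
)
# out == [["strong", "customer_commitment", "from", "[NER:ORG]"]]
```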