Preprocessors¶

Phase 1 is pluggable. Each backend turns a raw document into a list of sentences, where each sentence is a list of lemmatized tokens with multi-word expressions joined by _ and named-entity spans replaced with [NER:TYPE] placeholders. The Pipeline.parse stage selects a backend through Config.preprocessor and instantiates it via build_preprocessor. An optional curated MWE list (Config.mwe_list) is applied as a second pass via apply_mwe_list.

Design background and benchmark data live in Two-phase preprocessing.

The Preprocessor protocol¶

Any backend that implements process(text) -> list[list[str]] is a valid preprocessor. Backends that benefit from concurrency can also override process_documents.

Bases: Protocol

Phase-1 preprocessor contract.

Implementations turn a raw document into a list of preprocessed sentences. Each sentence is a list of tokens. Tokens may contain underscores for multi-word expressions and the literal string [NER:TYPE] as an entity placeholder. The pipeline's cleaner drops punctuation-only and stopword tokens downstream; the preprocessor does not have to.

Implementations supply process (one doc) and optionally override process_documents (batch) for concurrent processing. The default batch implementation just loops over process.
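The contract above can be sketched as a `typing.Protocol` with a default batch method. This is a minimal illustration, not the package's source: `PreprocessorSketch` and `LineSplitPreprocessor` are hypothetical names, and the one-sentence-per-line rule is an assumption made only to keep the toy backend short.

```python
from __future__ import annotations

from typing import Iterable, Iterator, Protocol


class PreprocessorSketch(Protocol):
    """Hypothetical restatement of the Phase-1 contract."""

    def process(self, text: str) -> list[list[str]]:
        """Turn one raw document into a list of token lists."""
        ...

    def process_documents(
        self, texts: Iterable[str]
    ) -> Iterator[list[list[str]]]:
        # Default batch behaviour: loop serially over process().
        for text in texts:
            yield self.process(text)


class LineSplitPreprocessor(PreprocessorSketch):
    """Toy backend: one sentence per non-empty line, whitespace tokens."""

    def process(self, text: str) -> list[list[str]]:
        return [line.split() for line in text.splitlines() if line.strip()]


pp = LineSplitPreprocessor()
# The inherited process_documents yields one document at a time, in order.
docs = list(pp.process_documents(["a b\nc", "d"]))
```

Subclassing the protocol explicitly is optional. It is done here only so the toy backend inherits the serial default for `process_documents`.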

`process(text: str) -> list[list[str]]` ¶

Parse one document.

Parameters:

Name	Type	Description	Default
`text`	`str`	A raw document string. May contain newlines.	required

Returns:

Type	Description
`list[list[str]]`	A list of sentences, each a list of tokens.

`process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]` ¶

Parse a stream of documents, possibly concurrently.

The default implementation loops serially. Backends that benefit from concurrency (CoreNLP with a JVM thread pool, spaCy with nlp.pipe(n_process=N)) override this to unlock real throughput.

Parameters:

Name	Type	Description	Default
`texts`	`Iterable[str]`	Iterable of raw document strings.	required

Yields:

Type	Description
`list[list[str]]`	One preprocessed document at a time, in input order.

build_preprocessor¶

Instantiate the preprocessor named in config.preprocessor.

Parameters:

Name	Type	Description	Default
`config`	`'Config'`	Pipeline config.	required

Returns:

Type	Description
`Preprocessor`	A preprocessor instance.

Raises:

Type	Description
`ImportError`	If the chosen backend's optional extra is not installed.
`ValueError`	If `config.preprocessor` is not a known name.
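One plausible way to get the two failure modes listed above is a name-to-factory registry with lazy imports. The sketch below is an assumption about the shape of such a dispatcher, not the package's code: `SketchConfig`, `WhitespaceBackend`, `build_preprocessor_sketch`, the backend names, and the module name `hypothetical_optional_backend` are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from importlib import import_module
from typing import Callable


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Config; only the field used here."""

    preprocessor: str = "noop"


class WhitespaceBackend:
    """Toy backend with no optional dependencies."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config

    def process(self, text: str) -> list[list[str]]:
        return [text.lower().split()]


def _make_optional_backend(config: SketchConfig) -> WhitespaceBackend:
    # Import lazily so a missing extra only fails when this backend is chosen.
    # The module name is made up, so this line always raises ImportError here.
    import_module("hypothetical_optional_backend")
    return WhitespaceBackend(config)


# Backend names are illustrative, not the package's real registry.
_BACKENDS: dict[str, Callable[[SketchConfig], WhitespaceBackend]] = {
    "noop": WhitespaceBackend,
    "optional": _make_optional_backend,
}


def build_preprocessor_sketch(config: SketchConfig) -> WhitespaceBackend:
    try:
        factory = _BACKENDS[config.preprocessor]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(
            f"unknown preprocessor {config.preprocessor!r}; known: {known}"
        ) from None
    return factory(config)


pp = build_preprocessor_sketch(SketchConfig(preprocessor="noop"))
tokens = pp.process("Hello World")
```

Deferring the import to the factory means users who never select an optional backend never need its extra installed.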

NoOpPreprocessor¶

Splits sentences on sentence-ending punctuation, tokenizes on whitespace, and lowercases. No parse, no NER. Fastest path. Useful for tests and for users who only want the gensim Phrases + Word2Vec half of the pipeline.

Trivial preprocessor: split on sentence-ending punctuation, lowercase.

Fastest possible path. Useful for quick iteration, for tests, and for users who only want the gensim Phrases + Word2Vec half of the pipeline with a curated static MWE list applied afterwards.

`process(text: str) -> list[list[str]]` ¶

Split a document into whitespace-tokenized, lowercased sentences.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw document.	required

Returns:

Type	Description
`list[list[str]]`	List of sentences, each a list of tokens.
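The behaviour described above can be approximated in a few lines. This is a rough sketch under assumptions: the exact sentence-boundary rule is not documented here, so the regular expression below (runs of `.`, `!`, or `?`) and the function name `noop_process` are illustrative choices.

```python
from __future__ import annotations

import re

# Assumed sentence boundary: one or more of . ! ?
_SENTENCE_END = re.compile(r"[.!?]+")


def noop_process(text: str) -> list[list[str]]:
    """Split into sentences, lowercase, and tokenize on whitespace."""
    sentences = []
    for chunk in _SENTENCE_END.split(text):
        tokens = chunk.lower().split()
        if tokens:  # skip empty chunks, e.g. after a trailing period
            sentences.append(tokens)
    return sentences


result = noop_process("Revenue grew. Margins FELL!\nOutlook stable")
# result == [['revenue', 'grew'], ['margins', 'fell'], ['outlook', 'stable']]
```

A rule this naive also splits on the period in a decimal such as `3.5`, which is acceptable for tests and quick iteration but is one reason to prefer a parser-based backend for real corpora.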

StaticMWEPreprocessor¶

No parser, no NER. Splits the text into sentences and applies NLTK MWETokenizer with a curated MWE list (the packaged "finance" list or a user-supplied path). Deterministic, Java-free, zero-ML.

Whitespace tokenize + NLTK MWE concatenation.

Attributes:

Name	Type	Description
`mwe_list`		Loaded MWE tuples used for tokenization.
`lowercase`		Whether to lowercase tokens before matching.

`__init__(mwe_source: str | Path = 'finance', lowercase: bool = True) -> None` ¶

Initialize.

Parameters:

Name	Type	Description	Default
`mwe_source`	`str \| Path`	`"finance"` for the packaged list, or a path.	`'finance'`
`lowercase`	`bool`	Whether to lowercase tokens before matching.	`True`

`process(text: str) -> list[list[str]]` ¶

Tokenize and apply the MWE list.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw document.	required

Returns:

Type	Description
`list[list[str]]`	List of sentences, each a list of tokens (MWEs joined by `_`).

StanzaPreprocessor¶

Neural stanza pipeline (tokenize, pos, lemma, depparse, ner) without Java. Slowest of the three parser-based backends on CPU. Produces the largest vocabulary because stanza's NER model distinguishes more entity types than CoreNLP's.

stanza.Pipeline-based preprocessor.

Attributes:

Name	Type	Description
`nlp`		Loaded stanza Pipeline.

`__init__(config: 'Config') -> None` ¶

Load the stanza pipeline.

Parameters:

Name	Type	Description	Default
`config`	`'Config'`	Pipeline config.	required

Raises:

Type	Description
`ImportError`	If stanza is not installed.

`process(text: str) -> list[list[str]]` ¶

Parse one document.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw document.	required

Returns:

Type	Description
`list[list[str]]`	List of sentences, each a list of tokens.

CoreNLPPreprocessor¶

Paper-exact reproduction path. Holds a warm Stanford CoreNLP JVM open via stanza.server.CoreNLPClient and fans requests across its thread pool. Use as a context manager to guarantee server shutdown. Requires Java 8+ and the [corenlp] extra.

CoreNLP-server-based preprocessor.

The client is started when the first document is processed and kept warm until close() is called or the instance is garbage collected. Use as a context manager (with CoreNLPPreprocessor(cfg) as pp:) to guarantee clean shutdown.
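The lifecycle described above (lazy start, warm reuse, guaranteed shutdown) follows the usual context-manager pattern. The sketch below shows only that pattern with a stand-in object instead of a real server: `WarmClientSketch` and its attributes are hypothetical, and nothing here talks to CoreNLP.

```python
from __future__ import annotations


class WarmClientSketch:
    """Hypothetical resource holder: starts lazily, closes on exit."""

    def __init__(self) -> None:
        self._client = None  # nothing is started until first use
        self.starts = 0

    def _ensure_started(self) -> None:
        if self._client is None:
            self._client = object()  # stand-in for launching a server
            self.starts += 1

    def process(self, text: str) -> list[list[str]]:
        self._ensure_started()
        return [text.split()]  # stand-in for a real annotate call

    def close(self) -> None:
        self._client = None  # stand-in for stopping the server

    @property
    def running(self) -> bool:
        return self._client is not None

    def __enter__(self) -> "WarmClientSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not swallow exceptions from the with-body


with WarmClientSketch() as pp:
    pp.process("first document")
    pp.process("second document")
    was_running = pp.running
    start_count = pp.starts  # the client was started once and reused
```

Because `__exit__` always calls `close()`, the resource is released even when the body of the `with` block raises, which garbage collection alone does not guarantee promptly.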

`__init__(config: 'Config') -> None` ¶

Stand up the CoreNLP client.

Parameters:

Name	Type	Description	Default
`config`	`'Config'`	Pipeline config.	required

Raises:

Type	Description
`ImportError`	If the corenlp extra is not installed.
`RuntimeError`	If Java is not on PATH.

`close() -> None` ¶

Shut down the CoreNLP server.

`process(text: str) -> list[list[str]]` ¶

Parse one document.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw document.	required

Returns:

Type	Description
`list[list[str]]`	List of sentences, each a list of tokens.

`process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]` ¶

Fan out annotate requests across the JVM thread pool.

The CoreNLP server has a thread pool of size config.n_cores. Sending requests serially leaves those threads idle. This override submits n_cores requests in flight via a Python ThreadPoolExecutor; the JVM processes them concurrently.

Parameters:

Name	Type	Description	Default
`texts`	`Iterable[str]`	Iterable of raw documents.	required

Yields:

Type	Description
`list[list[str]]`	Preprocessed documents in input order.
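The fan-out idea can be sketched with the standard library alone. This is an illustration of the pattern, not the package's override: `fan_out` and `slow_upper` are hypothetical names, and `slow_upper` stands in for a blocking request to the server.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def fan_out(
    process: Callable[[str], list[list[str]]],
    texts: Iterable[str],
    n_cores: int,
) -> Iterator[list[list[str]]]:
    """Run process() on up to n_cores documents at once, in input order."""
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        yield from pool.map(process, texts)


def slow_upper(text: str) -> list[list[str]]:
    # Stand-in for a blocking request; the first document is the slowest.
    time.sleep(0.05 if text == "a" else 0.0)
    return [[text.upper()]]


results = list(fan_out(slow_upper, ["a", "b", "c"], n_cores=3))
```

One caveat of this simple version: `Executor.map` consumes the whole input iterable up front, so for very large streams a bounded submission window would be needed to keep memory flat.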

SpacyPreprocessor¶

Parses, lemmatizes, masks entities, and joins fixed / flat / compound dependency pairs. On the 150-document bakeoff it was 9x faster than stanza and 2x faster than CoreNLP, with the cleanest NER output and the smallest Word2Vec-ready vocabulary. Use when Java is not available and parse speed matters; "corenlp" is the package default for paper-faithful results.

spaCy-based preprocessor.

Attributes:

Name	Type	Description
`nlp`		Loaded spaCy `Language`.
`model_name`		Name of the loaded model.

`__init__(config: 'Config') -> None` ¶

Load the spaCy model named in config.spacy_model.

Parameters:

Name	Type	Description	Default
`config`	`'Config'`	Pipeline config.	required

Raises:

Type	Description
`ImportError`	If spaCy is not installed.
`OSError`	If the requested model is not downloaded.

`process(text: str) -> list[list[str]]` ¶

Parse one document.

Parameters:

Name	Type	Description	Default
`text`	`str`	Raw document.	required

Returns:

Type	Description
`list[list[str]]`	List of sentences, each a list of tokens.

`process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]` ¶

Fan out via nlp.pipe(n_process=n_cores).

spaCy's native pipe is the right way to process many documents: it batches on the C side and, with n_process>1, uses Python multiprocessing. We apply the thread-oversubscription fix (torch.set_num_threads(1)) in __init__ so PyTorch under each worker does not fight for cores.

Parameters:

Name	Type	Description	Default
`texts`	`Iterable[str]`	Iterable of raw documents.	required

Yields:

Type	Description
`list[list[str]]`	Preprocessed documents in input order.

load_mwe_list¶

Loads a curated MWE list, either the packaged "finance" list or a newline-delimited file at a user-supplied path.

Load a curated MWE list from the packaged data/ or a file path.

Parameters:

Name	Type	Description	Default
`source`	`str \| Path`	Either `"finance"` for the packaged finance list, or a filesystem path to a newline-delimited list.	required

Returns:

Type	Description
`list[tuple[str, ...]]`	List of tuples, each a token sequence for NLTK `MWETokenizer`.

Raises:

Type	Description
`FileNotFoundError`	If the source cannot be resolved.
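Reading a newline-delimited list into token tuples might look like the sketch below. The file format details are assumptions: one expression per line, tokens separated by whitespace, blank lines ignored. `parse_mwe_lines` and `load_mwe_file` are hypothetical helper names, and resolution of the packaged `"finance"` list is not shown.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Iterable


def parse_mwe_lines(lines: Iterable[str]) -> list[tuple[str, ...]]:
    """Turn lines such as 'free cash flow' into token tuples."""
    entries = []
    for line in lines:
        tokens = tuple(line.split())
        if tokens:  # ignore blank lines
            entries.append(tokens)
    return entries


def load_mwe_file(path: str | Path) -> list[tuple[str, ...]]:
    # Path.read_text raises FileNotFoundError if the file does not exist.
    text = Path(path).read_text(encoding="utf-8")
    return parse_mwe_lines(text.splitlines())


with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "mwes.txt"
    list_path.write_text(
        "customer commitment\n\nfree cash flow\n", encoding="utf-8"
    )
    mwes = load_mwe_file(list_path)
```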

apply_mwe_list¶

Applies NLTK MWETokenizer to each sentence as a post-pass. Splits around [NER:*] placeholders so MWE patterns do not match across entity boundaries.

Apply NLTK MWETokenizer to each sentence as a post-pass.

Designed to run AFTER the main preprocessor, so MWE patterns the parser missed (for example, customer_commitment) still get joined.

Parameters:

Name	Type	Description	Default
`sentences`	`Iterable[list[str]]`	Sentences of tokens.	required
`mwe_list`	`list[tuple[str, ...]] \| None`	Loaded MWE tuples. If `None` or empty, return input unchanged.	required

Returns:

Type	Description
`list[list[str]]`	Sentences with matching MWEs joined by `_`.
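The post-pass can be illustrated without NLTK. The sketch below swaps in a plain greedy longest-match merger as a stand-in for `MWETokenizer`, so its tie-breaking may differ from NLTK's. `merge_mwes` and `apply_mwes_sketch` are hypothetical names, and treating any token that starts with `[NER:` as a placeholder is an assumption.

```python
from __future__ import annotations

from typing import Iterable


def merge_mwes(tokens: list[str], mwes: list[tuple[str, ...]]) -> list[str]:
    """Greedy left-to-right merge, trying longer expressions first."""
    # Entries shorter than two tokens cannot form a multi-word expression.
    by_len = sorted({m for m in mwes if len(m) >= 2}, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for mwe in by_len:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                out.append("_".join(mwe))
                i += len(mwe)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_mwes_sketch(
    sentences: Iterable[list[str]],
    mwes: list[tuple[str, ...]] | None,
) -> list[list[str]]:
    if not mwes:
        return [list(sentence) for sentence in sentences]
    result = []
    for sentence in sentences:
        merged, run = [], []
        for token in sentence:
            if token.startswith("[NER:"):
                # Flush the run before the placeholder so no match spans it.
                merged.extend(merge_mwes(run, mwes))
                merged.append(token)
                run = []
            else:
                run.append(token)
        merged.extend(merge_mwes(run, mwes))
        result.append(merged)
    return result


joined = apply_mwes_sketch(
    [
        ["strong", "customer", "commitment", "from", "[NER:ORG]"],
        ["customer", "[NER:ORG]", "commitment"],
    ],
    [("customer", "commitment")],
)
```

In the second sentence the placeholder sits between the two words, so they are processed as separate runs and left unjoined.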
