Preprocessors
Phase 1 is pluggable. Each backend turns a raw document into a list of sentences,
where each sentence is a list of lemmatized tokens with multi-word expressions
joined by _ and named-entity spans replaced with [NER:TYPE] placeholders.
The Pipeline.parse stage selects a backend through Config.preprocessor and
instantiates it via build_preprocessor. An optional
curated MWE list (Config.mwe_list) is applied as a second pass via
apply_mwe_list.
Design background and benchmark data live in Two-phase preprocessing.
The Preprocessor protocol
Any backend that implements process(text) -> list[list[str]] is a valid
preprocessor. Backends that benefit from concurrency can also override
process_documents.
Bases: Protocol
Phase-1 preprocessor contract.
Implementations turn a raw document into a list of preprocessed sentences.
Each sentence is a list of tokens. Tokens may contain underscores for
multi-word expressions and the literal string [NER:TYPE] as an entity
placeholder. The pipeline's cleaner drops punctuation-only and stopword
tokens downstream; the preprocessor does not have to.
Implementations supply process (one doc) and optionally override
process_documents (batch) for concurrent processing. The default
batch implementation just loops over process.
process(text: str) -> list[list[str]]
Parse one document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | A raw document string. May contain newlines. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | A list of sentences, each a list of tokens. |
process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]
Parse a stream of documents, possibly concurrently.
The default implementation loops serially. Backends that benefit
from concurrency (CoreNLP with a JVM thread pool, spaCy with
nlp.pipe(n_process=N)) override this to unlock real throughput.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | Iterable[str] | Iterable of raw document strings. | required |
Yields:
| Type | Description |
|---|---|
| list[list[str]] | One preprocessed document at a time, in input order. |
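As a rough illustration of the contract above, the sketch below defines a stand-in protocol with a serial default for process_documents, plus a toy backend that only supplies process. `PreprocessorSketch` and `PeriodSplitBackend` are hypothetical names invented for this example; the package's own Preprocessor class may differ in detail.

```python
from typing import Iterable, Iterator, Protocol


class PreprocessorSketch(Protocol):
    """Hypothetical stand-in for the Phase-1 contract described above."""

    def process(self, text: str) -> list[list[str]]:
        ...

    def process_documents(self, texts: Iterable[str]) -> Iterator[list[list[str]]]:
        # Default batch behaviour: loop serially over process().
        for text in texts:
            yield self.process(text)


class PeriodSplitBackend(PreprocessorSketch):
    """Toy backend: split on '.', whitespace-tokenize, lowercase."""

    def process(self, text: str) -> list[list[str]]:
        sentences = [s.split() for s in text.lower().split(".")]
        return [s for s in sentences if s]  # drop empty sentences


backend = PeriodSplitBackend()
docs = list(backend.process_documents(["Rates rose. Banks fell.", "Hello"]))
```

The toy backend inherits the serial batch loop unchanged; a backend with real concurrency would override process_documents instead.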
build_preprocessor
Instantiate the preprocessor named in config.preprocessor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | 'Config' | Pipeline config. | required |
Returns:
| Type | Description |
|---|---|
| Preprocessor | A preprocessor instance. |
Raises:
| Type | Description |
|---|---|
| ImportError | If the chosen backend's optional extra is not installed. |
| ValueError | If |
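One common way to implement name-based selection like this is a small registry lookup. The sketch below is only an assumption about the general shape: `ConfigSketch`, `NoOpSketch`, `REGISTRY`, the `"noop"` key, and the choice to raise ValueError for an unrecognized name are all hypothetical and are not taken from the package.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for Config; only the field used here."""

    preprocessor: str = "noop"


class NoOpSketch:
    """Hypothetical backend that just stores the config."""

    def __init__(self, config: ConfigSketch) -> None:
        self.config = config


# Hypothetical registry; the real backend names may differ.
REGISTRY = {"noop": NoOpSketch}


def build_preprocessor_sketch(config: ConfigSketch):
    """Look up the backend class by name and instantiate it."""
    try:
        backend_cls = REGISTRY[config.preprocessor]
    except KeyError:
        # Assumed behaviour for an unrecognized name.
        raise ValueError(f"unknown preprocessor: {config.preprocessor!r}") from None
    return backend_cls(config)


built = build_preprocessor_sketch(ConfigSketch(preprocessor="noop"))
```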
NoOpPreprocessor
Trivial preprocessor: splits on sentence-ending punctuation, whitespace-tokenizes,
and lowercases, with no parse and no NER. Fastest possible path. Useful for quick
iteration, for tests, and for users who only want the gensim Phrases + Word2Vec
half of the pipeline with a curated static MWE list applied afterwards.
process(text: str) -> list[list[str]]
Split a document into whitespace-tokenized, lowercased sentences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Raw document. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | List of sentences, each a list of tokens. |
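The behaviour described above can be approximated in a few lines. This is a minimal sketch, not the package's code: `noop_process_sketch` is a hypothetical name, and the sentence-boundary rule (a '.', '!' or '?' followed by whitespace) is an assumption.

```python
import re

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def noop_process_sketch(text: str) -> list[list[str]]:
    """Split into sentences, lowercase, whitespace-tokenize."""
    sentences = _SENTENCE_END.split(text.strip())
    return [s.lower().split() for s in sentences if s]


result = noop_process_sketch("Revenue grew 5%. Costs\nfell sharply!")
```

Punctuation stays attached to its token in this sketch, which fits the contract: the pipeline's cleaner, not the preprocessor, is responsible for dropping punctuation downstream.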
StaticMWEPreprocessor
No parser, no NER. Splits the text into sentences, whitespace-tokenizes them, and
applies NLTK MWETokenizer with a curated MWE list (the packaged "finance" list or
a user-supplied path). Deterministic, Java-free, zero-ML.
Attributes:
| Name | Type | Description |
|---|---|---|
| mwe_list | | Loaded MWE tuples used for tokenization. |
| lowercase | | Whether to lowercase tokens before matching. |
__init__(mwe_source: str | Path = 'finance', lowercase: bool = True) -> None
Initialize.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mwe_source | str \| Path | | 'finance' |
| lowercase | bool | Whether to lowercase tokens before matching. | True |
process(text: str) -> list[list[str]]
Tokenize and apply the MWE list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Raw document. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | List of sentences, each a list of tokens (MWEs joined by _). |
StanzaPreprocessor
stanza.Pipeline-based preprocessor. Runs the neural stanza pipeline (tokenize, pos, lemma, depparse, ner) without Java. Slowest of the three parser-based backends on CPU. Produces the largest vocabulary because stanza's NER model uses finer-grained entity types than CoreNLP's.
Attributes:
| Name | Type | Description |
|---|---|---|
| nlp | | Loaded stanza Pipeline. |
__init__(config: 'Config') -> None
Load the stanza pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | 'Config' | Pipeline config. | required |
Raises:
| Type | Description |
|---|---|
| ImportError | If stanza is not installed. |
process(text: str) -> list[list[str]]
Parse one document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Raw document. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | List of sentences, each a list of tokens. |
CoreNLPPreprocessor
CoreNLP-server-based preprocessor and the paper-exact reproduction path. Holds a
warm Stanford CoreNLP JVM open via stanza.server.CoreNLPClient and fans requests
across its thread pool. Requires Java 8+ and the [corenlp] extra.
The client is started when the first document is processed and kept
warm until close() is called or the instance is garbage collected.
Use as a context manager (with CoreNLPPreprocessor(cfg) as pp:) to
guarantee clean shutdown.
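The lazy-start and guaranteed-shutdown pattern described here can be sketched with a plain class. `WarmClientSketch` is a hypothetical stand-in that starts no real server; it only shows why the with form is safer than relying on garbage collection.

```python
class WarmClientSketch:
    """Hypothetical stand-in showing lazy start plus guaranteed shutdown."""

    def __init__(self) -> None:
        self.started = False
        self.closed = False

    def _ensure_started(self) -> None:
        # Start the (pretend) server on first use only.
        if not self.started:
            self.started = True

    def process(self, text: str) -> list[list[str]]:
        self._ensure_started()
        return [text.lower().split()]

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "WarmClientSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # runs even if the body raised


with WarmClientSketch() as pp:
    started_before_use = pp.started
    out = pp.process("Hello World")
```

Because __exit__ always calls close(), shutdown happens at a known point even when processing raises partway through a corpus.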
__init__(config: 'Config') -> None
Stand up the CoreNLP client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | 'Config' | Pipeline config. | required |
Raises:
| Type | Description |
|---|---|
| ImportError | If the corenlp extra is not installed. |
| RuntimeError | If Java is not on PATH. |
close() -> None
Shut down the CoreNLP server.
process(text: str) -> list[list[str]]
Parse one document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Raw document. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | List of sentences, each a list of tokens. |
process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]
Fan out annotate requests across the JVM thread pool.
The CoreNLP server has a thread pool of size config.n_cores.
Sending requests serially leaves those threads idle. This override
keeps n_cores requests in flight via a Python
ThreadPoolExecutor, so the JVM processes them concurrently.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | Iterable[str] | Iterable of raw documents. | required |
Yields:
| Type | Description |
|---|---|
| list[list[str]] | Preprocessed documents in input order. |
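A bounded, order-preserving fan-out of this kind might look roughly like the sketch below. It is not the package's implementation: `fan_out_in_order` and `fake_annotate` are hypothetical names, and `fake_annotate` stands in for a real CoreNLP server call.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def fan_out_in_order(
    texts: Iterable[str],
    annotate: Callable[[str], list[list[str]]],
    n_cores: int,
) -> Iterator[list[list[str]]]:
    """Keep up to n_cores requests in flight; yield results in input order."""
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        pending = deque()
        for text in texts:
            pending.append(pool.submit(annotate, text))
            if len(pending) >= n_cores:
                # Block on the oldest request so output order matches input.
                yield pending.popleft().result()
        # Drain whatever is still in flight once the input is exhausted.
        while pending:
            yield pending.popleft().result()


def fake_annotate(text: str) -> list[list[str]]:
    # Stand-in for a CoreNLP server call.
    return [text.lower().split()]


results = list(fan_out_in_order(["A b", "C d", "E f"], fake_annotate, n_cores=2))
```

Holding futures in a queue and always waiting on the oldest keeps memory bounded for long streams while still returning documents in input order.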
SpacyPreprocessor
spaCy-based preprocessor. Parses, lemmatizes, masks entities, and joins fixed / flat / compound
dependency pairs. On the 150-document bakeoff it was 9x faster than stanza and
2x faster than CoreNLP, with the cleanest NER output and the smallest
Word2Vec-ready vocabulary. Use it when Java is not available and parse speed
matters; "corenlp" is the package default for paper-faithful results.
Attributes:
| Name | Type | Description |
|---|---|---|
| nlp | | Loaded spaCy |
| model_name | | Name of the loaded model. |
__init__(config: 'Config') -> None
Load the spaCy model named in config.spacy_model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | 'Config' | Pipeline config. | required |
Raises:
| Type | Description |
|---|---|
| ImportError | If spaCy is not installed. |
| OSError | If the requested model is not downloaded. |
process(text: str) -> list[list[str]]
Parse one document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Raw document. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | List of sentences, each a list of tokens. |
process_documents(texts: Iterable[str]) -> Iterator[list[list[str]]]
Fan out via nlp.pipe(n_process=n_cores).
spaCy's native pipe is the right way to process many documents:
it batches on the C side and, with n_process>1, uses Python
multiprocessing. We apply the thread-oversubscription fix
(torch.set_num_threads(1)) in __init__ so PyTorch under
each worker does not fight for cores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | Iterable[str] | Iterable of raw documents. | required |
Yields:
| Type | Description |
|---|---|
| list[list[str]] | Preprocessed documents in input order. |
load_mwe_list
Load a curated MWE list: either the packaged "finance" list from data/ or a
newline-delimited file at a user-supplied path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| Path | Either "finance" or a path to a newline-delimited file. | required |
Returns:
| Type | Description |
|---|---|
| list[tuple[str, ...]] | List of tuples, each a token sequence for NLTK MWETokenizer. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the source cannot be resolved. |
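For the user-supplied path case, reading a newline-delimited file into token tuples could be sketched as follows. `load_mwe_file_sketch` is a hypothetical name, and the assumption that tokens within a line are separated by whitespace is mine; the packaged list's actual on-disk format is not shown in this page.

```python
import tempfile
from pathlib import Path


def load_mwe_file_sketch(path: Path) -> list[tuple[str, ...]]:
    """Read one MWE per line; split each on whitespace into a token tuple."""
    if not path.is_file():
        raise FileNotFoundError(path)
    mwes = []
    for line in path.read_text(encoding="utf-8").splitlines():
        tokens = tuple(line.split())
        if tokens:  # skip blank lines
            mwes.append(tokens)
    return mwes


# Write a small throwaway list and load it back.
with tempfile.TemporaryDirectory() as tmp:
    mwe_path = Path(tmp) / "mwes.txt"
    mwe_path.write_text("customer commitment\n\nbalance sheet\n", encoding="utf-8")
    loaded = load_mwe_file_sketch(mwe_path)
```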
apply_mwe_list
Apply NLTK MWETokenizer to each sentence as a post-pass. Splits around
[NER:*] placeholders so MWE patterns do not match across entity boundaries.
Designed to run AFTER the main preprocessor, so MWE patterns the parser
missed (for example, customer_commitment) still get joined.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sentences | Iterable[list[str]] | Sentences of tokens. | required |
| mwe_list | list[tuple[str, ...]] \| None | Loaded MWE tuples. If | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | Sentences with matching MWEs joined by _. |
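The split-around-placeholders idea can be illustrated without any dependencies. In the sketch below a plain greedy longest-match loop stands in for NLTK's MWETokenizer, so it is not the package's code; `apply_mwe_sketch` and `_merge_segment` are hypothetical names.

```python
from typing import Iterable


def _merge_segment(tokens: list[str], patterns: list[tuple[str, ...]]) -> list[str]:
    """Greedily join the first (longest) pattern that matches at each position."""
    out, i = [], 0
    while i < len(tokens):
        match = next(
            (p for p in patterns if tuple(tokens[i:i + len(p)]) == p), None
        )
        if match:
            out.append("_".join(match))
            i += len(match)
        else:
            out.append(tokens[i])
            i += 1
    return out


def apply_mwe_sketch(
    sentences: Iterable[list[str]],
    mwe_list: list[tuple[str, ...]],
) -> list[list[str]]:
    """Join listed MWEs with '_', treating [NER:*] tokens as hard boundaries."""
    # Longest patterns first so the longest match wins; ignore empty tuples.
    patterns = sorted((p for p in mwe_list if p), key=len, reverse=True)
    result = []
    for sentence in sentences:
        out, segment = [], []
        for token in sentence:
            if token.startswith("[NER:"):
                # Flush the segment before the placeholder, then keep it as-is.
                out.extend(_merge_segment(segment, patterns))
                out.append(token)
                segment = []
            else:
                segment.append(token)
        out.extend(_merge_segment(segment, patterns))
        result.append(out)
    return result


joined = apply_mwe_sketch(
    [
        ["strong", "customer", "commitment", "at", "[NER:ORG]"],
        ["customer", "[NER:ORG]", "commitment"],
    ],
    [("customer", "commitment")],
)
```

In the second sentence the placeholder sits between the two words, so they land in separate segments and are left unjoined.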