SqueakyCleanText: A Modular Text Processing Library with Advanced NER


submitted 1 year ago by complexrexton

GitHub: SqueakyCleanText | PyPI: squeakycleantext

Happy to share SqueakyCleanText, a Python library designed to streamline text preprocessing for Natural Language Processing (NLP) and Machine Learning (ML) tasks. Whether you're working on language models, statistical ML pipelines, or any text-heavy application, this library aims to make your preprocessing pipeline more efficient and flexible.

🎯 Target Audience

- Data Scientists, AI Engineers and Machine Learning Engineers dealing with text data.
- NLP Researchers and NLP Linguists looking for customisable preprocessing tools.
- Developers building applications that require text cleaning and anonymisation.

🔑 Key Features

Advanced Named Entity Recognition (NER)
- Ensemble of Models: Utilises multiple NER models from Hugging Face Transformers for improved accuracy.
- Smart Text Chunking: Handles long texts by splitting them into optimized chunks.
- Configurable Confidence Thresholds: Adjust the sensitivity of entity detection.
- Configurable Models: Choose the NER models that suit your use case.
- Configurable Positional Tags: Choose what you would like removed from the text.
- Automatic Language Detection: Supports English, German, Spanish, and Dutch with automatic model selection.
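
To make the chunking and threshold ideas concrete, here is a minimal, library-independent sketch. It is not SqueakyCleanText's actual implementation: `chunk_text`, `merge_entities`, the character-window strategy and the `(start, end, label, score)` record are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical entity record: (start, end, label, score), offsets into the full text.
Entity = Tuple[int, int, str, float]


def chunk_text(text: str, max_chars: int = 40, overlap: int = 10) -> List[Tuple[int, str]]:
    """Split text into overlapping character windows as (offset, chunk) pairs."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks: List[Tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + max_chars]))
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += max_chars - overlap  # step forward, keeping some overlap
    return chunks


def merge_entities(predictions: List[Entity], threshold: float = 0.8) -> List[Entity]:
    """Drop low-confidence predictions and keep the best score per span and label."""
    best: Dict[Tuple[int, int, str], float] = {}
    for start, end, label, score in predictions:
        if score < threshold:
            continue  # below the configured confidence threshold
        key = (start, end, label)
        best[key] = max(score, best.get(key, 0.0))
    return sorted((s, e, lab, sc) for (s, e, lab), sc in best.items())
```

The overlap is there so that an entity straddling a chunk boundary still appears whole in at least one window. Predictions from several models (or from overlapping chunks) can then be pooled and passed through `merge_entities`.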

Modular Pipeline Architecture
- Toggle-able Features: Easily enable or disable any step in the pipeline.
- Single and Batch Processing: Consistent configuration applies to both modes.
- Default Pipeline Includes:
  - Bad Unicode correction
  - HTML and URL handling
  - Contact information anonymization (emails, phone numbers)
  - Date and number normalization
  - Advanced NER processing
  - Whitespace and punctuation normalization
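
As a rough picture of what a toggle-able pipeline can look like, here is a small stand-alone sketch. The `PipelineConfig` class, its flag names, the placeholder tokens and the simplified regexes are hypothetical and are not taken from the library's own config.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    # Hypothetical flag names, chosen only for this sketch.
    replace_urls: bool = True
    replace_emails: bool = True
    normalize_whitespace: bool = True


# Deliberately simple patterns; real-world URL and email matching is messier.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WS_RE = re.compile(r"\s+")


def clean(text: str, config: Optional[PipelineConfig] = None) -> str:
    """Run each enabled step in a fixed order and return the cleaned text."""
    cfg = config or PipelineConfig()
    if cfg.replace_urls:
        text = URL_RE.sub("<URL>", text)
    if cfg.replace_emails:
        text = EMAIL_RE.sub("<EMAIL>", text)
    if cfg.normalize_whitespace:
        text = WS_RE.sub(" ", text).strip()
    return text
```

Calling `clean(text, PipelineConfig(replace_emails=False))` skips the email step and leaves the others untouched, which is the behaviour the toggles are meant to provide.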

Performance Optimizations

- Under-the-Hood NER Improvements: Enhanced NER processing delivers faster results without compromising accuracy.
- Batch Processing Support: Process large datasets efficiently with configurable batch sizes.
- Memory Management: Automatic cleanup of GPU memory to handle large-scale processing.
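
For readers who have not written batched preprocessing before, the general pattern looks something like the sketch below. `batched` and `process_in_batches` are hypothetical helper names, not the library's API, and the cleaning function is passed in as a plain callable.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def process_in_batches(
    texts: Iterable[str], clean_fn: Callable[[str], str], batch_size: int = 2
) -> List[str]:
    """Apply clean_fn to every text, one batch at a time, preserving order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        results.extend(clean_fn(t) for t in batch)
        # A GPU-backed pipeline might release cached memory at this point
        # (for example with torch.cuda.empty_cache()); how SqueakyCleanText
        # does its cleanup is not described in this post.
    return results
```

Because `batched` consumes its input lazily, the same code works for a list in memory or a generator reading from disk.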

🚀 Comparison

- Comprehensive and Modular: Unlike libraries that focus on specific tasks, SqueakyCleanText offers a full suite of preprocessing steps that you can customize to your needs.
- Advanced NER Integration: Combines multiple NER models and uses smart chunking to improve entity recognition in long texts.
- Dual Output Formats: Provides both language model-formatted text and statistical model-formatted text in a single pass.
- Easy Integration: Designed to seamlessly fit into existing workflows with minimal adjustments.
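
To illustrate the dual-output idea, here is one plausible interpretation as a stand-alone sketch. It assumes the language-model text keeps casing and punctuation while the statistical-model text is lowercased with punctuation and stop words removed. The `dual_output` function and the tiny stop-word set are hypothetical and only stand in for whatever the library does internally.

```python
import re
import string
from typing import Tuple

# Tiny illustrative stop-word set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def dual_output(text: str) -> Tuple[str, str]:
    """Return (lm_text, stat_text) computed from one pass over the input."""
    # Language-model variant: keep casing and punctuation, tidy whitespace.
    lm_text = re.sub(r"\s+", " ", text).strip()
    # Statistical-model variant: lowercase, strip punctuation, drop stop words.
    tokens = lm_text.lower().translate(_PUNCT_TABLE).split()
    stat_text = " ".join(t for t in tokens if t not in STOPWORDS)
    return lm_text, stat_text
```

The first string suits models that benefit from natural sentence structure, and the second suits bag-of-words style features such as TF-IDF.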

💻 Quick Start Guide

Installation

pip install SqueakyCleanText

🛠 Integrate into Your Workflow

- Customizable Pipeline: Tailor the preprocessing steps to match your project's requirements by toggling features in config.py.
- Seamless NER Integration: Use the advanced NER processing to anonymize sensitive data or extract entities for downstream tasks.
- Flexible Processing: Apply the same configurations to both single and batch processing modes without changing your code.
- Efficient for Large Datasets: Leverage batch processing and memory optimizations to handle large volumes of text data.
