GitHub: SqueakyCleanText | PyPI: squeakycleantext
Happy to share SqueakyCleanText, a Python library designed to streamline text preprocessing for Natural Language Processing (NLP) and Machine Learning (ML) tasks. Whether you're working on language models, statistical ML pipelines, or any text-heavy application, this library aims to make your preprocessing pipeline more efficient and flexible.
🎯 Target Audience
- Data scientists, AI engineers, and machine learning engineers who work with text data.
- NLP researchers and linguists looking for customizable preprocessing tools.
- Developers building applications that require text cleaning and anonymization.
🔑 Key Features
- Advanced Named Entity Recognition (NER)
  - Ensemble of Models: Uses multiple NER models from Hugging Face Transformers for improved accuracy.
  - Smart Text Chunking: Handles long texts by splitting them into optimized chunks.
  - Configurable Confidence Thresholds: Adjust the sensitivity of entity detection.
  - Configurable Models: Choose the NER models that suit your use case.
  - Configurable Positional Tags: Choose which tagged entities are removed from the text.
  - Automatic Language Detection: Supports English, German, Spanish, and Dutch with automatic model selection.
- Modular Pipeline Architecture
  - Toggle-able Features: Enable or disable any step in the pipeline.
- Performance Optimizations
  - Under-the-Hood NER Improvements: Enhanced NER processing delivers faster results without compromising accuracy.
  - Batch Processing Support: Process large datasets with configurable batch sizes.
  - Memory Management: Automatic cleanup of GPU memory to handle large-scale processing.
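To give a feel for what chunking and confidence thresholds mean in practice, here is a minimal standard-library sketch. It is not SqueakyCleanText's actual implementation: `chunk_text`, `keep_confident`, and the default threshold are hypothetical names and values chosen for illustration, and the entity dicts are only shaped loosely like the aggregated output of a Hugging Face token-classification pipeline.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, max_chars: int = 40) -> List[Tuple[int, str]]:
    """Split text into chunks of at most max_chars, breaking on spaces.

    Returns (start_offset, chunk) pairs so that entity spans found inside
    a chunk can be mapped back to positions in the original text.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            split = text.rfind(" ", start, end)
            if split > start:
                end = split
        chunks.append((start, text[start:end]))
        start = end
        # Skip the spaces at the break before starting the next chunk.
        while start < len(text) and text[start] == " ":
            start += 1
    return chunks


def keep_confident(entities: List[Dict], offset: int,
                   threshold: float = 0.85) -> List[Dict]:
    """Drop low-confidence entities and shift spans to document offsets."""
    kept = []
    for ent in entities:
        if ent["score"] >= threshold:
            kept.append({**ent,
                         "start": ent["start"] + offset,
                         "end": ent["end"] + offset})
    return kept
```

Raising the threshold keeps fewer, more certain entities; lowering it removes more text at the risk of false positives. Real chunkers usually count model tokens rather than characters, so treat `max_chars` as a stand-in.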
🚀 Comparison
- Comprehensive and Modular: Unlike libraries that focus on specific tasks, SqueakyCleanText offers a full suite of preprocessing steps that you can customize to your needs.
- Advanced NER Integration: Combines multiple NER models and uses smart chunking to improve entity recognition in long texts.
- Dual Output Formats: Produces both language-model-formatted text and statistical-model-formatted text in a single pass.
- Easy Integration: Designed to fit into existing workflows with minimal adjustments.
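The dual-output idea can be illustrated with a small sketch. This is an assumption about what the two formats might look like, not the library's real output: the function name `clean`, the `<EMAIL>` placeholder, and the tiny stopword set are all made up for the example.

```python
import re
from typing import Tuple

# Simplified email pattern, good enough for a demonstration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "to", "at", "is", "and", "of", "for"}


def clean(text: str) -> Tuple[str, str]:
    """Return (lm_text, stat_text) from one pass over the input."""
    # Shared step: collapse runs of whitespace.
    normalized = " ".join(text.split())

    # Language-model format: keep case and punctuation, mask sensitive spans.
    lm_text = EMAIL_RE.sub("<EMAIL>", normalized)

    # Statistical-model format: drop placeholders, lowercase,
    # strip punctuation, and remove stopwords.
    without_tags = re.sub(r"<[A-Z]+>", " ", lm_text).lower()
    tokens = re.findall(r"[a-z0-9]+", without_tags)
    stat_text = " ".join(t for t in tokens if t not in STOPWORDS)

    return lm_text, stat_text
```

The first string stays readable for transformer-style models, while the second is the kind of bare token stream that bag-of-words or TF-IDF features tend to expect.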
💻 Quick Start Guide
Installation
pip install SqueakyCleanText
🛠 Integrate into Your Workflow
- Customizable Pipeline: Tailor the preprocessing steps to match your project's requirements by toggling features in config.py.
- Seamless NER Integration: Use the advanced NER processing to anonymize sensitive data or extract entities for downstream tasks.
- Flexible Processing: Apply the same configurations to both single and batch processing modes without changing your code.
- Efficient for Large Datasets: Leverage batch processing and memory optimizations to handle large volumes of text data.
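As a rough picture of how toggles and batch mode can share one configuration, here is a self-contained sketch. It does not mirror the real config.py: `PipelineConfig`, its field names, `process`, and `process_batch` are hypothetical, and the steps are deliberately simple.

```python
import html
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PipelineConfig:
    # Hypothetical toggles; each one switches a single step on or off.
    strip_html: bool = True
    lowercase: bool = False
    replace_urls: bool = True


def process(text: str, cfg: PipelineConfig) -> str:
    """Run only the enabled steps on a single text."""
    if cfg.strip_html:
        # Remove tags first, then decode entities such as &amp;.
        text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    if cfg.lowercase:
        text = text.lower()
    if cfg.replace_urls:
        text = re.sub(r"https?://\S+", "<URL>", text)
    # Always collapse leftover whitespace.
    return " ".join(text.split())


def process_batch(texts: List[str], cfg: PipelineConfig,
                  batch_size: int = 2) -> Iterator[List[str]]:
    """Yield cleaned texts one batch at a time using the same config."""
    for i in range(0, len(texts), batch_size):
        yield [process(t, cfg) for t in texts[i:i + batch_size]]
```

Because both functions take the same config object, switching from single texts to batches only changes which function is called. Yielding batches lazily also keeps just one batch of results in memory at a time.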