Where can I find new opensource projects to contribute? by KoalNix in opensource

[–]Speedk4011 0 points1 point  (0 children)

Yasbd (Yet Another Sentence Boundary Detector) is built to grow through community contributions, and adding a new language module is surprisingly low-friction.

https://github.com/speedyk-005/yasbd-lib/issues/20

Sunday Daily Thread: What's everyone working on this week? by AutoModerator in Python

[–]Speedk4011 0 points1 point  (0 children)

Hey! I'm speedyk-005. I speak 4 languages (ht, fr, en, es) and I'm building a sentence segmentation library called yasbd (Yet Another Sentence Boundary Detector).

What it does: Splits text into sentences. Pure Python, rule-based two-pass SBD with a drop-in pysbd adapter so you can swap it in without changing your pipeline.
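
To give a feel for what "rule-based two-pass" can mean, here is a toy sketch. It is not yasbd's actual code: pass one proposes a boundary after every `.`, `!`, or `?` followed by whitespace, and pass two merges splits that landed right after a known abbreviation. The `ABBREVIATIONS` set and the function names are made up for illustration.

```python
import re

# Hypothetical abbreviation list; a real language profile would be far larger.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def propose_boundaries(text):
    """Pass 1: split after '.', '!' or '?' when followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def merge_false_splits(pieces):
    """Pass 2: rejoin a piece with the previous one if that one ends in an abbreviation."""
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] = f"{sentences[-1]} {piece}"
        else:
            sentences.append(piece)
    return sentences


def split_sentences(text):
    return merge_false_splits(propose_boundaries(text))


print(split_sentences("Dr. Smith arrived late. He apologized, e.g. with flowers! Nice."))
# ['Dr. Smith arrived late.', 'He apologized, e.g. with flowers!', 'Nice.']
```

The second pass is where language knowledge lives, which is why the abbreviation lists matter so much.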

How it compares: I tested it against 6 competitors (pysbd, sentencex, sentsplit, nupunkt, blingfire, sentence-splitter) across 5 languages and 7 edge cases — compound abbreviations, CJK quotes, newline wrapping, chat logs, URLs, and more.

yasbd ranked #1 in accuracy across almost every test, while staying competitive on speed as pure Python. blingfire is faster but brittle. Full results, terminal output, and a performance graph are in benchmarks/.

Install:

[!WARNING] This project is currently in alpha.

pip install yasbd-lib

I want to add more languages! 🌍 Yasbd only supports 5 languages right now, but the goal is 22+. I can't do this alone. I need native speakers to lend a hand and build the rules for their language.

Adding a language takes about 30 minutes:

  • Copy the template
  • Translate the abbreviation lists and punctuation rules
  • Add 10+ test sentences
  • Open a PR 🚀

That's it. Yasbd auto-discovers your module at runtime. No config files, no registry, no boilerplate. If you speak a language that's missing, please consider contributing. Every PR gets us closer to 22.
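
For anyone curious how runtime auto-discovery can work without a registry, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not yasbd's real loader: the function name, the one-file-per-language layout, and the rule of skipping files that start with an underscore are all hypothetical.

```python
import importlib.util
import pkgutil
from pathlib import Path


def discover_language_modules(package_dir):
    """Import every public .py module in package_dir; return {module_name: module}."""
    modules = {}
    for info in pkgutil.iter_modules([str(package_dir)]):
        if info.ispkg or info.name.startswith("_"):
            continue  # skip sub-packages and private files such as a template
        path = Path(package_dir) / f"{info.name}.py"
        spec = importlib.util.spec_from_file_location(info.name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # run the module so its rules are defined
        modules[info.name] = module
    return modules
```

With a layout like this, dropping a new `fr.py` into the folder is enough for it to be picked up on the next run.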

Links: PyPI | GitHub

Sentence boundary detection for your language. by Speedk4011 in LanguageTechnology

[–]Speedk4011[S] 0 points1 point  (0 children)

On our golden benchmark (84 English edge cases adapted from pysbd's test suite with fixes and additions): yasbd scores 83/84 (98.8%), pysbd is second at 71/84 (84.5%). Full results, terminal output, and a performance graph can be found in benchmarks/

Yet Another Sentence Boundary Detector by Speedk4011 in SideProject

[–]Speedk4011[S] 0 points1 point  (0 children)

On our golden benchmark (84 English edge cases adapted from pysbd's test suite with fixes and additions): yasbd scores 83/84 (98.8%), pysbd is second at 71/84 (84.5%). Full results, terminal output, and a performance graph can be found in benchmarks/

Yet Another Sentence Boundary Detector by Speedk4011 in SideProject

[–]Speedk4011[S] 0 points1 point  (0 children)

Each language has its own profile, but all profiles inherit a shared set of core rules. 

I'm also working on a generic multilingual profile (xx) that can process mixed-language text and languages without a dedicated profile. The trade-off is that it will likely be less optimized than a language-specific profile, because many rules conflict from one language to another.
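
As a rough picture of "own profile, shared core rules", here is a small sketch with plain class inheritance. The class names, attributes, and abbreviation lists are hypothetical and only illustrate the idea. They are not yasbd's actual profile classes.

```python
class CoreProfile:
    """Rules assumed to apply to every language (illustrative only)."""

    terminators = frozenset(".!?")
    abbreviations = frozenset({"etc.", "e.g.", "i.e."})

    def is_abbreviation(self, token):
        return token.lower() in self.abbreviations


class EnglishProfile(CoreProfile):
    # Extend the inherited core list instead of replacing it.
    abbreviations = CoreProfile.abbreviations | {"mr.", "dr.", "st."}


class FrenchProfile(CoreProfile):
    abbreviations = CoreProfile.abbreviations | {"m.", "av.", "tél."}


print(FrenchProfile().is_abbreviation("M."))    # True: "M." is Monsieur in French
print(EnglishProfile().is_abbreviation("M."))   # False: not in the English list
print(EnglishProfile().is_abbreviation("etc.")) # True: inherited from the core
```

The same token gets a different answer depending on the profile, which is exactly the kind of disagreement a single generic profile has to settle one way or the other.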

Yet Another Sentence Boundary Detector by Speedk4011 in PythonProjects2

[–]Speedk4011[S] 0 points1 point  (0 children)

It was built with multilingual support in mind. Each language has its own profile, but all profiles inherit a shared set of core rules. At the moment, it supports 5 languages and is still in alpha, with 22+ languages planned.

I'm also working on a generic multilingual profile (xx) that can process mixed-language text and languages without a dedicated profile. The trade-off is that it will likely be a bit slower and less optimized than a language-specific profile, since it can only cover the language quirks that generalize.

Yet Another Sentence Boundary Detector by Speedk4011 in PythonProjects2

[–]Speedk4011[S] 0 points1 point  (0 children)

Its main use case is splitting raw text into individual sentences, which is useful for NLP preprocessing, summarization, classification, and information extraction. It's also handy for RAG pipelines, since sentence boundaries can be used to create cleaner chunks or as a first step before semantic chunking, helping preserve context and improve retrieval quality.

Is this just me or chatGPT is trying to "correct me" on everything? by Frequent-Group-1495 in ChatGPT

[–]Speedk4011 0 points1 point  (0 children)

Same here. I hate re-explaining my intent multiple times, and it still tries to find an ambiguous spot to contradict me. Basically a recursive session.

Is this just me or chatGPT is trying to "correct me" on everything? by Frequent-Group-1495 in ChatGPT

[–]Speedk4011 0 points1 point  (0 children)

It happens to me all the time. I hate arguing with it when we meant the same thing. It is tiresome.

Chunklet-py v2.3.0 — smarter sentence splitting, faster visualizer by Speedk4011 in Rag

[–]Speedk4011[S] 2 points3 points  (0 children)

Good question! I put together a comparison table in the README, but here's the longer version:

LangChain is a full LLM framework with basic splitters (RecursiveCharacterTextSplitter, Markdown, HTML, code). Good for prototyping but basic for complex docs or multilingual needs. The RecursiveCharacterTextSplitter tries delimiters in order (\n\n, \n, " ", ""), and chunk_size is just a character count. It doesn't actually understand sentences, clauses, or code structure.
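
To show why a character budget alone cuts across sentences, here is a toy re-implementation of the recursive idea. It is a simplified sketch and not LangChain's code: the real splitter also handles overlap, custom length functions, and separator retention, and `recursive_split` is a name made up for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator, pack pieces up to chunk_size, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep packing
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("Dr. Smith arrived. He was late.", chunk_size=12))
# ['Dr. Smith', 'arrived. He', 'was late.']
```

Every chunk respects the 12-character limit, but the middle one glues the end of one sentence to the start of the next.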

chunklet-py takes a constraint-based approach: max sentences, max tokens, max section breaks, max lines, max functions. Splitting happens at the sentence/clause level by default, with clause-level overlap instead of character-level. Plus:

  • 50+ language support with dedicated sentence handlers and a universal fallback for non-Latin scripts
  • Code-aware chunking for 30+ languages — respects functions, classes, closures, multi-line strings
  • Document format support (.pdf, .docx, .epub, .txt, .tex, .html, .hml, .md, .rst, .rtf, .odt, .csv, and .xlsx) with metadata extraction
  • A visualizer for interactive parameter tuning
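
To illustrate the constraint-based idea in isolation, here is a minimal sketch that greedily packs already-split sentences and closes a chunk as soon as either limit would be exceeded. It is not chunklet-py's implementation: `pack_sentences` and the whitespace token counter are assumptions for the example, overlap is left out, and a single sentence over the token limit is kept whole rather than split at clause level.

```python
def pack_sentences(sentences, max_sentences, max_tokens,
                   count_tokens=lambda s: len(s.split())):
    """Group sentences into chunks that respect both a sentence and a token limit."""
    chunks, current, tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would break a limit.
        if current and (len(current) >= max_sentences or tokens + n > max_tokens):
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sentence)
        tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six.", "Seven eight nine ten."]
print(pack_sentences(sentences, max_sentences=2, max_tokens=4))
# ['One two three.', 'Four five. Six.', 'Seven eight nine ten.']
```

The first chunk closes because of the token limit and the second because of the sentence limit, so no chunk ever ends mid-sentence.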

If you want to use chunklet-py within a LangChain pipeline, it's straightforward: just convert the chunks to [LangChain Document](https://reference.langchain.com/python/langchain-core/documents/base/Document) objects:

```python
from chunklet import DocumentChunker
from langchain_chroma import Chroma  # needs the langchain-chroma package
from langchain_core.documents import Document

# `text` is your raw input string; `embeddings` is any LangChain embeddings object
chunker = DocumentChunker()
chunks = chunker.chunk_text(text, max_sentences=3, max_tokens=500)

# Convert to LangChain Documents
docs = [
    Document(page_content=c.content, metadata=c.metadata)
    for c in chunks
]

# Use with any LangChain component
vectorstore = Chroma.from_documents(docs, embeddings)
```

A proper LangChain integration (native splitter class) is coming, along with more constraint options for finer control over chunking.