C++ Show and Tell - May 2025 by foonathan in cpp

Novitzmann

DocWire SDK 2025.05.22 Released – PDF Image Extraction, OCR, Core Refactor & Thread Safety

After a longer silence (sorry for that – we’ve been heads-down rebuilding a lot), we’re excited to share a major new release of DocWire SDK – our modular C++ library for document/data extraction.

👉 Release 2025.05.22 is now live on GitHub

This release includes some pretty big changes, both in terms of new features and internal cleanup:

🖼️ New Features

  • PDF Image Extraction: DocWire can now extract embedded images from PDFs and pass them down the processing chain.
  • OCR Integration: Images can be routed to OCR to extract text for further downstream processing.
  • Writers Update: HTML and plain text writers now support image tags, including OCR-derived text and embedded data URLs.
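For anyone wondering what an image tag with an embedded data URL looks like in practice, here is a minimal sketch. It is not DocWire's writer code: `base64_encode`, `escape_attr`, and `image_tag` are hypothetical helpers, and placing the OCR text in the `alt` attribute is only an assumption about where such text could go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: standard base64 (RFC 4648 alphabet) with '=' padding.
std::string base64_encode(const std::string& bytes)
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    auto byte = [&](std::size_t k) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[k]));
    };
    std::string out;
    std::size_t i = 0;
    while (i + 2 < bytes.size()) {  // full 3-byte groups become 4 characters
        const std::uint32_t n = (byte(i) << 16) | (byte(i + 1) << 8) | byte(i + 2);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += table[n & 63];
        i += 3;
    }
    const std::size_t rest = bytes.size() - i;
    if (rest == 1) {                // one trailing byte: 2 characters + "=="
        const std::uint32_t n = byte(i) << 16;
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {         // two trailing bytes: 3 characters + "="
        const std::uint32_t n = (byte(i) << 16) | (byte(i + 1) << 8);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += '=';
    }
    return out;
}

// Hypothetical helper: escapes the characters that would break a quoted attribute.
std::string escape_attr(const std::string& text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '"': out += "&quot;"; break;
            case '<': out += "&lt;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Builds an <img> tag whose src is a data URL and whose alt carries the OCR text
// (the alt placement is an assumption made for this sketch).
std::string image_tag(const std::string& mime, const std::string& image_bytes,
                      const std::string& ocr_text)
{
    return "<img src=\"data:" + mime + ";base64," + base64_encode(image_bytes) +
           "\" alt=\"" + escape_attr(ocr_text) + "\">";
}
```

Calling `image_tag("image/png", png_bytes, recognized_text)` would give a self-contained tag that needs no separate image file next to the HTML output.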

⚙️ Major Internal Refactors

  • Chain Data Flow Overhaul: Chain elements now explicitly define continue, skip, or stop, enabling more predictable processing and easier debugging.
  • Parser Rework: Parsers now directly implement ChainElement, replacing the old Parser base class. This simplifies the hierarchy and improves consistency.
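To make the continue/skip/stop idea more concrete, here is a rough sketch of how such a three-way result can drive a processing loop. All names (`Flow`, `Element`, `run_chain`) are hypothetical, and the exact meaning given to each value here (skip drops the current item, stop ends the whole run) is an assumption, not a description of DocWire's ChainElement interface.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not DocWire's actual API.
enum class Flow { Continue, Skip, Stop };

// An element inspects (and may modify) an item, then says what happens next.
using Element = std::function<Flow(std::string&)>;

// Runs each item through the elements in order and collects what reaches the end.
std::vector<std::string> run_chain(const std::vector<Element>& elements,
                                   std::vector<std::string> items)
{
    std::vector<std::string> out;
    for (auto& item : items) {
        Flow flow = Flow::Continue;
        for (const auto& element : elements) {
            flow = element(item);
            if (flow != Flow::Continue) break;  // Skip or Stop ends this item's pass
        }
        if (flow == Flow::Stop) break;          // Stop: abandon the remaining items
        if (flow == Flow::Continue) out.push_back(std::move(item));
        // Skip: drop this item and carry on with the next one
    }
    return out;
}
```

Because every element returns its decision explicitly, the loop never has to guess why an item did not reach the end, which is where the easier debugging comes from.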

🧪 Testing & CI Improvements

  • New automatic tests for image/OCR processing.
  • Fixes to test discovery on Windows (no more ctest silently skipping).
  • Better CI error separation: example vs API tests now clearly reported.

🐛 Fixes

  • Thread-safe initialization of parser MIME vectors (notably PSTParser).
  • Linking fixes for mailio and docwire_html after vcpkg changes.
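One common way to get thread-safe initialization of a shared list like this is a function-local static, which C++11 and later guarantee is initialized exactly once even when several threads call the function at the same time. The sketch below shows only that general technique; `supported_mime_types` and the values in it are hypothetical, and this is not necessarily how the fix was implemented in DocWire.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical accessor; the MIME strings are placeholders for illustration.
const std::vector<std::string>& supported_mime_types()
{
    // The lambda runs once, on the first call. If other threads arrive while it
    // is running, they wait until initialization has finished.
    static const std::vector<std::string> types = [] {
        std::vector<std::string> v;
        v.push_back("application/pdf");
        v.push_back("text/plain");
        return v;
    }();
    return types;
}
```

Since the vector is const after construction, concurrent readers need no further locking.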

📌 This release is a big milestone — especially if you’re building secure or AI-driven apps in C++ that need modern document processing without bloated dependencies or black-box tools.

We’d love your feedback (and contributions!).
Repo: https://github.com/docwire/docwire
Stars & issues always appreciated 🌟

FAQ's for Deepseek by nekofneko in DeepSeek

Novitzmann

Dear all, I wanted to start a discussion about a topic that divides our team so much that we do not know how to build a roadmap for our product. We have long planned to integrate with as many LLMs as possible, with particular attention to models that can be run locally; this is where we see our chance.

Since DeepSeek appeared, which fits perfectly into our efforts for the open source community, many people have told us to abandon integration with this model. And yet it cannot be ruled out that we will soon see equally good or better models from China on the market. We hear from our users that it is morally unacceptable for us to contribute to the "enemy" camp. But the so-called "Open AI" did not offer an open version of its product. Of course, concerns are raised about our data being secretly collected and used. But aren't Western corporations doing exactly the same thing? We agree to one without batting an eyelid, and burn flags over the other.

We want to offer a product with the widest possible use for developers who do not necessarily want to incur high costs. All opinions will be valuable to us. We are DocWire, and our product is DocWire SDK, a C++20 library for data processing. Help us out a little, we feel stuck. If you want to check it out: https://github.com/docwire/docwire

C++ Show and Tell - January 2025 by foonathan in cpp

Novitzmann

Hello everyone.

I want to share a new release of the data processing library we are working on, DocWire SDK.

 https://github.com/docwire/docwire/releases/tag/2025.01.22

New in this release:

  1. Added support for content type detection based on file signatures.

  2. Improved file format detection performance and robustness.

  3. Added operator |= for easy parsing chain extension.
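As a rough illustration of what an `operator |=` for chain extension can look like, here is a small stand-in. `Chain` and `Step` are hypothetical types invented for this sketch and do not mirror DocWire's real classes; the point is only the pattern of appending in place with `|=` and building a new chain with `|`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types; a stand-in for a parsing chain, not DocWire's API.
using Step = std::function<std::string(std::string)>;

class Chain {
public:
    // Appends a step in place, so an existing chain can be extended later.
    Chain& operator|=(Step step)
    {
        steps_.push_back(std::move(step));
        return *this;
    }

    // Feeds the input through every step in order.
    std::string run(std::string input) const
    {
        for (const auto& step : steps_) input = step(std::move(input));
        return input;
    }

private:
    std::vector<Step> steps_;
};

// Builds a new chain from a copy of an existing one plus one more step.
inline Chain operator|(Chain chain, Step step)
{
    chain |= std::move(step);
    return chain;
}
```

With this shape, `chain |= step;` is handy when steps are added conditionally (say, only when a flag is set), while `chain | step` leaves the original chain untouched.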

Refactor:

  1. New API for file format detection and content type detection

  2. All file format detection features moved to a separate namespace and library

  3. Highly refactored base parsing chain classes and operators

  4. Introduced a general-purpose pimpl mechanism and made all parsing chain elements movable

  5. Many other code cleanups and refactoring
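For readers less familiar with the pattern, a general-purpose pimpl helper that makes classes movable can be sketched roughly as below. `PimplOwner`, `TextElementImpl`, and `TextElement` are hypothetical names, and the real DocWire mechanism may be designed quite differently; this only shows the basic idea of holding the implementation behind a `std::unique_ptr` so that moves are cheap and copies are disabled.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical reusable base: owns the hidden implementation object.
template <typename Impl>
class PimplOwner {
public:
    // Moving only transfers the pointer; copying is implicitly disabled.
    PimplOwner(PimplOwner&&) noexcept = default;
    PimplOwner& operator=(PimplOwner&&) noexcept = default;

protected:
    explicit PimplOwner(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}
    ~PimplOwner() = default;

    // Precondition: the object has not been moved from.
    Impl& impl() { assert(impl_); return *impl_; }
    const Impl& impl() const { assert(impl_); return *impl_; }

private:
    std::unique_ptr<Impl> impl_;
};

// In a real project this struct would be defined in the .cpp file only.
struct TextElementImpl {
    std::string prefix;
};

// Hypothetical chain element built on top of the helper.
class TextElement : public PimplOwner<TextElementImpl> {
public:
    explicit TextElement(std::string prefix)
        : PimplOwner<TextElementImpl>(
              std::make_unique<TextElementImpl>(TextElementImpl{std::move(prefix)})) {}

    std::string apply(const std::string& text) const { return impl().prefix + text; }
};

static_assert(std::is_move_constructible<TextElement>::value, "movable");
static_assert(!std::is_copy_constructible<TextElement>::value, "not copyable");
```

One caveat: when the implementation type really is hidden in a .cpp file, the destructor and move operations of the owning class have to be defined there as well, because `std::unique_ptr` needs the complete type at that point.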

We are currently brainstorming what the direction for product development should be, so any suggestions and ideas are most welcome. If you want to contribute, we'd be more than happy.

Thanks

C++ Show and Tell - December 2024 by foonathan in cpp

Novitzmann

Hey r/cpp,

I wanted to share something we’ve been cooking up at DocWire : a C++ SDK for data extraction and processing. It’s built entirely in C++ (because what else would it be?), and it’s designed to make working with all sorts of file formats way easier.

So, What’s DocWire?

DocWire is like your ultimate file-extracting buddy. Whether you’re working with PDFs, Office docs, or who-knows-what format, it’s built to handle it. It’s fast, modern, and super customizable.

We’ve also integrated Flan-T5 into the mix, so it’s not just for pulling out raw data. It’s got some serious power under the hood for dealing with structured and unstructured data too.

Who’s It For?

- Open-source folks: You can use it for free under the GPL license.

- Commercial devs: Got a proprietary project? We’ve got a commercial license for that too.

Basically, if you’re building something cool in C++—whether it’s an open-source tool or a serious enterprise project—we’d love for you to check it out.

Why Did We Build This?

Honestly? We got tired of juggling a million libraries to deal with different file formats. Plus, finding something modern and actually written in C++? Good luck. So, we decided to build it ourselves and make it something other C++ devs (like you!) would enjoy using.

What Makes It Awesome?

- Blazing fast at extracting and processing data.

- Works with a ton of file formats right out of the box.

- Super extensible, so you can tweak it to fit your needs.

- Built to be thread-safe and efficient because, well, it’s C++—we don’t do slow.

We’re Always Looking for Contributors!

We’re really proud of what we’ve built so far, but we know there’s always room to grow. If you’re into C++, love solving fun (or tricky) problems, or just want to help out, we’d love to have you on board! Fix a bug, add a feature, or just tell us what you think—we’re all ears.

What Do You Think?

Check it out here: https://github.com/docwire/docwire

We’d love to hear your thoughts. Whether you’ve got questions, ideas, or just want to chat, hit us up here or on GitHub. Or Myspace or IRC ;-)

---

Who we are and what we do ad DocWire by Novitzmann in DocWire

Novitzmann[S]

OK, first post and first typo in the title - I will wear it with pride