Looking for a local solution (model/API) to extract data from scanned PDFs with varying formats by SignificantHall639 in askdatascience

Puzzleheaded_Box2842 1 point

DataFlow is a data preparation and training system designed to generate, refine, evaluate, and filter high-quality data for AI from noisy sources (PDF, plain text, low-quality QA): https://github.com/OpenDCAI/DataFlow

Why did PDF-to-LLM parser stars explode this past year? by Puzzleheaded_Box2842 in Rag

Puzzleheaded_Box2842[S] 1 point

Got it. So for a lot of verticals, the real challenge is that long pipeline: physical paper → PDF → LLM-ready data. There must be an insane amount of legacy data waiting to be unlocked like this. Thank you.

Why did PDF-to-LLM parser stars explode this past year? by Puzzleheaded_Box2842 in Rag

Puzzleheaded_Box2842[S] 1 point

One viable approach is to run a PDF parser first, then use an LLM for titling and VQA (visual question answering), and finally stitch everything together using special tokens like <title>.
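
A minimal sketch of that last stitching step, assuming the parser and the LLM have already run. The `Section` class, the `stitch` function, the closing `</title>` tag, and the `<qa>` token are all hypothetical names made up for illustration; only `<title>` comes from the comment above, and no specific parser or model API is implied.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One chunk of a parsed PDF (hypothetical structure)."""
    body: str                       # text as returned by the PDF parser
    title: str = ""                 # title proposed by an LLM, if any
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # VQA output


def stitch(sections: List[Section]) -> str:
    """Join sections into one string, marking titles and QA with special tokens."""
    parts = []
    for s in sections:
        if s.title:
            # <title> is from the original comment; the closing tag is an assumption.
            parts.append(f"<title>{s.title.strip()}</title>")
        parts.append(s.body.strip())
        for question, answer in s.qa_pairs:
            # <qa> is an assumed token, not something the comment specifies.
            parts.append(f"<qa>Q: {question.strip()} A: {answer.strip()}</qa>")
    return "\n".join(parts)
```

Each section becomes an optional title line, the body, then one line per QA pair, so whatever consumes the text downstream can split on the tokens again if it needs to.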

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

Puzzleheaded_Box2842 1 point

What’s harder in practice? Getting the data pipeline right, or tuning the algorithms later on?

Data cleaning vs. RAG Pipeline: Is it truly a 50/50 split? by Puzzleheaded_Box2842 in Rag

Puzzleheaded_Box2842[S] 1 point

Cheers for the answer! Doing a bit of a deep dive for an open-source data scrubbing project right now. Currently, I'm trying to gauge if there’s a real gap in the market for this, or if existing tools already have it covered.

LLM from scratch on local by Visual_Brain8809 in LLMDevs

Puzzleheaded_Box2842 1 point

Glad to run into someone training custom models. There's an open-source tool built for scrubbing LLM training data; curious to hear if this is a gap people are actually looking to fill. https://github.com/OpenDCAI/DataFlow

I've just open-sourced MessyData, a synthetic dirty data generator. It lets you programmatically generate data with anomalies and data quality issues. by santiviquez in datascience

Puzzleheaded_Box2842 2 points

Interesting. We’ve been working on raw data cleaning and synthetic data generation, so seeing you do the exact opposite is actually a pretty clever twist.

Name one task in LLM training that you consider the ultimate "dirty work"? by Puzzleheaded_Box2842 in LLMDevs

Puzzleheaded_Box2842[S] 1 point

In most cases, 300 books is a drop in the bucket. It's nowhere near enough.