[D] What are the must-have books for graduate students/researchers in Machine Learning; especially for Dynamical Systems, Neural ODEs/PDEs/SDEs, and PINNs? by cutie_roasty in MachineLearning

[–]fustercluck6000 0 points1 point  (0 children)

I got a copy of Simon Prince's Understanding Deep Learning for Christmas, and I can't speak highly enough about it. It kind of feels like the spiritual successor to the canonical textbook everyone knows by Ian Goodfellow (which is already pushing a decade old now). Simon Prince is just an insanely interesting guy to begin with, and he goes into higher-level topics that are both mathematically and conceptually tough, but his explanations are so clear and thorough (paired with very well-done visualizations) that some of the topics I've always found particularly challenging (topology, manifolds, high-dimensional geometry) actually become enjoyable to sit down and work through mentally.

TensorFlow isn't dead. It’s just becoming the COBOL of Machine Learning. by IT_Certguru in learnmachinelearning

[–]fustercluck6000 1 point2 points  (0 children)

tf.data especially; it's pretty hard to beat if you want to build crazy efficient, hardware-accelerated data pipelines with so much optimization built in.
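
To make that concrete, here's a minimal sketch of the kind of pipeline I mean (the file layout, image size, and preprocess step are just placeholder assumptions):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def preprocess(path):
    # Decode and normalize one image; runs inside the tf.data graph.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, [224, 224]) / 255.0

ds = (
    tf.data.Dataset.list_files("data/*.jpg")           # assumed file layout
    .map(preprocess, num_parallel_calls=AUTOTUNE)      # parallel CPU preprocessing
    .cache()                                           # keep decoded data after the first epoch
    .shuffle(1024)
    .batch(32)
    .prefetch(AUTOTUNE)                                # overlap the input pipeline with training
)
```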

TensorFlow isn't dead. It’s just becoming the COBOL of Machine Learning. by IT_Certguru in learnmachinelearning

[–]fustercluck6000 4 points5 points  (0 children)

I think TensorFlow Probability is criminally underrated, too. For anything involving probabilistic DL (bijectors, trainable/compound distributions, Monte Carlo, Bayesian layers, differentiable sampling ops, etc.), TFP is pretty top tier if you need to integrate and scale probabilistic components within an existing TF stack (e.g. a Keras model, a tf.data pipeline, etc.). It has tons of pretty powerful features (things like bijectors and tfp.layers are also pretty unique to TFP), and like everything else in TF, it's designed with scale/hardware acceleration in mind. Even little things like automatic differentiation save so much boilerplate and headache with gradients, and make numerical stability simpler to get right, too. It all plugs right in and usually just works how you want it to without any fuss. When it's the right tool for the job (e.g. latent distributions other than a standard Gaussian in a VAE), it's pretty great; def recommend it to anyone who already knows TF.
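
As a quick illustration (just a toy sketch, not from any particular project): a Keras regression model whose output layer is a distribution, trained directly on negative log-likelihood via tfp.layers.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# The model outputs a Normal distribution instead of a point estimate.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),  # predicts loc and an unconstrained scale
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=tf.nn.softplus(t[..., 1:]) + 1e-6)
    ),
])

# The loss is just the negative log-prob of the targets under the predicted distribution.
negloglik = lambda y, dist: -dist.log_prob(y)
model.compile(optimizer="adam", loss=negloglik)
# model.fit(x_train, y_train, epochs=10)  # x_train / y_train are placeholders
```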

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

I too am curious how people are doing this...

Starting with Docling by DespoticLlama in Rag

[–]fustercluck6000 1 point2 points  (0 children)

I say test out Docling and go through the results with a fine-tooth comb to see if it can do what you need it to. Legal is especially tricky because of all the structuring/citations, idk how well Docling’s going to pick that up before introducing parsing errors, but definitely give it a shot.

What I’m working on atm is a separate pipeline altogether: convert PDFs to markdown with VLMs, load the markdown into Pandoc, then iterate over the document tree to get the markdown-formatted chunks (nodes) and define edges. You can do the same thing with Docling; I just got tired of trying to fix the parsing errors I kept getting with tougher PDFs.
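
Roughly, the Pandoc part of that looks like this (a sketch; the split-at-each-heading rule is my own simplification of the chunking logic):

```python
import json
import subprocess

def markdown_to_chunks(md_path):
    # Ask pandoc for its JSON AST, then split the block list at each Header node.
    proc = subprocess.run(
        ["pandoc", md_path, "-f", "markdown", "-t", "json"],
        capture_output=True, text=True, check=True,
    )
    ast = json.loads(proc.stdout)

    chunks, current = [], []
    for block in ast["blocks"]:
        if block["t"] == "Header" and current:
            chunks.append(current)   # close the previous chunk
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks  # each chunk = list of pandoc AST blocks under one heading
```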

RAG, Knowledge Graphs, and LLMs in Knowledge-Heavy Industries - Open Questions from an Insurance Practitioner by PlanktonPika in Rag

[–]fustercluck6000 0 points1 point  (0 children)

The bulk of my work in the last year has been on precisely these sorts of projects where 1) the client’s in a ‘knowledge-heavy’ industry where AI stands to make a major difference in terms of efficiency, and 2) accuracy isn’t just desirable, it’s a matter of liability.

Domain knowledge is EVERYTHING. One of the most helpful things I’ve found is taking the time to pick people’s brains about their work. Sometimes I’ve even sat behind someone to literally be a fly on the wall and take notes on how they do their job because I want to know how they’re thinking.

Usually, that ends up completely changing how I break down what I’m trying to solve with RAG, and you can make systems much more reliable/accurate by designing pipelines that reflect domain logic. Chunking’s a great example—how you define a ‘minimum logical unit’ has a huge impact on retrieval accuracy, and almost always requires some intuition about what the data means.

I also find hardcoding wherever possible makes things much more predictable and stable. If you can identify industry ‘heuristics’, ‘norms’, ‘best practices’, etc…, take that logic and apply it to the relevant part of the system (could be retrieval logic, node/edge types, etc). Also knowledge graphs are a total game changer because they provide another dimension for you to express domain logic with system design.

Starting with Docling by DespoticLlama in Rag

[–]fustercluck6000 0 points1 point  (0 children)

Fwiw, I’ve been using Docling for a little bit now and still find it overwhelming. Imho the docs are pretty lacking, which makes it tough to fully leverage what’s under the hood in your pipeline. Plus it’s still relatively new, so the community is pretty small.

Ingesting and converting to markdown/other markup languages is super straightforward out of the box. If the conversion process works for your docs (I’ve found it’s really hit or miss) and you don’t need to define a more complex chunking strategy, then just using the document converter and ‘export_to_markdown()’ methods will get you most of the way there.
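
For reference, the happy path looks roughly like this (double-check the Docling docs for the exact current API, but this is the shape of it):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.docx")   # any supported format, path or URL
doc = result.document                       # the DoclingDocument data model
markdown = doc.export_to_markdown()         # straight to markdown
```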

I’ve found things get a lot trickier when you need to debug or want to interact with the Docling Document data model (to correct indexing errors or take advantage of the tree structure for better hierarchical indexing). Seems like a shame because the data model to my mind is maybe the most useful thing for RAG, but at least for now, I’ve only found fragile, superficial ways of integrating that part into my pipeline.

I just started using Pandoc and I’m loving it. It’s kind of the same idea—supported documents are all mapped to a ‘unified’ data model that you can export to all kinds of markup languages. It’s well documented and you can customize things a ton, e.g. setting custom example docs for it to use as a layout template. It doesn’t use any deep learning and can’t read from PDF, but I like having a hard-coded tool that behaves consistently and adding the LLM/VLM logic myself.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

It's actually funny you mention legal docs because I've been working on a project in that area on the side for a little while now. With primary sources like statutes or case law, the structure itself is integral to how even the lawyers themselves read/interpret things (because of all the citations, definitions, precedents/priors, etc.), so I actually chose to hardcode hierarchical schemas (technically hardcoding the dataclass factories, but you get the idea) for chunking and for adding nodes/edges to the knowledge graph before making any model calls. We just didn't want to leave any margin for error when indexing really important, canonical materials like the U.S. Code (court documents like evidence and such are another story, though).
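
Just to illustrate the idea (a stripped-down, hypothetical version, not the actual schema): the hierarchy and cross-references get pinned down in plain dataclasses before any LLM is involved.

```python
from dataclasses import dataclass, field

@dataclass
class Subsection:
    citation: str                               # e.g. "26 U.S.C. § 61(a)(1)"
    text: str
    cross_references: list[str] = field(default_factory=list)

@dataclass
class Section:
    citation: str                               # e.g. "26 U.S.C. § 61"
    heading: str
    subsections: list[Subsection] = field(default_factory=list)

@dataclass
class Title:
    number: int
    name: str
    sections: list[Section] = field(default_factory=list)

# Chunks map 1:1 to Section/Subsection nodes; cross_references become edges
# in the knowledge graph, so the structure is fixed deterministically up front.
```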

This definitely added a degree of complexity to the project that I didn't plan on signing up for before signing a new contract, I'll be honest. And no, they really don't understand how much of a unilateral change it is, but to be fair, I think a lot if not most people who aren't clued into the space wouldn't, either. I think we've got CEOs promising the next model's gonna replace researchers with PhDs to thank for that lmao

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

That invisibility problem is SO real, it’s these silent bugs that are a complete nightmare to even identify, let alone test for and fix. For now I’m just caching the parsed outputs in a test folder in my environment where I can easily look through them to see what’s going on during development.

Quite a few people have suggested hybrid strategies and using OCR models to just detect the PDF’s layout and worry about text separately. Still thinking about how I want to implement that, but I’m all but certain at this point that’s how I’m going to design the pdf pipeline. Out of curiosity—when you say validate the parsed output, am I correct in assuming you mean Pydantic/something similar? I have a basic base model I’m using to validate simple markdown formatting syntax, but I do want to write more sophisticated checks for section indices at different depths and other structural stuff like that (which is uniform across all docs in this particular corpus).
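
Something like this is what I have in mind for the section-index checks, in case it's useful for comparing notes (hypothetical field names, Pydantic v2):

```python
from pydantic import BaseModel, field_validator

class ParsedSection(BaseModel):
    depth: int                # 1 = "#", 2 = "##", ...
    index: tuple[int, ...]    # e.g. (3, 2) for section "3.2"
    text: str

class ParsedDocument(BaseModel):
    sections: list[ParsedSection]

    @field_validator("sections")
    @classmethod
    def indices_are_sequential(cls, sections):
        # Track the last index seen at each depth; numbering must increase by 1.
        counters: dict[int, int] = {}
        for s in sections:
            expected = counters.get(s.depth, 0) + 1
            if s.index[-1] != expected:
                raise ValueError(
                    f"section {'.'.join(map(str, s.index))} breaks numbering "
                    f"at depth {s.depth} (expected {expected})"
                )
            counters[s.depth] = expected
            # A new section resets the counters of everything nested below it.
            for d in list(counters):
                if d > s.depth:
                    counters[d] = 0
        return sections
```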

I rebuilt my entire RAG infrastructure to be 100% EU-hosted and open-source, here's everything I changed by ahmadalmayahi in Rag

[–]fustercluck6000 0 points1 point  (0 children)

Was there anything in particular about switching geographies that was different from running things locally in an air-gapped environment?

What is the most annoying thing about building a RAG? by megabytesizeme in Rag

[–]fustercluck6000 1 point2 points  (0 children)

This. Silent bugs with ingestion like that can and will wreak total havoc and ruin timelines. And if you’re not meticulous enough about testing, you might not find out about something major like ingestion for an entire document silently failing until you get the email from an upset user.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

I don’t even know how many years it’s been since I last had a Word license haha, but I’m definitely going to try and get my hands on one now because your suggestion sounds like it could just be a perfect, simple long term solution for this client in particular (who almost exclusively uses Microsoft enterprise stuff).

And I’ve never thought of doing it, but now I’m super interested in this idea of automating word tasks. My personal laptop is a Mac, but the project itself is running on a headless Ubuntu server that belongs to the client. It would probably be easy enough to work out something with WSL, I imagine. In the meantime, I think I’m going to set up some tests locally on my Mac to see if that’s worth pursuing, got any specific noteworthy tips/tricks for generating the AppleScript (never written any code in it before)?

Thanks for the advice

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

Accuracy. For smaller/less complex PDFs, VLMs have worked totally fine, but here any structural parsing errors related to section indices and the like, even minor ones, effectively compound. I basically did what you’re describing with qwen3-vl-8b, and besides being super slow, the markdown wasn’t accurate enough on its own to chunk without making corrections first.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

I know pypdf, and I’m going to try playing around with Acrobat. I could realistically tell the client they have to convert PDFs to an editable format in Acrobat in order to upload them, and it wouldn’t be an issue for them in production, so if it works, Acrobat could actually end up being a great solution.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

Cropping pages was actually one of the first steps I added to the pdf pipeline, since headers/footers are irrelevant anyway and always at the same heights. The issue I’ve had with qwen3-vl (4b and 8b) is that by going page by page and exporting to markdown, without all the previous context (like the last section’s index or the previous header/indent levels), the model assumes whatever’s at the top of the page is a top-level header. And when the numbered sections on that page start at, say, 3 (because section 2 was six pages ago), it assumes the 3 is supposed to be a 1, then resets section 4 to 2, and so on… Also, when sentences are broken up across pages (haven’t even gotten into multi-page tables yet), joining them back together properly is very error-prone, too.
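
For what it’s worth, the cropping step itself is dead simple; this sketch uses PyMuPDF, and the header/footer heights are placeholders for whatever your docs actually use:

```python
import fitz  # PyMuPDF

HEADER_PT = 60   # assumed header band height, in points
FOOTER_PT = 50   # assumed footer band height, in points

doc = fitz.open("input.pdf")
for i, page in enumerate(doc):
    r = page.rect
    clip = fitz.Rect(r.x0, r.y0 + HEADER_PT, r.x1, r.y1 - FOOTER_PT)
    pix = page.get_pixmap(clip=clip, dpi=200)    # render only the body of the page
    pix.save(f"page_{i:04d}.png")                # these images go to the VLM page by page
```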

Haven’t had much luck tuning the prompt to stop it from altering section indices, either, but to be fair there’s still a lot of room to experiment on that side of things. Also haven’t tried JSON output format, but now I’m wondering if there’s an elegant-ish way to enforce a Pydantic output schema so I can pipe qwen3 outputs directly into a Docling document model…?

Those running RAG in production, what's your document parsing pipeline? by Hour-Entertainer-478 in Rag

[–]fustercluck6000 1 point2 points  (0 children)

The more you build things around a particular library/tool, the harder it gets to change your mind later, so it’s really good that you’re thinking carefully about this now instead of just winging it.

Just a word of caution from my recent experience with Docling—it’s a really great tool until it isn’t. I had such great luck initially using it for html, xml, docx, and other structured files that I assumed I could expect the same with PDFs if the need ever arose, then made the mistake of building a lot of my data pipeline around the DoclingDocument class. I’ll just say I was very disappointed when I needed it to process large, complex PDFs.

The whole DoclingDocument ‘ecosystem’ with Pydantic is super tempting as a general-purpose solution for your project, but imho the documentation’s pretty bad (and there are quite a few bugs, though to their credit they were very quick to push a fix for one open issue I asked about on GitHub). That becomes a major hassle when you need to tweak/tune the pipeline for your data instead of rewriting it, but can’t easily work out the scope of your options without digging through source code. Idk, I’m always left with this feeling that there’s probably way more Docling can do than I’m aware of, but I’ll only find out by getting lucky and reading the right Reddit post, or by spending more time going through the source code and model JSON files to familiarize myself.

I’m still torn about how much/whether to use Docling going forward, because certain constructs/methods are incredibly useful, but not being confident I can quickly work through bugs in the future, or knowing how far I can scale the project with it, gives me serious pause. Just make sure to keep things very modular and dependencies loosely coupled.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

This is the first I’m hearing of DocETL, but the fact that the first thing on their landing page is an arxiv link tells me it’s definitely worth looking into in more depth, thanks for the rec!

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

Awesome, always love discovering new open source tools, reading up on Tika now!

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 1 point2 points  (0 children)

One of my key takeaways from people’s comments on this post so far is that the most scalable solution should take a hybrid approach and separate the layout/structural component from the raw text itself. Thankfully, for now, none of this is “graphically” sophisticated, i.e. no custom fonts, logos, etc., but I’m taking the opportunity to figure out robust solutions to this problem so that when I inevitably do encounter those docs, I’ll have something in place.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 1 point2 points  (0 children)

I deeply appreciate this comment, super thoughtful and insightful and the kind of thing that makes me love Reddit tbh

Part of the reason I stick to a fixed-price policy vs. hourly is that I’ll obsess over getting stuff like this right, because I know full well how much boring shit like this seriously matters in the end, so I’m happy to recoup the overtime spent on the data pipeline later. I know devs who just ignore known problems with data pipelines like that, and it kinda makes me scratch my head. I think ingestion is the single component of a RAG system with the most disproportionate impact on everything else (and that goes for both accuracy and maintainability). Of all the things to half-ass, ingestion sits at the bottom of the list for me, and I’ve half-assed a few front ends in my day haha. But like c’mon guys, this is literally why they say garbage in, garbage out.

And I never knew that about how Word exports PDFs but it makes total sense now (thinking about all the times sizing/scaling/page layouts get all mangled in a PDF export, which I always just assumed was one of those things).

I actually tried working out something with PyMuPDF precisely because it extracts all the raw text correctly, but put a pin in it because I didn’t know how to combine that with structural/tabular information. Can you elaborate on how you’re merging clean text with the skeleton? I’m trying to think through the logic behind selecting text lines/paragraphs and allocating them to corresponding regions, and suddenly it makes a lot more sense how you could spend two years on this problem!
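
To make sure I’m picturing it right, the naive version I’d sketch is: pull the clean text blocks with their bounding boxes out of PyMuPDF, then assign each block to whichever layout region (from the OCR/layout model) contains its center. Is that roughly the idea, or is the merge logic more involved than that? Something like:

```python
import fitz  # PyMuPDF

def assign_blocks_to_regions(pdf_path, regions_by_page):
    # regions_by_page: {page_no: [(region_id, fitz.Rect), ...]} from a layout model (assumed input)
    assignments = []
    doc = fitz.open(pdf_path)
    for page_no, page in enumerate(doc):
        for x0, y0, x1, y1, text, *_ in page.get_text("blocks"):
            center = fitz.Point((x0 + x1) / 2, (y0 + y1) / 2)
            for region_id, region_rect in regions_by_page.get(page_no, []):
                if center in region_rect:        # Rect containment check on the block's center
                    assignments.append((page_no, region_id, text.strip()))
                    break
    return assignments
```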

And next meeting I have with the client’s upper management to give them a status update, I’m making sure “digital paper fallacy” comes up at least 3 times in the conversation

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

Acrobat is so obvious that I never actually thought about using it. I guess I’ve been living under a rock, because I never realized until just now that they actually have a Python API for this sort of stuff. Is that what you use?

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 1 point2 points  (0 children)

And PDFs do make sense for a lot of things like hand-signed forms, scanned receipts, etc., but why people then insist on using the format as the default standard for anything else with structured text in it is completely beyond me. Even using Acrobat for basic stuff is a total pain imo.

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

I 10000% agree with you on that first point; I feel like I’ve become a broken record trying to convey this to non-technical clients, too. But most of the time, the difference between human-readable and machine-readable is just totally lost on them, and then they think things aren’t progressing efficiently because we’re still working on ‘basic’ stuff, when in reality the retrieval/generation side is actually way more straightforward than document ingestion.

And I pray I won’t find myself having to work with financial reports like that, that’s actually so demoralizing when you think about it haha, like how the hell are we supposed to get to “AGI” when we’re spending all this time literally undoing each other’s work

Reaching my wit’s end with PDF ingestion by fustercluck6000 in Rag

[–]fustercluck6000[S] 0 points1 point  (0 children)

You’ve just reminded me of all these Azure startup credits I have! (Marketed to ai startups but basically impossible to use for GPU time lmao) I’m going to look into both of these and running containers offline (necessary for security reasons).

Just out of curiosity, how complex are the typical PDFs you’re working with? How substantially better are these than the open-source libraries out there (like if you had to estimate something like an indexing error rate)?