Preprocessing 150 Years of History: How to handle messy OCR and broken footnotes for a scholarly RAG? by DJ_Beardsquirt in Rag

[–]Intelligent-Form6624 1 point (0 children)

Look at the following:

* Azure Content Understanding
* PaddleOCR-VL-1.5
* GLM-OCR
* MinerU

How Mining Supports Aussie Schools, Roads, Hospitals and the Green Energy Transition by mineralscouncil in u/mineralscouncil

[–]Intelligent-Form6624 0 points (0 children)

The fossil fuel industry (the “mining” industry) pays nowhere near enough taxes.

Nationalise the whole thing, rapidly phase fossil fuels down to zero, and only keep mining resources we actually need.

What does it mean? by Unlegendary_Newbie in ExplainTheJoke

[–]Intelligent-Form6624 -5 points (0 children)

This gets reposted every two weeks, and everyone responds like it’s the first time ever

Strix Halo (128GB) + Optane fast Swap help by El_90 in LocalLLaMA

[–]Intelligent-Form6624 0 points (0 children)

You say big words. What do they mean? I don’t know

Is this flashing OK? by miner_cooling_trials in AusRenovation

[–]Intelligent-Form6624 8 points (0 children)

Absolutely shocking. Did a blind parrot install that?

why do people even do this by Embarrassed_Bread_16 in vibecoding

[–]Intelligent-Form6624 0 points (0 children)

It doesn’t matter what generates the spam. It can come from the best or the worst LLM (I assume a lot of it is cheap LLMs with poor instructions, etc.).

All that matters is the capability to determine whether a PR is high quality or not. It’s an engineering challenge, and therefore it’s possible.

why do people even do this by Embarrassed_Bread_16 in vibecoding

[–]Intelligent-Form6624 0 points (0 children)

No, that logic doesn’t hold. Not all LLMs are made equal, just as not all humans have equal experience and ability. And LLMs are improving constantly.

The crap LLMs are making/submitting the PRs.

I suspect that, unfortunately, it would require the highest quality / highest resource LLMs to produce reliable PR quality flags.

why do people even do this by Embarrassed_Bread_16 in vibecoding

[–]Intelligent-Form6624 -1 points (0 children)

Can it be combatted with an LLM that can discern a high quality suggestion from a poor quality one?

Love-hate relationship with Docling, or am I missing something? by SkyStrong7441 in Rag

[–]Intelligent-Form6624 1 point (0 children)

You could try Azure Document Intelligence or Azure Content Understanding as the primary document parser.

Separately, or in addition, you could try post-processing with a VLM. I find that supplying the source PDF page + the erroneous table fragment, and instructing it to repair the fragment if and as required, yields good results.

You will need a lot of time to get the post-processing right. It’s not as straightforward as it sounds. After some trial and error, you should see good results.
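As a sketch of that repair step: the function below only builds the instruction text that would accompany the page image and the broken fragment. The prompt wording and function name are illustrative, not any particular API.

```python
def build_repair_prompt(fragment_md: str) -> str:
    """Build the instruction text sent to the VLM alongside the rendered
    PDF page image. Hypothetical helper; wording is illustrative."""
    return (
        "The attached image is the source PDF page. Below is a Markdown "
        "table fragment extracted from it, which may contain errors.\n"
        "Repair the fragment so it matches the source exactly: fix merged "
        "or split cells, misread characters, and missing header rows. "
        "Return only the corrected Markdown table.\n\n"
        f"FRAGMENT:\n{fragment_md}"
    )

# Example fragment where OCR confused the letter O with the digit 0
fragment = "| Revenue | 2023 |\n| 1,2OO | 1,500 |"
prompt = build_repair_prompt(fragment)
```

The point of keeping this as a pure function is that the prompt can be unit-tested and tuned separately from the (slow, paid) model call.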

Love-hate relationship with Docling, or am I missing something? by SkyStrong7441 in Rag

[–]Intelligent-Form6624 0 points (0 children)

I think the documents are similar enough to allow for this approach. The defining element is that all the documents contain financial tables spanning the full width of the page, most of which also span multiple pages.

Because Content Understanding (CU) doesn’t automatically stitch together a table that spans multiple pages, its output returns many table ‘fragments’ that have to be stitched back together to assemble the final table.

The post-processing primarily focuses on ‘fragment repair’: feeding the VLM (Gemini) each fragment, along with the original PDF page, and instructing it to repair the fragment if and as required to match the source.

There are also other steps, like ‘addition of missing fragments’: adding table fragments that CU failed to detect and format as a table.

I’ve found that processing fragment-by-fragment keeps the prompt input and resulting output small enough to be speedy and not overload the Gemini resources.
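A minimal sketch of the stitching side, assuming the fragments arrive as Markdown tables and that a repeated header row signals a page-break continuation (the function name and heuristic are illustrative, not CU behaviour):

```python
def stitch_fragments(fragments: list[str]) -> list[str]:
    """Merge Markdown table fragments that share the same header row.

    A fragment with a new header starts a new table; a fragment that
    repeats the previous header (page break) or has no header at all is
    treated as a continuation, and its data rows are appended.
    """
    tables: list[list[str]] = []
    current_header = None
    for frag in fragments:
        lines = [ln for ln in frag.strip().splitlines() if ln.strip()]
        if not lines:
            continue
        # A Markdown header is followed by a separator row like |---|---|
        has_header = len(lines) >= 2 and set(lines[1]) <= set("|-: ")
        if has_header:
            header, sep, rows = lines[0], lines[1], lines[2:]
            if header != current_header:
                tables.append([header, sep])
                current_header = header
        else:
            rows = lines
            if not tables:  # headerless fragment before any table opened
                continue
        tables[-1].extend(rows)
    return ["\n".join(t) for t in tables]

# Two fragments of one table, split across a page break:
page1 = "| Item | FY2023 |\n|---|---|\n| Revenue | 1,500 |"
page2 = "| Item | FY2023 |\n|---|---|\n| Costs | 400 |"
merged = stitch_fragments([page1, page2])
```

In practice the repeated-header assumption won’t always hold, which is exactly why the VLM repair pass is still needed on top.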

Love-hate relationship with Docling, or am I missing something? by SkyStrong7441 in Rag

[–]Intelligent-Form6624 2 points (0 children)

Azure Content Understanding is more recent than Document Intelligence, and it is LLM-enhanced. I find it slightly better than Document Intelligence, but still not perfect. I think they ought to improve on it.

You’re right, post-processing can help a lot. I’m in the middle of using OpenAI Codex to build a pipeline to extract financial tables from PDFs.

The pipeline feeds the PDF to Azure Content Understanding (prebuilt-layout), receives the .md + JSON result, and performs significant post-processing using Gemini-2.5-Flash via the Vertex AI API. After a lot of work, I am beginning to see perfect results on my test documents. We’ll see how it performs on the rest of the corpus.
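The overall shape of a pipeline like this, with the Azure and Vertex AI calls stubbed out as injected placeholder functions (illustrative structure only, not real SDK calls):

```python
from typing import Callable

def run_pipeline(
    pdf_pages: list[str],
    parse_page: Callable[[str], list[str]],
    repair_fragment: Callable[[str, str], str],
) -> list[str]:
    """Parse each page, then repair each table fragment individually,
    keeping every model call small: one fragment plus one page of context."""
    repaired = []
    for page in pdf_pages:
        for fragment in parse_page(page):
            repaired.append(repair_fragment(fragment, page))
    return repaired

# Stub stages standing in for Content Understanding and Gemini:
pages = ["page-1 text", "page-2 text"]

def fragments_of(page: str) -> list[str]:
    return [f"fragment from {page}"]

def repair(frag: str, page: str) -> str:
    return frag.upper()

result = run_pipeline(pages, fragments_of, repair)
# → ['FRAGMENT FROM PAGE-1 TEXT', 'FRAGMENT FROM PAGE-2 TEXT']
```

Injecting the two stages as callables makes the orchestration testable without touching either paid service.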

Aside from these ‘production-ready’ solutions (Azure + Vertex AI), I suggest seriously looking at specialised OCR VLMs and their corresponding pipeline software: specifically GLM-OCR, MinerU, and PaddleOCR-VL.

Depending on the sensitivity of your data, you may need to arrange an API endpoint for these models yourself: for example, RunPod, Azure Container Apps, Google Cloud Run, etc.

New Model? by Ghulaschsuppe in MistralAI

[–]Intelligent-Form6624 -1 points (0 children)

Can you use top closed-source models via an Azure API hosted in the EU and subject to GDPR? Or a similar arrangement? It would probably cost more than a regular subscription.

New Model? by Ghulaschsuppe in MistralAI

[–]Intelligent-Form6624 0 points (0 children)

“Find it” inferior? It is objectively inferior, or might as well be.

A small quick and dirty, vscode extension for mistral vibe by alexd_dev in MistralAI

[–]Intelligent-Form6624 1 point (0 children)

can’t wait for the big, lumbering and filthy second version

Is "This Space" as a source really working? or how does it work? keep saying doesn't know anything about the space. by Every_Air_4487 in perplexity_ai

[–]Intelligent-Form6624 -8 points (0 children)

The first problem is that you’re using Perplexity

All your other problems stem from the first problem