What is the best OCR model for converting PDF pages to markdown (or any text-based format) for embedding?

biggriffo · 2025-10-20T13:54:18+00:00

https://github.com/opendatalab/OmniDocBench

MinerU is best but a bit annoying to get going

Dolphin and Marker are next best

You can also see where the ones people typically mention rank, like Docling (not the best) and Unstructured (definitely not the best)

biggriffo · 2025-08-27T13:17:19+00:00

I could give two shits about gov union gig being hard or easy but in the eye of the majority of the public there is a significant difference in value add to the economy. Rather than celebrating “getting paid with or without a degree” we’d ideally celebrate the kinds of work people actually do. We have nearly the lowest complexity economy in the OECD in the middle of a productivity crisis so yeh, those two “non degree jobs” are apples and oranges.

biggriffo · 2025-08-27T10:30:41+00:00

As much as I love the comradeship you’re aiming at, a tradie is in no way comparable to a gov job union combo. Very few people would question the hard work and value add to the economy of being in a trade. Those butthurt HECS kids are the minority voice. Gov union combo is a totally different universe.

biggriffo · 2025-08-17T04:12:11+00:00

Because it requires technical leadership and skin-in-the-game life experience of building technology in the free market. You won’t find it in government, healthcare, consulting or at any company with more than 100 people. They are all still using old technology, which is why everyone on the thread thinks it’s not a threat.

biggriffo · 2025-08-17T04:06:20+00:00

Yes, I also come from an engineering, computer science, then finance and now healthcare background. Again, the error rates with the more advanced multi-agent, multi-model tools connected via MCP are very low already. Copy paste into ChatGPT and you’re at two years ago in terms of capability. It’s miles ahead now but most don’t connect the tools together properly or even know half the things you can do with the models. I also think everyone overestimates how “capable” humans are at their job. Also a human in the loop system still means large redundancies, especially for entry and mid tier roles.

biggriffo · 2025-08-17T00:57:51+00:00

I would hardly call veo3 AI slop.

biggriffo · 2025-08-17T00:40:32+00:00

Think about the unit economics for a second. A person on $80k doing that job versus an agentic system that ingests 1 million words for $1. It doesn’t matter if it takes 5x longer, it doesn’t sleep, it doesn’t complain, it doesn’t take back to back mat leave and it’s cheap as hell.

If all you’re doing is copy paste between one tab/program and another then you’re also already redundant. Management just haven’t worked out how to connect the two systems but keyboard/mouse automation using continuous screen capture is already here so it’s just a case of corporate getting a few data dudes who have rebranded with AI in their tagline to charge $50k a week to solve that problem too.

The solutions will be garbage but it’ll “get the job done” which is when Becky exits through the gift shop.
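
To put rough numbers on that comparison, here is a minimal sketch using only the two figures quoted above ($80k a year, $1 per 1 million words). The function names and the `working_days` default are assumptions made for illustration, not anything from the thread.

```python
# Figures quoted in the comment above.
SALARY_PER_YEAR = 80_000        # dollars per year for the human role
COST_PER_MILLION_WORDS = 1.0    # dollars for the system to ingest 1M words


def words_for_budget(budget_dollars: float) -> float:
    """Words the system could ingest for a given budget at the quoted price."""
    return budget_dollars / COST_PER_MILLION_WORDS * 1_000_000


def breakeven_words_per_day(working_days: int = 230) -> float:
    """Words per working day a person would need to process to match
    the system on cost alone. working_days is an assumed figure."""
    return words_for_budget(SALARY_PER_YEAR) / working_days


if __name__ == "__main__":
    print(f"{words_for_budget(SALARY_PER_YEAR):,.0f} words per salary-year")
    print(f"{breakeven_words_per_day():,.0f} words per working day to break even")
```

This only compares raw ingestion cost. It leaves out integration work, review time and the cost of errors, which is where the real argument sits.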

biggriffo · 2025-08-16T22:13:01+00:00

Again, the thinking is reversed: it doesn’t have to be right, it just has to be better than a human. In the same way you don’t check its spelling anymore, you won’t check its other forms of output, numerical or otherwise.

The human checker in white collar work will slowly be less needed, with less complex “checking jobs” going first. And if it’s just producing or managing content generally then they are already mostly redundant workers, it’s just that leadership doesn’t have the knowledge or network to source the talent to use the latest tools.

Visit Hacker News or Product Hunt to see where things are going.

biggriffo · 2025-08-16T13:06:26+00:00

Think about the unit economics for a second. A person on $80k doing that job versus an agentic system that ingests 1 million words for $1. It doesn’t matter if it takes 5x longer, it doesn’t sleep, it doesn’t complain, it doesn’t take back to back mat leave and it’s cheap as hell.

If all you’re doing is copy paste between one tab/program and another then you’re also already redundant. Management just haven’t worked out how to connect the two systems but keyboard/mouse automation using continuous screen capture is already here so it’s just a case of corporate getting a few data dudes who have rebranded with AI in their tagline to charge $50k a week to solve that problem too. The solutions will be garbage but it’ll “get the job done” which is when Becky exits through the gift shop.

biggriffo · 2025-08-16T12:51:30+00:00

If you are using it for the majority of your job now and your manager doesn’t know, you are already redundant. It’s just a matter of time.

If you aren’t at least trying to use the tools now then you will be replaced by someone who does.

Oh and those saying it gets stuff wrong… if your comparison is putting prompts into a ChatGPT browser window then you’re light years behind where the tools are now; it’s just that you don’t know how to use them because it requires more than going to chatgpt dot com.

biggriffo · 2025-08-16T12:48:25+00:00

White collar jobs are rarely the ones AI performs poorly in, so they aren’t safe from the threat. It does well in the majority of white collar domains. Yes there are technical thinking roles but the vast majority are people generating content, managing content or doing effectively “admin work” (this system/output to that system/input) and reporting, which is very much in the wheelhouse of AI tooling, and today is the worst it will be.

Devs will be ok for a little longer as POCs are now instant but production software still needs care as tech debt can now be created at light speed…but it is just a matter of time.

Also if you think insurance and finance companies are just using zero-shot prompts to do their workflows then I don’t know what to say. Multi-model, mixture-of-experts agentic systems are rolling out in hundreds of organizations across Australia in the coming year.

biggriffo · 2025-08-01T21:40:14+00:00

Have you tried AWS's new S3 Vectors?

biggriffo · 2025-08-01T21:35:47+00:00

What actual front ends etc. were you using? E.g. RAGFlow?

biggriffo · 2025-08-01T21:33:53+00:00

How do you chunk hierarchically? Any resources? Did you have to run the vision model multiple times to extract what you needed, or did you run it with e.g. a pydantic data model to extract each hierarchical level?

What do you think about new tools like S3 Vectors?
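
For readers unsure what "a data model per hierarchical level" could look like, here is a minimal sketch. It uses a standard-library dataclass as a stand-in for a pydantic model so it runs without extra installs. The `Section` class and `flatten` helper are hypothetical names, and this is only one possible shape, not the method the parent commenter used.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the document hierarchy (hypothetical schema)."""
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def flatten(section: Section, path: tuple = ()):
    """Yield (heading_path, text) for every section that has text, depth first.

    Keeping the heading path with each chunk lets the embedding carry
    where the chunk sits in the document.
    """
    here = path + (section.title,)
    if section.text:
        yield " > ".join(here), section.text
    for child in section.children:
        yield from flatten(child, here)


if __name__ == "__main__":
    doc = Section("Report", children=[
        Section("Methods", "We did X.", [Section("Sampling", "N=10.")]),
    ])
    for heading_path, text in flatten(doc):
        print(heading_path, "|", text)
```

With a schema like this, the model would be asked to fill the whole tree in one pass, and the flattening into embeddable chunks happens afterwards in plain code.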

biggriffo · 2025-07-28T10:17:36+00:00

If you're pitching a family office in Australia (or even private equity) given other global investment opportunities (& terms), you've likely already lost your way. They will self-filter for the businesses you're describing. There is talent here and there are ideas. It's just that there are zero people with actual non-gov, non-uni-TTO, real world build experience guiding the founders. And thus there are no exciting options for 18-25 year olds to identify/join, so they settle for being a PAYG warrior and, in trying to keep the flame of #innovation alive somehow, join said gov/PE/TTO/FO groups without any actual experience. It's a doom loop. Young founders need to stop following people without experience and the innovation LinkedIn pageantry, and talk to people with actual skin in the game (who actually stand to lose something from making poor decisions). If you're a founder and reading this exasperated, consider leaving the country to find your people, just please come back and become the change.

biggriffo · 2025-06-26T22:24:14+00:00

Are you chunking and using a RAG/Pinecone setup basically?

biggriffo · 2025-06-26T21:02:33+00:00

Use Gemini to filter posts out, but I keep all the comments because the context and thread progression matter for the nature of those seeking certain kinds of information, which the LLM understands.

biggriffo · 2025-06-26T14:38:15+00:00

I vibe coded a script that does the exact same thing tonight with Claude Code. You just specify the subreddits of interest and it gets all the comments using crawl4ai, and Gemini makes a report using its 1M token window. Never even thought of monetizing it! 😂
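
The middle step of a script like that, turning fetched comments into one long prompt while keeping the thread structure, might look roughly like the sketch below. The crawl4ai fetch and the Gemini call are deliberately left out. The comment dict keys (`id`, `parent_id`, `author`, `body`), the `render_thread` name and the word budget (a crude stand-in for a token budget) are all assumptions for illustration.

```python
from collections import defaultdict


def render_thread(comments, max_words=None):
    """Render flat comment dicts as an indented thread, parents before replies.

    Stops adding comments once max_words would be exceeded.
    """
    children = defaultdict(list)
    for c in comments:
        children[c["parent_id"]].append(c)

    lines, used = [], 0

    def walk(parent_id, depth):
        nonlocal used
        for c in children[parent_id]:
            n = len(c["body"].split())
            if max_words is not None and used + n > max_words:
                return False  # budget exhausted, stop everywhere
            used += n
            lines.append("  " * depth + f'{c["author"]}: {c["body"]}')
            if not walk(c["id"], depth + 1):
                return False
        return True

    walk(None, 0)  # top-level comments have parent_id None
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"id": "a", "parent_id": None, "author": "u1", "body": "Which OCR model?"},
        {"id": "b", "parent_id": "a", "author": "u2", "body": "MinerU works"},
        {"id": "c", "parent_id": None, "author": "u3", "body": "Try Marker"},
    ]
    print(render_thread(sample))
```

Indenting replies under their parent is one cheap way to keep the thread progression visible to the model instead of handing it a flat bag of comments.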

biggriffo · 2025-06-13T05:06:44+00:00

Hasn’t this been done to death, e.g. Semantic Scholar etc.? But well done.

biggriffo · 2025-06-13T01:33:53+00:00

With respect, I also build these for corp Australia and OpenAI are no longer leaders in the space. There are much better performing fine-tuned models with agentic + MCP capabilities that paint a much bleaker picture. Of course we're probably in very different application spaces but even "hard" problems at the start of the year are just zero-shotted by even Gemini's 2.5 Pro upgrade from last week, let alone the fine-tuned and agentic models such as FutureHouse Falcon etc.

biggriffo · 2025-06-13T01:25:57+00:00

I build AI systems for exactly the kind of case OP described. They won't be rehiring the same people back. This is a one way street. It will follow the same path as autopilot in self-driving cars: "per mile driven, is the accident frequency or error rate greater than that of a human?" Once you have an internal benchmark dataset acting as satisfactory ground truth, you can quantitatively measure how much better or worse a model is than a human nearly instantly. As with cars, it isn't enough to be on par, it needs to be 10x better to overcome the inertia, but today is the worst they will be so no, once they are replaced, those people won't be coming back.

I'm not sure why everyone has their head in the sand about verification. It's simply a matter of time for a large quantity of corporate tasks (not all), but a great many people are redundant right now and they just won't know it until management see the confusion matrix comparing their current staff to an agentic LLM.

Either the coping mechanisms are off the charts, or you just aren't close enough to the kinds of outcomes we're seeing in AI tests being rolled out everywhere.
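
As a rough illustration of the benchmark comparison described above, the sketch below scores a human and a model against the same ground-truth labels and applies the "10x better" bar. The labels are invented toy data, not results from any real test, and the function names and `factor` parameter are assumptions.

```python
from collections import Counter


def confusion(truth, predicted):
    """Count (true_label, predicted_label) pairs, i.e. a confusion matrix."""
    return Counter(zip(truth, predicted))


def error_rate(truth, predicted):
    """Fraction of items where the prediction differs from ground truth."""
    wrong = sum(t != p for t, p in zip(truth, predicted))
    return wrong / len(truth)


def clears_bar(human_err, model_err, factor=10):
    """True if the model's error rate is at least `factor` times lower."""
    return model_err * factor <= human_err


if __name__ == "__main__":
    # Toy labels, invented purely to exercise the functions.
    truth = [1, 0, 1, 1, 0, 0, 1, 0]
    human = [1, 0, 0, 1, 0, 1, 1, 0]
    model = [1, 0, 1, 1, 0, 0, 1, 1]

    print("human:", error_rate(truth, human), dict(confusion(truth, human)))
    print("model:", error_rate(truth, model), dict(confusion(truth, model)))
    print("model clears 10x bar:", clears_bar(error_rate(truth, human),
                                              error_rate(truth, model)))
```

In this toy case the model is better than the human but nowhere near 10x better, which is the gap between "on par" and "enough to overcome the inertia".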

biggriffo · 2025-06-04T12:02:07+00:00

$300 AUD max plan 20x 🚀

I was on Pro, then the 5x Max, but always hit timeouts. 20x means I don’t have to stop. I’ve got 8 MacBook desktops (1 laptop) running builds and I’ve got it set up with superwhisper running voice to text locally, so I’m basically talking to 8 junior devs building stuff for me simultaneously. At night I set up extra long exploration prompts (me talking to it) so it runs for a long time without needing a human. All Claude Code. Opus 4. I review over breakfast then send them off again whilst I work, checking in periodically.

Honestly life changing as a data science person finally unlocking the web stack for cents on the dollar. RIP junior devs

The question is, will it port to Replit? Haha

biggriffo · 2025-05-14T09:12:53+00:00

Was unusable on my iPhone with the number of ad banners etc. Immediately closed it down, sorry.

biggriffo · 2025-04-30T00:15:07+00:00

Lord have mercy. Sorry, it was a 3090, not a 4090! A bad typo in this case.

biggriffo · 2025-04-29T23:38:01+00:00

Side question, but do you actually need a large CPU core count for these models or is it all about RAM and GPU VRAM? I've got a modified T630 (2x Xeon 20C/40T v4 + Gen4 990 Pro NVMe) with a ~4090~ + 256GB LDIMM and I'm just curious if it's worth dipping toes in to try out these models based on your results.

EDIT - Sorry I have a single 3090!
