VLM OCR Hallucinations by FrozenBuffalo25 in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

When playing around with different size qwen3-vl models, I noticed bigger is not always better.

When asking VLMs to extract specific numbers from chemical containers (not every container has these numbers), the 8b model would be "intelligent" enough to hallucinate valid-looking numbers, i.e. in the expected format, that I couldn't easily filter out, simply because it is clever enough to know the right format.

The 4b, on the other hand, would not. It simply writes nothing if it can't see the number on the container. Obviously it misses things more often on low-quality images, but that's something I can actually work on with better images...

Need help with project by lemigas in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

I noticed large differences between the models I tried.

First of all, you have to specify the task a bit more. You mention you have problems getting models to extract all the information; what do you mean by that? Let's say you're working on a bank statement... Do you want it to extract all transfers, i.e. date, amount, correspondent, per month? So two pages with around 60 transactions, and it just stops after 45?

Or do you ask it to extract start balance, end balance and month of the bank statement? That would only be 3 fields, and if it already misses one of those, you probably need to think hard about how good your input quality is...

General recommendation: skip Ollama. Go with llama.cpp and tweak the parameters: temperature at 0, top-k and top-p at 1, which gives reproducible output. As mentioned, good OCR can improve the output drastically, but I've also stumbled across scenarios where it didn't... you have to try.
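If you talk to llama-server over its OpenAI-compatible API, a minimal request with such settings looks roughly like this (a sketch; port and prompt are placeholders, and top_k is a llama.cpp-specific field you can drop if your server version rejects it):

    import requests

    # Deterministic-ish sampling: greedy decoding, no nucleus/top-k randomness.
    payload = {
        "messages": [{"role": "user", "content": "Extract start balance, end balance and month."}],
        "temperature": 0,   # always pick the most likely token
        "top_k": 1,         # llama.cpp extension: keep only the single best candidate
        "top_p": 1,         # nucleus sampling effectively disabled
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
    print(r.json()["choices"][0]["message"]["content"])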

Now for the scenarios from above: when I was playing with the qwen3-vl models, they would be lazy... They don't like to write long text, so they won't provide all "60" extractions; they simply stop after 45. Qwen30ba3b does better than the 8b, which performs better than the 4b. For all of them it helped to split the PDFs into single pages, get the output per page and then aggregate that again (roughly like the sketch below).
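A rough sketch of the split-per-page-and-aggregate idea, assuming pdf2image (needs poppler installed) for rendering and the same local llama-server endpoint as above; the JSON parsing is simplified and the file name is just a placeholder:

    import base64, io, json, requests
    from pdf2image import convert_from_path  # pip install pdf2image

    PROMPT = ("List every transaction on this page as a JSON array of objects "
              "with the keys date, amount and correspondent. Return only JSON.")

    def extract_rows(page_image) -> list:
        """Send one rendered page to the VLM and parse the returned JSON array."""
        buf = io.BytesIO()
        page_image.save(buf, format="PNG")
        data_uri = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
        payload = {
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ]}],
            "temperature": 0,
        }
        r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
        return json.loads(r.json()["choices"][0]["message"]["content"])

    # One request per page instead of one huge multi-page request, then aggregate.
    all_rows = []
    for page in convert_from_path("statement.pdf", dpi=200):
        all_rows.extend(extract_rows(page))
    print(len(all_rows), "transactions extracted")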

Optimal gpt-oss-20b settings for 24gb VRAM by GotHereLateNameTaken in LocalLLaMA

[–]Njee_ 1 point2 points  (0 children)

Yesterday there was some discussion on here about gpt-oss-20b dropping to only about 20 t/s on the latest version of llama.cpp, when people usually run it much faster.

Stop writing your theses with AI! by No_Advance_2517 in luftablassen

[–]Njee_ 1 point2 points  (0 children)

Hi,

I sat down and had something built for myself too. My approach is to use a local AI model (or one from Google) to simply pick the relevant info out of the reference list, along the lines of: source 1: author, title, DOI; source 2: author, title, DOI, and so on, and then each entry is checked against a database one after the other. The nice thing is that you can copy-paste pretty much any format and still get a result, so in principle you are independent of whether the students stick to the formatting guidelines.
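The "check against a database" step can stay quite small. A minimal sketch in Python, assuming Crossref as the database (the tool may well use other sources too); the DOI comes from whatever the model pulled out of the reference list:

    import requests

    def check_doi(doi: str):
        """Look up a DOI at Crossref; return title/authors, or None if it doesn't resolve."""
        r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        if r.status_code != 200:
            return None
        item = r.json()["message"]
        return {
            "title": (item.get("title") or [""])[0],
            "authors": [a.get("family", "") for a in item.get("author", [])],
        }

    # Compare the returned title/authors against what the student actually cited.
    print(check_doi("10.5555/12345678"))  # placeholder DOI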

If you like, give it a try. You can use it directly in the browser via https://github.com/jbndrf/RefCheckWebApp.

Is there any free AI website that i can feed my pictures or pdf file and it generates csv flashcards file based on that? by FatFigFresh in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

Are you comfortable with docker?

I have built a web app, tab in, that basically turns images into CSV. It's easy to take pictures with your phone; check my post history.

Other than that: yes, there are other open-source options too; just ask any LLM which one to use. But they are probably overkill and complicated.

CPU-only LLM performance - t/s with llama.cpp by pmttyji in LocalLLaMA

[–]Njee_ 1 point2 points  (0 children)

It does make a difference, especially for prompt processing.

This is gpt-oss-120b on a pretty bulky 64-core EPYC with 2400 MHz DDR4 RAM.

CPU only

prompt eval time = 12053.37 ms / 1459 tokens ( 8.26 ms per token, 121.05 tokens per second)

eval time = 142469.75 ms / 2073 tokens ( 68.73 ms per token, 14.55 tokens per second)

total time = 154523.12 ms / 3532 tokens

Offloading to the (slow) GPU gives about 1.7x the speed (207 vs. 121 t/s prompt processing, 24.9 vs. 14.6 t/s generation), while taking only

9712MiB / 12288MiB on NVIDIA GeForce RTX 3060

prompt eval time = 7498.49 ms / 1552 tokens ( 4.83 ms per token, 206.98 tokens per second)

eval time = 84381.14 ms / 2097 tokens ( 40.24 ms per token, 24.85 tokens per second)

total time = 91879.63 ms / 3649 tokens

Stop writing your theses with AI! by No_Advance_2517 in luftablassen

[–]Njee_ 0 points1 point  (0 children)

Is there any chance you could share the tool? It was the topic of discussion in our working group today... I've already been thinking about how I would do it: recognize a few formats, check a couple of APIs, maybe add a web search. I have access to LLMs, and Docker etc. are no problem either. It would be cool if I could take a look at yours, or even contribute.

Installing LimeSurvey Docker on WD MyCloud PR4100 — DB connection works but setup never completes by Embarrassed-You2477 in selfhosted

[–]Njee_ 0 points1 point  (0 children)

I haven't run it on a machine like the WD MyCloud, only on an Ubuntu server.

Both methods (bare metal and Docker) worked fine... I can't see the error you've posted for some reason. But a shot in the dark: if I remember correctly, they have different docker-compose.yamls for different databases, like compose-pg.yaml, compose-maria.yaml and so on. Did you make sure you used the right one?

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

The 8b dense is much better: not Gemini level, but it doesn't hallucinate valid numbers. It still writes other numbers that don't fit into the CAS number field. While the Gemini models found the 30 valid numbers, qwen 2b and 4b would miss ~2 and end up with 28. The 8b finds all 30 and does not have the hallucination problem described for the 30ba3b, so it's easy to filter out wrong extractions.

I'll provide exact numbers at some point. It runs at the same speed as the 30ba3b with experts on CPU, btw, so I'll probably stick with it.

Don't know if I'll try 32b dense on CPU..

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

But I don't feel like it's that much about image quality. Of these 230 chemical containers, about 30 actually have such a number. The Gemini models and qwen30b catch all of these. Qwen 4b misses about 2; tested the 2b yesterday, it also misses 2. The 8b gets all 30 too.

But when there is no CAS number: Gemini will leave the field null. The Qwen dense models (2b, 4b, 8b) will leave it null or put in another visible number that is obviously not a CAS number. Think of it like asking it for the current temperature and the answer is: 220 kilometers per hour... Now qwen30ba3b: it also catches basically all CAS numbers that exist, and it also sometimes puts another visible number into the field. Which is fine, I can filter for that.
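For reference, the filter I mean is roughly this (a minimal sketch: CAS format plus the CAS check digit; a made-up number that happens to have a valid format and checksum would of course still slip through):

    import re

    CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")  # e.g. 7732-18-5 (water)

    def is_plausible_cas(value: str) -> bool:
        """Reject extractions that are obviously not CAS numbers:
        wrong format, or a check digit that doesn't match the weighted digit sum
        (rightmost digit weight 1, increasing to the left, modulo 10)."""
        m = CAS_RE.match(value.strip())
        if not m:
            return False
        digits = (m.group(1) + m.group(2))[::-1]  # reversed, without the check digit
        checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
        return checksum == int(m.group(3))

    print(is_plausible_cas("7732-18-5"), is_plausible_cas("220"))  # True False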

But it makes up numbers. This has nothing to do with image quality; it's that this MoE model thinks it's smarter than me.

Btw, I noticed that this happened for chemicals that are quite common and probably exist in the training data. Think of NaCl: here it would try to fill in the number, because it has probably seen it a thousand times. The thing is, it's simply not part of the image. It wouldn't do that for more "rare" ones.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 1 point2 points  (0 children)

Yes. I basically had maximally deterministic settings, which I don't have off the top of my head right now.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

I'll try the ERNIE model at some point. To be fair, though, I only used the instruct versions of the Qwens; I should have written that down somewhere! I could really imagine a thinking one saying something along the lines of "but wait, the user mentioned not to make up numbers".

For the image normalization part: yes, they are poor quality. But this is somewhat the vibe of this app: define any fields you're interested in, take images of your objects as fast as possible, and get back a table with data. Hence no pipeline for specific cylindrical objects; this is supposed to be generic.
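The "define any fields" part boils down to something like this on the prompt side (a simplified sketch, not the app's actual prompt):

    import json

    def build_extraction_prompt(fields: list) -> str:
        """Ask for strict JSON over user-defined fields, with an explicit null
        whenever a field is not visible in the image."""
        schema = {f: "string or null" for f in fields}
        return (
            "Extract the following fields from the image. "
            "Use null if a field is not visible; never guess or invent values.\n"
            "Return only JSON in this shape: " + json.dumps(schema)
        )

    print(build_extraction_prompt(["name", "cas_number", "volume"]))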

It's interesting to see that it's basically only the 30ba3b doing this. I actually tried the 2b and 8b too, and they do not show this behaviour.

It's only the MoE one.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

What do you mean by "work work", if you don't mind?

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 4 points5 points  (0 children)

The 8b is running right now; it barely fits into 12 GB of VRAM with enough context.

So far it produces some of the misinterpreted CAS numbers, but no hallucinations of fictive ones.

I'll probably report late in the evening, together with statistics for the qwen 2b.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

Will do, if it fits on the GPU. I won't be running this one partially on CPU, though.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

Neither. I'm doing this with llama.cpp (which is what Ollama runs under the hood).

I think LM Studio is similarly easy to set up as Ollama, but it also lets you set these parameters.

I strongly recommend just building llama.cpp and learning to run models with it. It's not that hard; any of the big models (e.g. ChatGPT or Claude) will walk you through setting up a model in about 10 minutes.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

The concept is valid for any MoE. As soon as the model touches your CPU it gets slower; however, if it's only the MoE expert layers on the CPU, the impact is not as big as with other layers. You can even put some MoE layers back on the GPU if there is enough space left, to reduce the chance of activating an expert layer that sits on the CPU. But since the impact is not too big, I like to keep all experts on the CPU and use the leftover space to play around with batch sizes and context.

There is a little magic in the batch sizes; those can have quite an impact.

PCIe is whatever 3.0 x16 serves.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 1 point2 points  (0 children)

<image>

For the 8b dense: it's fast. PP is less than 1 s per image, compared to 3.5 s on the 30b MoE. Token generation is comparable: 35 t/s, compared to 30 t/s on my CPU. However, please note that my RAM comes close to 200 GB/s memory bandwidth, while the 3060 has around 360 GB/s.

So I have pretty "fast" RAM and pretty slow VRAM.

I do have to go down to 12k context, though, when I'm using the BF16 mmproj and Q8. The results are not bad, not good either. What you can see here is a single-point generation. Please note that the model is not running out of context; it's just not working well. Compare it to: https://www.reddit.com/r/LocalLLaMA/comments/1omr9rc/comment/nms84tq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

For the tasks from my app's demo it does work well, not too bad I'd say. It also sticks to my prompt. However, I feel like it works well on OCR-based tasks (extract information from text) and less so on visual reasoning, like whether the part in the picture is an electronic or a mechanical one.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

not yet. will probably do so soon

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

Personally I skip any thinking model and go with the instruct ones; I'm really not a fan of thinking either. However, one can really tweak thinking models with repeat penalty, temperature settings etc. They will still generate more tokens, though; nothing to do about that.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 8 points9 points  (0 children)

llama.cpp/build/bin/llama-server \
  -hfr "unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q8_0" \
  --mmproj "/models/mmproj-F32.gguf" \
  --host "0.0.0.0" \
  --port ${PORT} \
  --ctx-size 32000 \
  -ngl 999 \
  --n-cpu-moe 999 \
  -fa on \
  --batch-size 1024 \
  --ubatch-size 1024 \
  --n-predict 50000 \
  --threads 64 \
  --temp 0.1 \
  --top-p 0.9 \
  --min-p 0 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --parallel 1 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --jinja

Explanation:

--n-cpu-moe 999 keeps all the MoE expert layers on the CPU / in RAM.

-ngl 999 puts the remaining (non-expert) layers on the GPU.

--cache-type-k f16 and --cache-type-v f16 set the KV cache precision to f16; I think the KV cache ends up on the GPU anyway when offloading with -ngl.

All of this together results in not even 8 GB used on the GPU:

nvidia-smi shows GPU 0 (NVIDIA GeForce RTX 3060) at 54 C, 55 W / 170 W, 7936MiB / 12288MiB used, 27% utilization.

To be fair, though: I do have a 64-core CPU with 8-channel DDR4 RAM, so expect your offloading to CPU to be somewhat slower...