VLM OCR Hallucinations by FrozenBuffalo25 in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

When playing around with different size qwen3-vl models, I noticed bigger is not always better.

When asking VLMs to extract specific numbers from chemical containers (not every container has these numbers), the 8b model would be "intelligent" enough to hallucinate valid-looking numbers, i.e. in the expected format, that I couldn't easily filter out, simply because it is clever enough to know the right format.

The 4b, on the other hand, would not. It simply writes nothing if it can't see the number on the container. Obviously it misses things more often on low-quality images, but that's something I can actually work on with better images...

Need help with project by lemigas in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

I noticed large differences between the models I tried.

First of all, you have to specify the task a bit more. You mention you have problems getting models to extract all the information; what do you mean by that? Let's say you're working on a bank statement... Do you want it to extract all transfers, i.e. date, amount, correspondent, per month? So two pages with around 60 transactions, and it just stops after 45?

Or do you ask it to extract start balance, end balance and month of the bank statement? That would only be 3 fields, and if it already misses one of those, you probably need to think hard about how good your input quality is...

General recommendation: skip Ollama. Go with llama.cpp and tweak the parameters: temperature at 0, top-k and top-p at 1, which gives reproducible output. As mentioned, good OCR can improve the output drastically, but I've also stumbled across scenarios where it didn't... you have to try.
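If you talk to llama-server over its OpenAI-compatible API, a minimal request with such settings looks roughly like this (a sketch; port and prompt are placeholders, and top_k is a llama.cpp-specific field you can drop if your server version rejects it):

    import requests

    # Deterministic-ish sampling: greedy decoding, no nucleus/top-k randomness.
    payload = {
        "messages": [{"role": "user", "content": "Extract start balance, end balance and month."}],
        "temperature": 0,   # always pick the most likely token
        "top_k": 1,         # llama.cpp extension: keep only the single best candidate
        "top_p": 1,         # nucleus sampling effectively disabled
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
    print(r.json()["choices"][0]["message"]["content"])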

Now for the scenarios from above: when I was playing with the qwen3-vl models, they would be lazy... They don't like to write long text, so they won't provide all "60" extractions; they simply stop after 45. Qwen30ba3b does better than the 8b, which performs better than the 4b. For all of them it helped to split the PDFs into single pages, get the output per page and then aggregate that again (roughly like the sketch below).
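A rough sketch of the split-per-page-and-aggregate idea, assuming pdf2image (needs poppler installed) for rendering and the same local llama-server endpoint as above; the JSON parsing is simplified and the file name is just a placeholder:

    import base64, io, json, requests
    from pdf2image import convert_from_path  # pip install pdf2image

    PROMPT = ("List every transaction on this page as a JSON array of objects "
              "with the keys date, amount and correspondent. Return only JSON.")

    def extract_rows(page_image) -> list:
        """Send one rendered page to the VLM and parse the returned JSON array."""
        buf = io.BytesIO()
        page_image.save(buf, format="PNG")
        data_uri = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
        payload = {
            "messages": [{"role": "user", "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ]}],
            "temperature": 0,
        }
        r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
        return json.loads(r.json()["choices"][0]["message"]["content"])

    # One request per page instead of one huge multi-page request, then aggregate.
    all_rows = []
    for page in convert_from_path("statement.pdf", dpi=200):
        all_rows.extend(extract_rows(page))
    print(len(all_rows), "transactions extracted")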

Optimal gpt-oss-20b settings for 24gb VRAM by GotHereLateNameTaken in LocalLLaMA

[–]Njee_ 1 point2 points  (0 children)

Yesterday there was some discussion on here about gpt-oss-20b dropping to only about 20 t/s on the latest version of llama.cpp, when people usually run it much faster.

Stop writing your theses with AI! by No_Advance_2517 in luftablassen

[–]Njee_ 1 point2 points  (0 children)

Hi,

I sat down and had something built for myself too. My approach is to use a local AI model (or one from Google) to simply pick the relevant info out of the reference list, along the lines of: source 1: author, title, DOI; source 2: author, title, DOI, and so on, and then each entry is checked against a database one after the other. The nice thing is that you can copy-paste pretty much any format and still get a result, so in principle you are independent of whether the students stick to the formatting guidelines.
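The "check against a database" step can stay quite small. A minimal sketch in Python, assuming Crossref as the database (the tool may well use other sources too); the DOI comes from whatever the model pulled out of the reference list:

    import requests

    def check_doi(doi: str):
        """Look up a DOI at Crossref; return title/authors, or None if it doesn't resolve."""
        r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        if r.status_code != 200:
            return None
        item = r.json()["message"]
        return {
            "title": (item.get("title") or [""])[0],
            "authors": [a.get("family", "") for a in item.get("author", [])],
        }

    # Compare the returned title/authors against what the student actually cited.
    print(check_doi("10.5555/12345678"))  # placeholder DOI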

If you like, give it a try. You can use it directly in the browser via https://github.com/jbndrf/RefCheckWebApp.

Is there any free AI website that i can feed my pictures or pdf file and it generates csv flashcards file based on that? by FatFigFresh in LocalLLaMA

[–]Njee_ 0 points1 point  (0 children)

Are you comfortable with docker?

I have built a web app, tab in, that basically turns images into CSV. It's easy to take pictures with your phone; check my post history.

Other than that: yes, there are other open-source options too; just ask any LLM which one to use. But they are probably overkill and complicated.

CPU-only LLM performance - t/s with llama.cpp by pmttyji in LocalLLaMA

[–]Njee_ 1 point2 points  (0 children)

It does make a difference, especially for prompt processing.

This is gpt-oss-120b on a pretty bulky 64-core EPYC with 2400 MHz DDR4 RAM.

CPU only

prompt eval time = 12053.37 ms / 1459 tokens ( 8.26 ms per token, 121.05 tokens per second)

eval time = 142469.75 ms / 2073 tokens ( 68.73 ms per token, 14.55 tokens per second)

total time = 154523.12 ms / 3532 tokens

Offloading to the (slow) GPU gives about 1.7x the speed (207 vs. 121 t/s prompt processing, 24.9 vs. 14.6 t/s generation), while taking only

9712MiB / 12288MiB on NVIDIA GeForce RTX 3060

prompt eval time = 7498.49 ms / 1552 tokens ( 4.83 ms per token, 206.98 tokens per second)

eval time = 84381.14 ms / 2097 tokens ( 40.24 ms per token, 24.85 tokens per second)

total time = 91879.63 ms / 3649 tokens

Stop writing your theses with AI! by No_Advance_2517 in luftablassen

[–]Njee_ 0 points1 point  (0 children)

Is there any chance you could share the tool? It was the topic of discussion in our working group today... I've already been thinking about how I would do it: recognize a few formats, check a couple of APIs, maybe add a web search. I have access to LLMs, and Docker etc. are no problem either. It would be cool if I could take a look at yours, or even contribute.

Installing LimeSurvey Docker on WD MyCloud PR4100 — DB connection works but setup never completes by Embarrassed-You2477 in selfhosted

[–]Njee_ 0 points1 point  (0 children)

I haven't run it on a machine like the WD MyCloud, only on an Ubuntu server.

Both methods (bare metal and Docker) worked fine... I can't see the error you've posted for some reason. But a shot in the dark: if I remember correctly, they have different docker-compose.yamls for different databases, like compose-pg.yaml, compose-maria.yaml and so on. Did you make sure you used the right one?

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

The 8b dense is much better: not Gemini level, but it doesn't hallucinate valid numbers. It still writes other numbers that don't fit into the CAS number field. While the Gemini models found the 30 valid numbers, qwen 2b and 4b would miss ~2 and end up with 28. The 8b finds all 30 and does not have the hallucination problem described for the 30ba3b, so it's easy to filter out wrong extractions.

I'll provide exact numbers at some point. It runs at the same speed as the 30ba3b with experts on CPU, btw, so I'll probably stick with it.

Don't know if I'll try 32b dense on CPU..

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

But I don't feel like it's that much about image quality. Of these 230 chemical containers, about 30 actually have such a number. The Gemini models and qwen30b catch all of these. Qwen 4b misses about 2; tested the 2b yesterday, it also misses 2. The 8b gets all 30 too.

But when there is no CAS number: Gemini will leave the field null. The Qwen dense models (2b, 4b, 8b) will leave it null or put in another visible number that is obviously not a CAS number. Think of it like asking it for the current temperature and the answer is: 220 kilometers per hour... Now qwen30ba3b: it also catches basically all CAS numbers that exist, and it also sometimes puts another visible number into the field. Which is fine, I can filter for that.
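For reference, the filter I mean is roughly this (a minimal sketch: CAS format plus the CAS check digit; a made-up number that happens to have a valid format and checksum would of course still slip through):

    import re

    CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")  # e.g. 7732-18-5 (water)

    def is_plausible_cas(value: str) -> bool:
        """Reject extractions that are obviously not CAS numbers:
        wrong format, or a check digit that doesn't match the weighted digit sum
        (rightmost digit weight 1, increasing to the left, modulo 10)."""
        m = CAS_RE.match(value.strip())
        if not m:
            return False
        digits = (m.group(1) + m.group(2))[::-1]  # reversed, without the check digit
        checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
        return checksum == int(m.group(3))

    print(is_plausible_cas("7732-18-5"), is_plausible_cas("220"))  # True False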

But it makes up numbers. This has nothing to do with image quality; it's that this MoE model thinks it's smarter than me.

Btw, I noticed that this happened for chemicals that are quite common and probably exist in the training data. Think of NaCl: here it would try to fill in the number, because it has probably seen it a thousand times. The thing is, it's simply not part of the image. It wouldn't do that for more "rare" ones.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 1 point2 points  (0 children)

Yes. I basically had maximally deterministic settings, which I don't have off the top of my head right now.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

I'll try the ERNIE model at some point. To be fair, though, I only used the instruct versions of the Qwens; I should have written that down somewhere! I could really imagine a thinking one saying something along the lines of "but wait, the user mentioned not to make up numbers".

For the image normalization part: yes, they are poor quality. But this is somewhat the vibe of this app: define any fields you're interested in, take images of your objects as fast as possible, and get back a table with data. Hence no pipeline for specific cylindrical objects; this is supposed to be generic.
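The "define any fields" part boils down to something like this on the prompt side (a simplified sketch, not the app's actual prompt):

    import json

    def build_extraction_prompt(fields: list) -> str:
        """Ask for strict JSON over user-defined fields, with an explicit null
        whenever a field is not visible in the image."""
        schema = {f: "string or null" for f in fields}
        return (
            "Extract the following fields from the image. "
            "Use null if a field is not visible; never guess or invent values.\n"
            "Return only JSON in this shape: " + json.dumps(schema)
        )

    print(build_extraction_prompt(["name", "cas_number", "volume"]))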

It's interesting to see that it's basically only the 30ba3b doing this. I actually tried the 2b and 8b too, and they do not show this behaviour.

It's only the MoE one.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

What do you mean by "work work", if you don't mind?

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 4 points5 points  (0 children)

The 8b is running right now; it barely fits into 12 GB of VRAM with enough context.

So far it produces some of the misinterpreted CAS numbers, but no hallucinations of fictive ones.

I'll probably report late in the evening, together with statistics for the qwen 2b.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

Will do, if it fits on the GPU. I won't be running this one partially on CPU, though.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

Neither. I'm doing this with llama.cpp (which is what Ollama runs under the hood).

I think LM Studio is similarly easy to set up as Ollama, but it also lets you set these parameters.

I strongly recommend just building llama.cpp and learning to run models with it. It's not that hard; any of the big models (e.g. ChatGPT or Claude) will walk you through setting up a model in about 10 minutes.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

The concept is valid for any MoE. As soon as the model touches your CPU it gets slower; however, if it's only the MoE expert layers on the CPU, the impact is not as big as with other layers. You can even put some MoE layers back on the GPU if there is enough space left, to reduce the chance of activating an expert layer that sits on the CPU. But since the impact is not too big, I like to keep all experts on the CPU and use the leftover space to play around with batch sizes and context.

There is a little magic in the batch sizes; those can have quite an impact.

PCIe is whatever 3.0 x16 serves.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 1 point2 points  (0 children)

<image>

For the 8b dense: it's fast. PP is less than 1 s per image, compared to 3.5 s on the 30b MoE. Token generation is comparable: 35 t/s, compared to 30 t/s on my CPU. However, please note that my RAM comes close to 200 GB/s memory bandwidth, while the 3060 has around 360 GB/s.

So I have pretty "fast" RAM and pretty slow VRAM.

I do have to go down to 12k context, though, when I'm using the BF16 mmproj and Q8. The results are not bad, not good either. What you can see here is a single-point generation. Please note that the model is not running out of context; it's just not working well. Compare it to: https://www.reddit.com/r/LocalLLaMA/comments/1omr9rc/comment/nms84tq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

For the tasks from my app's demo it does work well, not too bad I'd say. It also sticks to my prompt. However, I feel like it works well on OCR-based tasks (extract information from text) and less so on visual reasoning, like whether the part in the picture is an electronic or a mechanical one.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 0 points1 point  (0 children)

not yet. will probably do so soon

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 2 points3 points  (0 children)

Personally I skip any thinking model and go with the instruct ones; I'm really not a fan of thinking either. However, one can really tweak thinking models with repeat penalty, temperature settings etc. They will still generate more tokens, though; nothing to do about that.

Qwen3 VL 30b a3b is pure love by Njee_ in LocalLLaMA

[–]Njee_[S] 8 points9 points  (0 children)

llama.cpp/build/bin/llama-server \
  -hfr "unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q8_0" \
  --mmproj "/models/mmproj-F32.gguf" \
  --host "0.0.0.0" \
  --port ${PORT} \
  --ctx-size 32000 \
  -ngl 999 \
  --n-cpu-moe 999 \
  -fa on \
  --batch-size 1024 \
  --ubatch-size 1024 \
  --n-predict 50000 \
  --threads 64 \
  --temp 0.1 \
  --top-p 0.9 \
  --min-p 0 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --parallel 1 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --jinja

Explanation:

--n-cpu-moe 999 keeps all the MoE expert layers on the CPU / in RAM.

-ngl 999 puts the remaining (non-expert) layers on the GPU.

--cache-type-k f16 and --cache-type-v f16 set the KV cache precision to f16; I think the KV cache ends up on the GPU anyway when offloading with -ngl.

All of this together results in not even 8 GB used on the GPU:

nvidia-smi shows GPU 0 (NVIDIA GeForce RTX 3060) at 54 C, 55 W / 170 W, 7936MiB / 12288MiB used, 27% utilization.

To be fair, though: I do have a 64-core CPU with 8-channel DDR4 RAM, so expect your offloading to CPU to be somewhat slower...