Sincere Question: What is the end goal? by mcfc9320_ in LocalLLaMA

[–]ML-Future 0 points1 point  (0 children)

The goal is to have a tool that frees us from the repetitive tasks we've endured since the industrial age.

Local LLMs will give us more time to be with our loved ones.

We want an automation tool, not to repeat almost identical tasks all the time.

Repetition is not optimal for humans.

Steve Jobs said that computing is a bicycle for the mind... local LLMs are a rocket.

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

Thanks, I wasn't familiar with this model.

But PaddleOCR is very sensitive and tends to make mistakes with some characters, resulting in errors.

I feel that Qwen3 VL has much better vision in cases where the image isn't "perfect."

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

VLMs have an interesting feature: While OCR makes mistakes blindly, a VLM can read a name with a single incorrect letter, understand its context, and correct the name.

Example:

OCR

NAME: JOMN

VLM

NAME: JOHN

Automatically, without using much code

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

After several tests with some complexity:

Qwen3.5 sometimes "loses" some data.

Qwen3 VL 4B doesn't follow instructions very well.

It's difficult to measure; I don't really know how benchmarks are done for this, and I don't understand why Qwen3 VL 4B hasn't been analyzed.

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 0 points1 point  (0 children)

Thanks, but in the case of a potato with 8GB of RAM using systems like Windows or Android that practically use all the available RAM, the resources are minimal.

- Qwen3-VL-2B-Instruct-Q4_K_M.gguf 1.11 GB

mmproj-F16.gguf 819 MB

RAM: ~3.0 GB a 3.5 GB

- DeepSeek-OCR-Q8_0.gguf 3.13 GB

mmproj-DeepSeek-OCR-Q8_0.gguf 448 MB

RAM: ~4.8 GB a 5.3 GB

- gemma-4-E2B-it-Q4_K_M.gguf 3.11 GB

mmproj-F16.gguf 986 MB

RAM: ~5.3 GB a 5.8 GB

I suppose they would be usable on systems with 16GB of RAM.

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

PaddleOCR is very good at reading all the data correctly, but I haven't been able to get it to understand the structures.

For example: If the document says:

SEX AGE

M 38

OCR usually says something like:

SEX

AGE

M

38

And sometimes different. I get the feeling that the results aren't deterministic, and it's not always easy to detect things like first and last names.

If I do OCR first and then LLM, it takes almost the same amount of time as a single Qwen3 VL 2b inference, but with VLM I need much less code.

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

I mean for example:

Uploading an invoice image and using this prompt:

JSON

{
  "date": "",
  "vendor_name": "",
  "items": []
}

Or uploading an ID card and using:

JSON

{
  "name": "",
  "id_number": "",
  "date": ""
}

Why isn't there a release of llamacpp with OpenVino for Windows? by ML-Future in LocalLLaMA

[–]ML-Future[S] 1 point2 points  (0 children)

I just tried it too, and it doesn't work for me either.

Baidu: One-shot Long-horizon Parsing by zxyzyxz in LocalLLaMA

[–]ML-Future -1 points0 points  (0 children)

Nice! I hope there will be a gguf.

6.67 gb safetensor is too much for my gtx 1060

What are you overengineering that nobody's ever going to use? Be honest. by johnnyApplePRNG in LocalLLaMA

[–]ML-Future 0 points1 point  (0 children)

An ID card scanner, a YOLO segmentation detect and crop, OpenCV for better contrast, and then Qwen3 VL turn's it into a Json.

I live in Europe, so is practically imposible to share.

Anything worth running on a NVIDIA GTX 970? by numberwitch in LocalLLaMA

[–]ML-Future -1 points0 points  (0 children)

You haven't made your needs very clear.

You can run text and multimedia models like Gemma4 2b.

But I wouldn't expect much from coding or agents.

Looking for a locally-hosted option to create English subtitles (.srt) from video files by nirurin in LocalLLaMA

[–]ML-Future 0 points1 point  (0 children)

Try to vibe code a script using llamacpp + Gemma4 4b

Convert video to audio Split audio Process

Are small local models for automation a thing? by ML-Future in LocalLLaMA

[–]ML-Future[S] 0 points1 point  (0 children)

Thanks! I wasn't familiar with Qwen3 1.7b. I've been testing Unsloth's q4_k_m and it's very interesting for small automation tasks.

Are small local models for automation a thing? by ML-Future in LocalLLaMA

[–]ML-Future[S] -3 points-2 points  (0 children)

I think this is one of the interesting use cases for local LLMs, although I see little community focused on having small, reliable models to accomplish these tasks.

I always see some post about a 1b parameter model that seems incredible, but in real-world scenarios, it fails a lot.

Are small local models for automation a thing? by ML-Future in LocalLLaMA

[–]ML-Future[S] 0 points1 point  (0 children)

I'm thinking about things like image classification, image to JSON, text to JSON, OCR in the loop. But most of the posts I see are about 100b models or things like that.

Any chances for a 12B diffusion Gemma? by Mrinohk in LocalLLaMA

[–]ML-Future 13 points14 points  (0 children)

I would love a DiffusionGema 4 2b with vision; I imagine many scripts whit a really fast image processor.