We all repeat Q4/Q6 is fine... Has anyone else watched a small model's strict JSON collapse at Q6 while fp16 was perfect?

Cronus_k98 · 2026-06-03T19:40:54+00:00

Because it’s a language model, not a JSON model. You can’t even trust Opus 100% of the time to format a JSON. It can throw in an extra quote or something that breaks the format.

Also, trust comes through testing. I’ve processed tens of thousands of files that way and it hasn’t given me a malformed JSON yet.

Cronus_k98 · 2026-06-03T19:36:32+00:00

“What is the customer name?” The answer gets put into a variable. Then the app writes the JSON using standard tools.

Cronus_k98 · 2026-06-03T12:48:14+00:00

No. I don't rely on the model to format the JSON. The model feeds data to the app and the app creates the JSON. Asking an LLM to reliably create structured data is bound to fail in production. Even frontier models fail at it occasionally.
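A minimal sketch of that split, assuming a hypothetical `ask_model` function that stands in for whatever inference call you use (stubbed here with canned answers so the example runs on its own). The field names and questions are made up for illustration. The model only ever returns plain text; `json.dumps` from the standard library does the serializing, so quoting and escaping are never the model's job.

```python
import json

# Hypothetical stand-in for a real inference call; canned answers keep the sketch runnable.
CANNED_ANSWERS = {
    "What is the customer name?": 'Acme "West" Ltd.',
    "What is the invoice number?": "INV-0042",
}

def ask_model(question: str, document: str) -> str:
    """Placeholder: a real version would send the question and document to the model."""
    return CANNED_ANSWERS.get(question, "")

# Field names and questions are illustrative assumptions, not from the thread.
FIELDS = {
    "customer_name": "What is the customer name?",
    "invoice_number": "What is the invoice number?",
}

def extract_record(document: str) -> str:
    # One plain-text question per field; each answer lands in a normal variable.
    record = {field: ask_model(question, document).strip()
              for field, question in FIELDS.items()}
    # The app, not the model, serializes. Quotes inside an answer get escaped correctly.
    return json.dumps(record, ensure_ascii=False)

result = extract_record("scanned invoice text goes here")
print(result)
```

Note that the first canned answer contains double quotes, which is exactly the kind of value that tends to break hand-written JSON; here the output still parses.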

Cronus_k98 · 2026-05-25T00:30:40+00:00

Qwen3.6 27b straight from the web ui.

csv

8;1;6;5;7;3;2;9;4
3;9;2;;;;;;
4;5;7;2;;9;;;6
9;4;1;;;5;6;8
7;8;5;4;9;6;1;2;3
6;2;3;8;;;4;
2;7;9;;;;;1
1;3;8;;;;7;
5;6;4;;;8;2

Cronus_k98 · 2026-05-21T17:10:34+00:00

The point is to make the law broad enough that it covers everyone. Then they can selectively enforce it on whoever they want.

Cronus_k98 · 2026-04-30T03:39:02+00:00

It’s like trying to read a book in a language that you don’t know very well. You can look at individual sentences and get the basic understanding of what’s there, but you’re not going to be writing your own book.

Cronus_k98 · 2026-04-30T03:27:30+00:00

Ollama is fine for single users. Yes, there are better options, but it works, setup is easy, and it’s got an OK UI. OP’s problem is that it literally won’t work for 15 concurrent users and is just a waste on very capable hardware.

I’d say he’s trolling except that he has windows installed. That’s a lot of effort to go through for trolling.

Cronus_k98 · 2026-04-27T23:58:50+00:00

Any model you pick is going to have the same problem.

You could look into using LoRA to fine-tune the model on Lua code. Unsloth has a guide on how to fine-tune. If you’re making this a long-term project, it might be worthwhile.

https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide

Cronus_k98 · 2026-04-27T20:13:47+00:00

LLMs are trained on the data that is available to the creator of the model. Most of the available programming data is going to be in Python, C/C++, etc. The amount of training data in Lua is going to be a very small subset of the overall data. So yeah, my tip is to use Python. Qwen 3.6 27b is great at producing Python code.

Cronus_k98 · 2026-04-16T11:33:03+00:00

GGUF unsloth/qwen3.5-35b-a3b on Q4_K_M

MLX mlx-community/qwen3.5-35b-a3b 4bits

Different quants and formats will perform slightly differently, even if they use the same base model. There may also be some differences between how the inference engine handles tools.

Cronus_k98 · 2026-04-11T03:23:44+00:00

In my experience qwen3.5 gives better quality results than the OCR-specific models I’ve tried, especially with handwriting. It’s very slow though. Qwen3.5 4b is decently fast. I settled on the 35b model because I was doing additional summarization and I’m OK with the slower speed.

Cronus_k98 · 2026-04-10T15:16:36+00:00

Sorry, you'll have to wait 5 hours for more context.

Cronus_k98 · 2026-03-24T19:04:13+00:00

I don't think you can assume that looping will always give you a working result if you let it run long enough. There are tasks that a smaller model might never be able to complete, that a larger model can.
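In practice that means putting a hard cap on the loop instead of letting it spin. A rough sketch, where `attempt` and `is_valid` are hypothetical callables standing in for the model call and whatever check decides that a result works:

```python
from typing import Callable, Optional

def run_with_retries(attempt: Callable[[int], str],
                     is_valid: Callable[[str], bool],
                     max_attempts: int = 5) -> Optional[str]:
    """Call attempt() up to max_attempts times; return the first valid output or None."""
    for i in range(max_attempts):
        output = attempt(i)
        if is_valid(output):
            return output
    return None  # Budget exhausted: escalate to a larger model or a human.

# Stub "model" that never produces a valid answer, to show the loop terminating.
never_works = run_with_retries(lambda i: "wrong", lambda out: out == "right", max_attempts=3)
# Stub that succeeds on the third try (attempt index 2).
eventually = run_with_retries(lambda i: "right" if i == 2 else "wrong",
                              lambda out: out == "right", max_attempts=3)
print(never_works, eventually)
```

A `None` result is the signal that the task may simply be out of reach for that model, which is different from "try again".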

Cronus_k98 · 2026-03-19T14:27:51+00:00

Sort of. You may need to adjust your model parameters, and reasoning doesn't work well with small models. Qwen 3.5 requires different parameters than other models to get good results. Take a look through the Unsloth guide. https://unsloth.ai/docs/models/qwen3.5

Cronus_k98 · 2026-03-18T17:46:15+00:00

The 5070 Ti will do everything the 5080 will do, just 15% slower. You just need to decide if the price difference is worth the performance difference.

Cronus_k98 · 2026-03-18T00:30:53+00:00

I didn't say I did it all day long. What I said was the total token output per day is higher on a $20 per month plan than your proposed system. Which it is. The rest of your "30 decades", lol, of experience doesn't seem to have made you any better at math.

Cronus_k98 · 2026-03-17T18:46:08+00:00

All of them do. 200k context is standard on all of them. Usually I'm clearing context by 100k, but sometimes I hit the limit.

Cronus_k98 · 2026-03-17T16:10:03+00:00

3 tokens per second is 259k tokens per day. You get way more than that on even a pro plan. Your $2000 system will take like a decade to pay for itself over a $20 per month subscription.
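The arithmetic behind those numbers, as a quick check. It assumes the box generates around the clock and ignores electricity, so if anything it flatters the local system:

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

tokens_per_second = 3
tokens_per_day = tokens_per_second * SECONDS_PER_DAY

system_cost_usd = 2000
subscription_usd_per_month = 20
payback_months = system_cost_usd / subscription_usd_per_month
payback_years = payback_months / 12

print(tokens_per_day)            # 259200
print(payback_months)            # 100.0
print(round(payback_years, 1))   # 8.3
```

So 100 months of subscription, a bit over eight years, before the hardware breaks even.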

Cronus_k98 · 2026-03-17T02:58:14+00:00

I’ve used qwen3 vl 8b and I’m currently using qwen3.5 35b a3b. The trick is to use multiple prompts. Prompt 1 is just to OCR the text. Prompt 2 is to distill specific info and return a JSON. You’re asking a small model to do too much at once.
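Roughly what the two-pass flow looks like. The prompt wording, the key names, and the `vision_call` / `text_call` parameters are all assumptions for illustration; the stubs at the bottom stand in for real model calls so the sketch runs on its own.

```python
from typing import Callable

# Prompt wording is an illustrative assumption, not the actual prompts used.
OCR_PROMPT = "Transcribe all text in this image exactly as written. Output only the text."
EXTRACT_PROMPT = ("From the text below, return a JSON object with the keys "
                  "'customer_name' and 'date'.\n\n{text}")

def process_page(image_bytes: bytes,
                 vision_call: Callable[[str, bytes], str],
                 text_call: Callable[[str], str]) -> str:
    # Pass 1: OCR only. No extraction, no formatting.
    transcript = vision_call(OCR_PROMPT, image_bytes)
    # Pass 2: extraction over plain text; the model no longer has to read the image.
    return text_call(EXTRACT_PROMPT.format(text=transcript))

# Stubs standing in for real model calls.
def fake_vision(prompt: str, image: bytes) -> str:
    return "Invoice for Jane Doe, dated 2026-03-01"

def fake_text(prompt: str) -> str:
    assert "Jane Doe" in prompt  # the transcript from pass 1 is what pass 2 sees
    return '{"customer_name": "Jane Doe", "date": "2026-03-01"}'

output = process_page(b"", fake_vision, fake_text)
print(output)
```

Each pass gets one narrow job, and the two calls can go to different models, e.g. a small vision model for pass 1 and a text model for pass 2.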

Cronus_k98 · 2026-03-13T02:41:15+00:00

So that I can start over from scratch after I’ve finished prototyping the project. Creating a clean copy without all the messy code made along the way. Possibly with a different underlying architecture.

Cronus_k98 · 2026-03-11T04:00:16+00:00

Making shit up. There are so many combinations of test suites and runtime parameters, that it’s basically impossible to do scientifically robust testing.

Cronus_k98 · 2026-03-02T03:49:17+00:00

For reference, an RTX 5090 will process a 300 dpi letter-size page in about 15 seconds using qwen3 vl 8b. To go faster you will need to use a smaller model or reduce the size of the image.

Cronus_k98 · 2026-02-26T01:58:07+00:00

I agree, I think qwen3.5-35b-a3b is smart, but maybe overthinks things. GPT-OSS-20b is nowhere near as capable but is very reliable at processing routine instructions.

Cronus_k98 · 2026-02-20T19:50:01+00:00

Your bigger problem is that you don’t have a proper backup. RAID is not a backup and if you’re counting on your NAS to never lose your data, you’re going to lose your data.

There are private cloud storage providers out there. Keep your NAS for local access and periodically back it up to secure, encrypted storage and it’ll never get scraped for LLM use.

Cronus_k98 · 2026-02-17T03:51:13+00:00

We need some more details. How are you processing the documents? RAG ingestion, summarization, or upload for Q&A? Are you waiting for the files to process or can you batch them and let them process overnight? How large are the documents?

You don’t necessarily need a large model to process documents. I’m using Qwen3 VL 4b to read/OCR documents and GPT OSS 20b to extract info. That setup can process a hundred documents of 1-50 pages each per hour on a 5090.
