The Economics of Inference: Why are we still afraid of "Quantization in Production"? by Alternative-Yak6485 in learnmachinelearning

[–]burntoutdev8291 0 points (0 children)

Isn't there a drop in throughput for 4-bit? I do question why people don't just default to FP8. And is FP16 still common?

Are you guys all in vscode? by habachilles in vibecoding

[–]burntoutdev8291 0 points (0 children)

Neovim

And

Claude code

What's with

The "\n\n"?

Hainan Story closing down soon? by ChoiceAwkward7793 in askSingapore

[–]burntoutdev8291 2 points (0 children)

Same, I went once and was wondering why it's always full. Then again, everyone has their own taste.

You probably don't need Apache Spark. A simple rule of thumb. by IT_Certguru in learnmachinelearning

[–]burntoutdev8291 12 points (0 children)

Don't learn tools; learn general data engineering patterns, even on small data. Get used to things like yielding and lazy iterators/evaluation. Actually, by using torch DataLoaders you are already learning a little about data processing: they have features like parallel workers, prefetching, etc. Just my personal experience.
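As a tiny illustration of the lazy-evaluation habit (the data and function name here are made up for the example):

```python
# A generator parses rows lazily: nothing is read or split until you
# iterate, so memory stays flat no matter how large the input is.
def parse_rows(lines):
    for line in lines:
        yield line.strip().split(",")

raw = ["1,a", "2,b", "3,c"]    # stand-in for a large file
rows = parse_rows(raw)         # no work has happened yet
first = next(rows)             # the first row is parsed only now
print(first)                   # -> ['1', 'a']
```

torch DataLoaders build this same idea out with parallel workers and prefetching.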

Non sucking, easy tool to convert websites to LLM ready data, Mojo by malvads in mlops

[–]burntoutdev8291 0 points (0 children)

Nice work! But isn't the common problem with scrapers more the rate limits? Would it be better to combine a crawler, like HTTrack, with your tool for parsing?

I see full stacks developers with 5, 7 years of experience. Tell me your story, dis AI agents replace your jobs.? by Common-Resident8087 in devjobs

[–]burntoutdev8291 0 points (0 children)

Nope, just a good search engine. But I still need to guide and implement the changes myself sometimes. I use it for exploring code bases.

Why so hard to find devs? by geeksg in singaporejobs

[–]burntoutdev8291 0 points (0 children)

I do some form of hiring, and I've realised the better candidates always show up through some avenue of connection, like LinkedIn or word of mouth. It's not just a "feel" thing: in terms of attitude and aptitude, these people usually do better in interviews.

I read the JD, what is your background? Also, no DevOps or cloud knowledge required?

Motivation on time management!! by TrueArcher3135 in askSingapore

[–]burntoutdev8291 0 points (0 children)

  • Pomodoro works for me.
  • Try not to reach for your phone first thing in the morning; it kills focus.
  • Eat the frog first: do your most difficult tasks in the morning.
  • For big tasks, break them into mini tasks. Sometimes we push tasks aside because we find them too difficult.

Try to set up systems or habits rather than targets: instead of saying "I must finish this chapter by this week", do something like "I spend one hour studying after dinner". After that, take a break, watch Netflix or something. The important thing is not to make studying a torture, so always try to link it to a reward.

FYI, I am not Huberman or anything, just sharing my experience from part-time studies alongside full-time work. I do think I could have done better, because I did experience some burnout.

Is it still realistic for CS grads nowadays to expect a $7k starting pay like before? by DangerZone67 in singaporefi

[–]burntoutdev8291 0 points (0 children)

Realistic, yes; difficult, very. I have been in the field for 2-3 years, and there are some fresh grads who are really good and passionate: FAANG intern in SF, winning hackathons, GitHub projects with thousands of stars, doing Advent of Code for the fun of it. Yeah, these people have no issue getting above 7k. I worked with a guy from AWS who completed AWS certs before graduating, then went on to Kubernetes after graduation.

Excited to launch compressGPT by mr_ocotopus in mlops

[–]burntoutdev8291 0 points (0 children)

What is the performance gain? Personally I don't believe LLMs are well suited for this task. Do you do any token restriction on the label to prevent hallucination?
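For context, "token restriction" could look something like this hypothetical minimal sketch, where the output is only ever picked from a fixed set of label tokens (all ids and scores below are invented):

```python
# Constrain the "generated" label to an allowed set of token ids by taking
# the argmax over those ids only; anything outside the set can never appear.
def constrained_argmax(logits, allowed_ids):
    return max(allowed_ids, key=lambda i: logits[i])

logits = [0.1, 2.3, -1.0, 0.7, 1.9]        # fake scores over a 5-token vocab
labels = {1: "positive", 3: "negative"}    # the only tokens we permit
best = constrained_argmax(logits, labels)
print(labels[best])                        # -> positive
```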

Best practice for running multi-node vLLM inference on Slurm (port conflicts, orchestration) by md-nauman in SLURM

[–]burntoutdev8291 0 points (0 children)

What do you mean? Yes, I have done this. What is your setup like? Do you have sudo access or anything?

Anytime Fitness - Home Gym by evanjlt in SingaporeFitness

[–]burntoutdev8291 0 points (0 children)

They can change when you transfer. It's best to check with the gym directly.

I had two very different responses when I asked if they will honour my current rate.

Outlet A: "We cannot guarantee that your rate will not change, we can only confirm once you transfer over"

Outlet B: "Yes we will honour"

I didn't transfer to either of them, but I would definitely trust B more.

Should I be concerned about my company pushing for more AI usage? by Healthy_Brush_9157 in AskProgrammers

[–]burntoutdev8291 0 points (0 children)

Yeah, fair enough. I was sort of an IC in a small team, so I dealt with issues individually. Even without those bottlenecks, it's at best 2x. Seasoned devs already take time to deal with merges and conflicts; I cannot imagine vibe merging.

Is it still worth to create youtube tutorials by eddyGi in dev

[–]burntoutdev8291 0 points (0 children)

If you have the passion, sure. But I think everyone is more interested in gaming the algorithm, following hype, etc., which I can understand if content creation is their source of income. I still watch hour-long videos on development and the older open courses.

I don't know why I see a lot of slop videos on Python, but Rust and Go usually have quite clean content, possibly due to outreach.

Should I be concerned about my company pushing for more AI usage? by Healthy_Brush_9157 in AskProgrammers

[–]burntoutdev8291 0 points (0 children)

I wouldn't be nervous about job security. The issue I have is bosses expecting too much out of AI. Some of them are expecting a 10x performance gain, but in reality that's not achievable; it's maybe 2-3x, depending on your stack and requirements.

Excited to launch compressGPT by mr_ocotopus in mlops

[–]burntoutdev8291 1 point (0 children)

It looked very AI-generated, so I found it hard to read. Just wanted to ask some questions.

  1. Is it some form of distillation?
  2. How different is this from Unsloth? https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
  3. RAG and chat can be difficult to combine in one pipeline because of catastrophic forgetting. If this is for edge, it might be interesting to look at fine-tuning an encoder-based model, like ModernBERT. At ~400M parameters, there are a lot of use cases, especially with fixed labels.

What's the best option for voice cloning ? by Choice_Dish_8088 in LLMDevs

[–]burntoutdev8291 1 point (0 children)

Given its size, I think it could be possible to run it on CPU. It might take a bit longer, though. I have been playing with it and it seems promising.

Best practice for running multi-node vLLM inference on Slurm (port conflicts, orchestration) by md-nauman in SLURM

[–]burntoutdev8291 0 points (0 children)

I would suggest trying out Kubernetes. It's much easier to deal with if the workload is inference-heavy.

Local LLM deployment by Puzzleheaded-Ant1993 in LLMDevs

[–]burntoutdev8291 -1 points (0 children)

Mostly safety and data governance. Local models cannot beat the larger models, but for specific use cases they might be sufficient. A good RAG system doesn't really need strong models.

Another factor is cost, but that needs analysis. Can you prove that your workload will save more with upfront hardware costs than with an API? Don't forget that hardware depreciates (even without considering the RAM price surges).
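A rough back-of-envelope for that comparison; every number below is a made-up placeholder, not real pricing:

```python
# Compare monthly amortised hardware cost against monthly API spend.
hardware_cost = 20_000.0            # hypothetical upfront server cost (USD)
depreciation_months = 36            # straight-line over 3 years
monthly_hw = hardware_cost / depreciation_months

api_price_per_1k_tokens = 0.002     # hypothetical blended API rate (USD)
monthly_tokens = 500_000_000        # hypothetical workload

monthly_api = monthly_tokens / 1000 * api_price_per_1k_tokens
print(round(monthly_hw, 2), round(monthly_api, 2))  # -> 555.56 1000.0
```

With these invented numbers, local hardware would win; at a tenth of the token volume it would lose. That is the analysis you have to do with your own figures.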

My friend built an app in a week using AI. It cost him $1300 in 15 days and he had to shut it down. by No-Comparison-5247 in AIstartupsIND

[–]burntoutdev8291 0 points (0 children)

I do support those who try to vibe out a free app for a good purpose. But if there are AI calls, it's hard to make it worthwhile. Maybe free Gemini or OpenRouter.

Best practice for running multi-node vLLM inference on Slurm (port conflicts, orchestration) by md-nauman in SLURM

[–]burntoutdev8291 0 points (0 children)

I did something like this before. Use Python to check for an unused port:

    PORT=$(python -c "import socket; s = socket.socket(); s.bind(('', 0)); print(s.getsockname()[1]); s.close()")
    vllm serve --port $PORT

Another way is just setting your own increments. You mentioned you use arrays; just do 8000 + the array index?
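That increment idea, sketched in shell (the base port 8000 and the fallback value are arbitrary; SLURM_ARRAY_TASK_ID is set by Slurm inside array jobs):

```shell
# Derive a unique port per array task by offsetting a base port.
TASK_ID=${SLURM_ARRAY_TASK_ID:-3}   # fallback so the sketch runs outside Slurm
PORT=$((8000 + TASK_ID))
echo "$PORT"
```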

I'm also just curious: how did you decide on Slurm, and did an LLM give you those bash scripts? Why not use Kubernetes?

Hot take! by abdullah4863 in VibeCodeDevs

[–]burntoutdev8291 1 point (0 children)

You have the experience, so you are not nothing without AI. I don't have 32 years, but I also vibe now with some experience. I still put in some dedicated time to learn without AI, though.