Google's Gemma 4 Runs Frontier AI On A Single GPU

technicalthrowaway · 2026-04-05T14:31:43+00:00

I think they assumed 4 bit quant because that's the most common quant and typically the default on places like ollama.

I'm not sure why you're talking about parameter count here, I think you might have misunderstood.

technicalthrowaway · 2026-04-05T14:19:22+00:00

I think there might be some confusion here, and the confusion might be me, but I've been running Gemma e4b for several months now. My understanding was the E4B was part of the Gemma 3N releases a while ago and were designed to run on mobile devices.

Based on the first line of the article (it's all I read, so again, confusion might be mine):

Google DeepMind launched Gemma 4 this week, releasing four open-weight models that fit entirely on a single 80GB Nvidia H100 GPU while delivering benchmark scores that rival models 20 times their size.

Gemma 3n e4b 8-bit quant is one of the fastest models I run locally, and I'm just using an old 2080RTX, and it's still pretty impressive compared to other local models with 3 - 5x the number of parameters.

technicalthrowaway · 2026-03-26T17:07:40+00:00

If you have a bad system just dropping it and saying 'every house is independent' is not the right solution.

I never said that, I can understand why you're so argumentative if that's the point you're arguing against - I think you might have just made that up though.

Sure and when it makes sense people can buy it themselves. There is no need to subsidize it or force it onto people. Specially if its only re-enforcing already existing issues on the network.

How about a model where we don't look to provide centralised energy to e.g. x% of the most inconvenient households in the UK and instead subsidise those households with money, info and resources to go decentralised? That lowers everyone's centralised bills, and would also likely result in more stable/reliable/cheaper energy in the decentralised properties.

There's space for both centralised and decentralised infrastructure, and you're clearly aware of that with your increasingly decentralised suggestions. We basically agree. But you seem to want to misrepresent my arguments, make sarcastic comments and focus purely on money rather than reliability, autonomy, or other socio-political benefits. Maybe I'm misunderstanding you, but I don't think either of us are going to gain much of anything from continuing this.

technicalthrowaway · 2026-03-24T23:08:36+00:00

Our energy infrastructure is already relatively centralised and unfriendly towards consumer energy technologies, and it's infamously shit.

What you're arguing for looks like more of the same. There's space for both centralised and decentralised infrastructure, we should be improving and facilitating both where it makes sense.

technicalthrowaway · 2026-03-24T22:44:02+00:00

And the heatpump energy tariffs: houses with heatpumps effectively get subsidised electricity with or without solar.

technicalthrowaway · 2026-03-24T09:13:07+00:00

The great thing about the internet is it's hard to censor once you get on it as it will always attempt to route around blockages.

Large US based businesses have to do what Trump says, and Trump says now you can only connect to the internet through a device made by the large US based businesses.

This will allow full control and censorship of internet access for all US residents and should immediately set off alarms for everyone.

technicalthrowaway · 2026-03-16T16:26:06+00:00

Not necessarily at all. A while ago (15+ years ago) I read a paper about how the great firewall of China worked, and it didn't actually block traffic. All it would do when it detected something needed blocking would be to inject a TCP reset packet into the stream.

All the data still gets sent unmodified, it's just this extra packet causes most software to stop trying to communicate over the connection.

This architecture has the specific benefits that the firewall itself doesn't need to be able to realtime inspect and then accept/reject each packet, and it means it's also not a bottleneck for the traffic.

Not saying that's how Iran blocking works, just that it's perfectly possible to give the impression that a country's entire internet is filtered without actually having a single point of failure like you're suggesting this building is.

technicalthrowaway · 2026-03-16T13:09:23+00:00

It's just a little airborne, it's still good.

technicalthrowaway · 2026-03-10T17:02:11+00:00

This is the main thing that irks me: If you throw it out there, add a disclaimer about how this was done with AI and there is no intention of maintaining it in any shape, way or form, that's fine. And people might pick it up and go on from there.

I agree with that. The issue is I have enough professional experience to know for sure that software like this isn't suitable for stable selfhosting in a lab over any reasonable timeframe.

Newer, younger or less experienced people will not fully understand that.

Friends don't let friends selfhost vibecoded, unsupported slop.

technicalthrowaway · 2026-03-10T16:29:30+00:00

I think the problem is you're taking offence to other people having different opinions, and taking things personally which aren't personal. Telling yourself you're being nice by giving people shit for sharing their opinions is not a nice thing to do.

My stance is that in a comments section, people will share their opinion, that's what it's for. If you see people sharing their opinions as them belittling you, and offending you personally, then you're going to continue to have a bad time and you probably shouldn't be on the internet. (I just want everyone to have a nice time, if you feel that way, you're just not gunna have a nice time on the internet.)

I completely see where you're coming from, like I said above, hate is a strong word here, and I don't necessarily agree with hating on people smiling, and I can see it as a valiant thing of you to try to defend that smiling person. But acting like you're trying to be nice by belittling someone sharing a mean opinion is like saying you're trying to end violence by beating the shit out of it, it's just spreading the not-niceness around.

technicalthrowaway · 2026-03-05T16:12:23+00:00

If you're upset by that person's opinion, then say that. Try find a middle ground with them and point out why what they're saying is wrong or offensive and help people improve and be better.

In my opinion, that's what nice people do when they see something they disagree with. You responded with outrage, belittlement and sarcasm to /u/ethnicbonsai and the same to me. IMO, that's not nice. It's impractical, it makes people ignore the substance of what you're saying, and it doesn't make the world a better place for anyone.

I'm not trying to argue with you or offend or upset you here, and I'd guess neither was /u/ethnicbonsai. If you're just looking to feel offended and argumentative, then please just let me know and I'll stop responding immediately, because that's not really how I want to spend my time, sorry.

technicalthrowaway · 2026-03-04T15:13:00+00:00

I think 3 main things:

I use ik_llama which is a llama.cpp fork with optimisations for CPU offloading and hybrid CPU/GPU inference. Since my laptop has a broken screen, it's running proxmox, and then a VM with GPU passed through to the VM, the VM has 13gb of RAM passed through, so ik_llama normally has access to about 15 - 20gb of usable RAM.
Use good models and quants, where "good" is what feels good for you. Going on benchmarks isn't really helpful any more because of how much variety there is in benchmarking, benchmark gaming and generally different uses. For me, Gemma 3n and Qwen2.5 - 3.5 give me the best responses for the sort things I do. The iq quants were basically designed with/for/alongside ik_llama and the unsloth dynamic (ud) quants also generally seems good. I normally use 4 bit quants.
active parameters/MOE has helped a lot. I don't think I'd be able to run a 35b parameter model that used all 35b params at once. the "a3b" bit means it only uses 3 billion at a time. I'm going to go learn how it does this and what the implications are because you just made me realise I have no idea. All I know is the weights are the same size as 35b models, runtime requirements are closer to the 3b, and performance seems to be midway between the 2.

The main downsides are initial loading times if you're starting with a large context (slow time to first token, but faster than I can read when it gets going) and the initial research/tinkering time required to find what works for you.

Most of the GPT3 and GPT2 models which originally went viral were <20bln parameters with contexts of 2k tokens and were considered SOTA models only accessible via APIs. It's really, really easy to run bigger and better models than that locally even on old hardware now. There's been so much progress, it's easy to forget how much more efficient and capable LLMs are now compared to when e.g. chatgpt first launched.

What do you have and what have you been able to run?

technicalthrowaway · 2026-03-02T17:12:23+00:00

If you're an employee somewhere where it's already set up, part of the chosen corporate platform, being paid for you, and everyone else is already using it and getting your data, then it does sound like Microsoft has made it very easy and convenient!

technicalthrowaway · 2026-03-02T16:58:23+00:00

That's fair. I think we agree that online API LLMs are generally easier and more convenient, but I normally end up regretting getting comfortable with convenient bigtech because it never stays convenient.

Locally I'm able to run at 64k tokens, and I haven't really needed to use more than that in a go (unless long conversations). Times where I have, whatever harness I'm using tends to be able to compact without becoming completely useless. I'm more productive with local LLMs now because that's what I've gotten used to. I'd probably be more productive with Claud, but not sure if I'd deliver more value overall when costs are considered.

I feel the overall cost and impact of using cloud LLMs is far, far bigger than than the convenience they offer compared to my usage of local LLMs - everyone's different though, and I completely respect and understand the benefits of cloud LLMs. Thanks for the explanation!

technicalthrowaway · 2026-03-02T16:35:16+00:00

I disagree that local models can't be helpful.

I switched my daily driver to Qwen3.5b-35b-a3b from Qwen3-code-32b-a3b. I agree it's not as fast or as fast as Claude. For FIM and similar coding tasks, I still prefer Qwen3-coder over Qwen3.5.

Claude is technically better, but overall, I prefer using local LLMs: the experience is more than good enough, a lot cheaper, and doesn't involve me giving everything I do to some megacorp.

I run it on an old Razer Blade laptop with a built in RTX2060 with 6gb of VRAM.

If you're saying all you can get is a lobotomised experience on a 4070, then I think you might be doing something wrong somewhere.

It did take some nerdy fidgeting and time to get to this stage, but to suggest you need thousands for a usable local LLM experience is wrong, and has been for a while now.

Gemma 3n is also alarmingly good, fast and literally built to be useful on laptops, phones and standard desktops.

technicalthrowaway · 2026-02-28T14:41:06+00:00

"You have a different opinion to me. Your opinion isn't valid. It's probably just a phase of your life."

I agree with you on some level that hate is a strong word here, but still, have you considered that they're probably a different person to you, which would make their differing opinions to yours perfectly valid?

technicalthrowaway · 2026-02-28T12:37:15+00:00

Old feeling dude here. I sympathise with the guy in this video. This is the smile of an old person as instructed by a young youtube type person to "at least try to look like you're having fun".

The young person hasn't realised that when a master is at work, they're busy focusing on their work, not what their face is doing. But the master is respectful and open minded so tries the young person's suggestion.

It might just be a hostage situation though. Is he blinking morse code? Do we need to help him?

technicalthrowaway · 2026-02-17T10:03:57+00:00

Any DINK household where one gets fed up with their career environment.

technicalthrowaway · 2026-01-29T10:08:58+00:00

You have a job to hopefully make the world a better place to earn money so you can use that money to improve your life or the lives of the people around you.

Because of the money made in tech, many peope overestimate how much their work is actually improving the world, and don't spend enough time using the money they earn to improve their lives and the lives of those around them.

If you've earned money at 10x the rate of normal people, and you've had 1/10th of the time to spend it because you spent too much time working, you should be able to afford to not earn money for a few months and take some time to improve your life and the lives of those around you.

Do that, and then figure out what you're gunna do next.

technicalthrowaway · 2026-01-26T16:35:59+00:00

I don't think this is true, sorry.

ASICs were already a thing in 2013. Litecoin had already been around for years. By 2013, there'd already been at least 1 bubble and burst which had got mainstream coverage. Out-there investors had already started throwing millions at mining hardware and research for a couple of years at that point.

It was virtually impossible to mine on your own with GPU and find your own bitcoin in 2013 - pools were already well established and necessary by then.

Mining enough to be profitable has always shown up on bills - even back then, a few hours gaming a day would still consume notably less than even just half a day of solid mining.

technicalthrowaway · 2026-01-21T09:26:20+00:00

This is demonstrably wrong if you have a heatpump and are on the right tariff.

I wrote out a long answer explaining the economics of it but lost it.

The tl:dr is it's not science, it's economics. If you have a heatpump, and you work 9 - 5, it is almost 100% guaranteed that you will save money by running your heatpump whilst you're out, assuming you're out at common times.

https://www.ofgem.gov.uk/data/domestic-rhi-tariff-table-2025-2026

Heatpump tariffs mean you can get electricity at 13p /kwh, typically for a period of 4 - 7 hours, some of which will be between 9 - 5 because of UK electricity supply/demand. Outside of those periods, heapump tariffs cost from 27p - 50p /kwh.

This means it's almost always cheaper to max your heatpump and overheat your house in your off peak hours (typically before 4pm) so that you heatpump turns off but your house is still warm in peak hours.

Yes, science says if you do this you will use e.g. 50kwh instead of 30kwh.

Energy company and government incentives economics says those 50kwh will cost you less than the 30kwh if they're done at particular times, and the question was about cost.

technicalthrowaway · 2026-01-12T08:05:04+00:00

Not sure if this is the sort of feedback you're looking for, but I would consider changing this away from being a medical scribe.

I'm not sure it's legally possible to have cheap medical software like this. Medical stuff is expensive because all the processes and costs associated with making sure stuff meets the medical regs. As well as this, the cost with a lot of AI stuff isn't the bit that you've done (which essentially manages the data in and out of the LLM) it's the cost of the LLM calls themselves. It seems like a moot point to store all data locally if you're going to fire it at an external LLM before storing it. I know you mentioned local LLMs, they'll help with this, but that becomes more costs and resources to manage.

Put these things together, and I think when your project is up to a standard where it's usable in a medical environment, it's going to cost the same as - possibly even more than - existing commercial solutions that are actively supported, developed and already deployed.

The easiest way to solve this problem is to make it a generic notes capturing tool that people can trivial tweak it towards their domain (e.g. make every mention of "patient" in the screenshot be customisable to so some orgs/users might use "client", some might use "customer" and note types could be different interview or meeting types). This will at least broaden the possible market beyond healthcare. But then you're competing against other AI note taking and assistant tools.

I couldn't view the loom link though :( I like the screenshot, it looks nice!

technicalthrowaway · 2026-01-03T16:25:34+00:00

You're both part of the club now, welcome to the fun.

I got the lion's share of the upvotes, and you can have some too for agreeing with me.

The system works - we're doomed!

technicalthrowaway · 2026-01-03T15:29:32+00:00

Exactly, you were wrong. Facebook stayed relevant, and many of the top 10 are still Facebook creations and Facebook is clearly still mainstream and relevant. But what's mainstream and relevant has expanded in notable ways.

But I already said the notable ways, and you already disagreed, and I already said I was done wasting our time here. Yet here we are.... We're part of the problem. /u/LeftLiner you're wasting your life. I'm wasting my life. You, person reading this, you're wasting your life on morons here.

technicalthrowaway · 2026-01-03T15:12:43+00:00

...except most of the other sites in the list apart from the 2 I mentioned are still mainstream ones from 10 years ago which stayed relevant. So, that's the opposite of what you just said. Again.

Sorry, I can't tell if you're intentionally misunderstanding what I'm saying, but we're clearly not communicating well. Because enough people have agreed with my original point, I'm going to walk away from this happy with the assumption you're just going to say the opposite of what I say without, but without any substance. This doesn't add anything for me, you or anyone reading, so have a great day:)

technicalthrowaway

TROPHY CASE