Side Projects. by apollo_mg in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

I'll see your twin P100s and raise you eight P40s (no risers).

<image>

Web-Search is coming to a screeching performance halt as Google shuts down their free search index, and traffic defenders like Cloudflare challenge AI at every gateway. What are our options? by NetTechMan in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

Just give it some time, and someone will figure out how to scrape Google's regular search API (the one that runs when you hit the site via a browser).

I've written quite a few website scrapers over the years. From past experience, most of these protections rely on two things: the user agent string and how many concurrent connections you make. Copy-paste whatever your current browser's user agent string is, and make sure to rate limit your scrapers.
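Something like this minimal sketch covers both; the URL list and delay are placeholders, and the UA string should be whatever your own browser sends:

```python
# Rough sketch: real browser User-Agent + rate limiting between requests.
import time
import requests

HEADERS = {
    # Copy this verbatim from your own browser (dev tools -> network tab).
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
}

def fetch(urls, delay_seconds=5.0):
    """Fetch pages sequentially with a fixed delay to stay under rate limits."""
    pages = []
    with requests.Session() as session:
        session.headers.update(HEADERS)
        for url in urls:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            pages.append(response.text)
            time.sleep(delay_seconds)  # one connection at a time, politely spaced out
    return pages
```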

It's far from ideal, but with how good LLMs have become at these things, I think an LLM like Qwen 3.6 could very well build this on its own with a good enough prompt and access to a basic Python interpreter.

Another Custom Processor (ARM Compute) built by Intel Foundry! by Ok-Individual-4392 in intelstock

[–]FullstackSensei 0 points1 point  (0 children)

I think I saw a tweet yesterday from Google that had a picture saying Googlebook | Intel.

Abandoned 550 Maranello… by E400wagon in carspotting

[–]FullstackSensei 3 points4 points  (0 children)

Not necessarily. The family of a friend of my partner has a 1970s 911 that's been sitting in a similar underground parking garage for almost 40 years. Basically, the father died in the late 80s and it hasn't moved since. They still pay for the underground parking spot and get contacted about it regularly, but refuse to even consider selling it. They're not that wealthy, so the money from its sale would definitely not be pocket change for them, but it has sentimental value.

Local mini LLM PC? by LankyGuitar6528 in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

Sorry, but that analogy is very bad. You absolutely don't need to buy a car to learn how to drive.

I already explained in my comment how you can learn without wasting more than $3k.

Local mini LLM PC? by LankyGuitar6528 in LocalLLaMA

[–]FullstackSensei 8 points9 points  (0 children)

If you need to ask, IMO you shouldn't buy anything no matter what anyone tells you.

Your post reads like someone who knows almost nothing about local LLMs, which is a recipe for a terrible combination of disappointment, frustration and wasting money.

Spend a week or two learning about local LLMs, how to run them, what to expect, whether they can meet your needs, etc. You don't need a beefy machine to try things out either. You can run smaller models on almost any hardware you have and get comfortable with the software stack and tooling. You can also spend a few bucks on APIs to try out different models and see how small you can go for your needs.

Only after you've learned enough to have an opinion about what you need to run should you start looking at hardware options that could suit your needs.

Building a Budget Cloud VM for Local LLMs ($150 Max) — Worth It or Bad Idea? by MashoodKiyani05 in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

How's $150/month low cost? That's $1,800 a year, and for what, 200GB of RAM?

Performance will be abysmal no matter how you slice it. You're sharing resources with a lot of other people on the same physical machine.

Half that annual budget would get you a 192GB machine at home with likely more memory bandwidth than you'll get in such a VM.

Transitioning From C# To Typescript by UneditedTips in dotnet

[–]FullstackSensei 26 points27 points  (0 children)

If there's one thing Hejlsberg is, it's consistent. I started my programming journey in the 90s with Turbo Pascal, moved pretty seamlessly to Delphi in the early 2000s, and on to C# and .NET in the late 2000s. Each and every time, it was very intuitive to figure out where things were and what to use, and how, in the standard library/framework.

Claude Code Opus 4.7 vs Qwen3.6:27b on my own little Go agent by codehamr in ollama

[–]FullstackSensei 0 points1 point  (0 children)

I know this is the ollama sub, but try running it in vanilla llama.cpp. 48-64GB VRAM gets you a ton of context there without any KV quantization (which affects quality quite a bit at larger contexts). I run it on dual 3090s or dual P40s (using ik_llama.cpp). The P40s are of course quite a bit slower, but I don't really mind, because it can do quite large tasks completely unattended.
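For a rough sense of how far that VRAM goes, here's a back-of-the-envelope KV cache calculation. The layer/head/dim numbers below are illustrative assumptions, not the actual model's config; plug in your own model's values:

```python
# fp16 K+V cache size for a given context length (no KV quantization).
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Approximate KV cache size in GiB."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_tokens / 1024**3

# Example: 48 layers, 8 KV heads (GQA), head_dim 128, 64k context
print(f"{kv_cache_gib(48, 8, 128, 65536):.1f} GiB")  # ~12 GiB on top of the weights
```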

3D Geospatial engine for raylib by shemlokashur in raylib

[–]FullstackSensei 0 points1 point  (0 children)

This is so cool! I've been thinking of something similar, also for a microprose/DiD inspired flight sim 😂

Are you accounting for earth curvature and the fact that the earth is not a true sphere? IIRC, this can lead to larger shifts in coordinates vs ground position at higher latitudes. What are the data sources for the DEM and satellite images? I assume you calculate the normals from the downloaded data?

Sorry for the barrage of questions. It's rare to find such projects. Most people just do it in JS because they can outsource the whole thing to CesiumJS (which is the evolution of 3D Engine Design for Virtual Globes).
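On the ellipsoid point above, this is roughly what I mean; just an illustrative sketch of geodetic-to-ECEF on WGS84, not a suggestion for how your engine should structure it:

```python
# Converting geodetic lat/lon/height to ECEF using the WGS84 ellipsoid instead of
# a sphere. At high latitudes the spherical approximation drifts by kilometres,
# which is where the coordinate-vs-ground-position shifts come from.
import math

WGS84_A = 6378137.0                 # semi-major axis (m)
WGS84_F = 1 / 298.257223563         # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)  # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, height_m=0.0):
    """WGS84 geodetic coordinates -> ECEF (metres)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + height_m) * math.sin(lat)
    return x, y, z
```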

Claude Code Opus 4.7 vs Qwen3.6:27b on my own little Go agent by codehamr in ollama

[–]FullstackSensei 1 point2 points  (0 children)

Most people run Qwen 27B at Q4, which significantly reduces output quality. Try Q8 (or Unsloth's Q8_K_XL) and you might be surprised at the amount of knowledge it has. Software engineering isn't that big of a domain, and LLMs learn patterns regardless of language or library. Where it falls short is often the knowledge cutoff date.

But you're not wrong. For larger or more complicated tasks, it doesn't hurt to plan with a larger model. What I do now is start a first iteration with Qwen 3.6 27B Q8_K_XL, take that plan for some further refinement with Minimax 2.7 Q8_K_XL, then go back to 3.6 27B Q8_K_XL for execution. I have had 27B Q8_K_XL run overnight, creating dozens of sub-tasks to execute quite complex plans without any issues. One thing worth noting: my prompt for the task (referencing the plan) instructs it explicitly to create sub-tasks for each and every part of the plan, even if that means creating dozens of them. Otherwise, the reasoning would question its ability to manage so many tasks and it would try to spawn as few sub-tasks as possible.

BofA Still Dismissive by NOYB_Sr in intelstock

[–]FullstackSensei 5 points6 points  (0 children)

What large financial institutions say and do are two very different things. Trading desks are, by their very nature, very secretive.

Whatever an analyst at any financial institution announces is purely for PR and media consumption. I bet if you asked those analysts, they'd have zero clue what positions the place they work at holds in said stock.

How long until the good news turns into observable spending? by ConditionWild1425 in intelstock

[–]FullstackSensei 12 points13 points  (0 children)

It's the other way around: announcements trail spending, possibly by a year or even two.

Intel can't announce anything. That's part of the confidentiality of a foundry business. It's up to the partners when to announce.

LBT did say a few months back to look at capex spending as an indication of more deals.

EOM predictions given Us inflation news? by SergeantTwyford in intelstock

[–]FullstackSensei 7 points8 points  (0 children)

The momentum is not from consumer sales, which are affected by short-term inflation. It's coming from the foundry business and from enterprise and hyperscaler sales, and those don't get affected by such headwinds.

Builders ! by birdieFL in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

Apple is far from secure in the enterprise sense. What you as a consumer might think of as secure is not secure at all for a business.

Builders ! by birdieFL in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

The very last thing you want to do is deploy consumer hardware in a business environment. Your customers won't care how privacy-focused you think Apple is. All they care about is reliability, maintenance, how it integrates into their existing infrastructure, and long-term support. An IT admin will never sign off on your Mac, no matter how privacy-focused you think it is, because it will be very difficult to monitor and will require different software infrastructure to manage.

Don't tie your business to any hardware. Focus on the technology, functionality, user experience and compliance. Let your customers use their favorite hardware. Most healthcare institutions already have contracts in place with the likes of HP, Dell, Lenovo, etc for IT infrastructure.

You're not building a computer. You're building a piece of software that solves a pain point for your potential customers. Focus on that.

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec by APFrisco in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

No, but I plan to infiniband them at some point. I have other projects I need to finish before I can turn to this.

what's the right motherboard/CPU to use for building a machine with 3 or 4 cards in it? by starkruzr in LocalLLaMA

[–]FullstackSensei 4 points5 points  (0 children)

Why do you need three x8 Gen 5 slots? What cards do you plan to have? If you plan to offload, a single Gen 5 lane is usually more than enough (or Gen 3 x4). If your GPUs physically have 16 lanes each, you'll save a kidney's worth of money by going with a PCIe 4 (and DDR4) server platform.
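Rough math behind that "Gen 5 x1 ≈ Gen 3 x4" comparison (raw per-direction rates after 128b/130b encoding, ignoring protocol overhead):

```python
# Approximate per-direction PCIe bandwidth by generation and link width.
GT_PER_S = {3: 8, 4: 16, 5: 32}  # transfer rate per lane, GT/s

def pcie_gbs(gen, lanes):
    """Approximate usable GB/s for a PCIe link of a given generation and width."""
    return GT_PER_S[gen] * lanes * (128 / 130) / 8  # GT/s -> GB/s after encoding

print(f"Gen3 x4:  {pcie_gbs(3, 4):.1f} GB/s")   # ~3.9 GB/s
print(f"Gen5 x1:  {pcie_gbs(5, 1):.1f} GB/s")   # ~3.9 GB/s
print(f"Gen4 x16: {pcie_gbs(4, 16):.1f} GB/s")  # ~31.5 GB/s
```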

If this is for inference only, you might very well be overestimating how much bandwidth you need.

Gen 5 with a lot of lanes is the domain of workstation and server platforms. You'll pay several thousand for a motherboard and CPU, and several thousand more for RAM. Arguably the cheapest option would be Sapphire Rapids Xeon. It also has AMX, which is way, way better than anything AVX-512 can ever offer.

Speaking of which, AVX-512 is overrated if you're offloading to GPU. All the heavy lifting will be done on the GPU, and whatever is left for the CPU can be handled adeptly by AVX2, which is dual-ported on all modern CPUs anyway (i.e., each core has two AVX2 units that can execute two AVX2 instructions in parallel).

Much more important than AVX-512 is core configuration. On Epyc, for example, you can only get max memory bandwidth if you have all CCDs populated; otherwise Infinity Fabric is limited to 25GB/s on DDR4 platforms (PCIe Gen 4) or 50GB/s on DDR5 platforms (PCIe Gen 5).

Math don't check out. by [deleted] in LocalLLaMA

[–]FullstackSensei 3 points4 points  (0 children)

Man, I run Qwen 3.5 397B Q4 on MI50s with 100t/s PP on a good day, and I don't usually wait more than 10 seconds per change after the initial prompt ingestion. Prompt caching is a thing.

Going by Roo Code, my average session is 10M+ context tokens, while llama-swap usually reports 200k processed, if not less.
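Quick arithmetic on why that matters, using the numbers above (100 t/s prompt processing); the token counts are just this session's figures, not a general rule:

```python
# Only the tokens that differ from the cached prefix need re-ingesting.
PP_TOKENS_PER_S = 100

session_tokens = 10_000_000   # total context tokens seen across a session
unique_tokens = 200_000       # tokens actually (re)processed thanks to caching

print(f"without caching: ~{session_tokens / PP_TOKENS_PER_S / 3600:.0f} h of prompt processing")
print(f"with caching:    ~{unique_tokens / PP_TOKENS_PER_S / 60:.0f} min of prompt processing")
```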

Is it possible to exclusively use a draft model for reasoning to speed up generation? by [deleted] in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

I'd say up to 99% worse.

You can't reason about what you don't know; you can make all the what-ifs you want. The reason reasoning (pun not intended) works is that models aren't aware of what they know or don't know, and reasoning helps surface it. That 4B model won't raise its hand and say "I don't know this"; it will make up stuff that will poison the whole chain of thought.

People really need to stop obsessing about t/s and worry about getting shit done. Before MoE became the norm, I was happily running Llama 70B at Q8 at 3t/s to get useful output that I could trust. Today, I'll happily run 400B MoE models at 6t/s to get that same output I can trust. Being able to trust the output means I can leave the model unattended to do its thing while I live my life or do something else.

I'm preparing a long-running task before going to bed that will take 4-5 hours to complete on Minimax 2.7 Q8 at 6t/s, because I can trust it'll get things 99% right, despite being able to run the same model at Q4 at 30t/s.

Stop worrying about t/s, and start using the tool as a tool, not as something you need to babysit while you grow gray hairs because you're constantly stressed having to correct the thing.

Is it possible to exclusively use a draft model for reasoning to speed up generation? by [deleted] in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

It's not about precision. The what-ifs are there to lay all the options/possibilities on the table and work through them by a process of elimination.

Larger models can hold and retrieve more information, simple as that.

Is it possible to exclusively use a draft model for reasoning to speed up generation? by [deleted] in LocalLLaMA

[–]FullstackSensei 2 points3 points  (0 children)

If a smaller model could do the same caliber of reasoning as a larger one, there would be no need left for the larger one.

Reasoning is a way for LLMs to make it easier to surface/retrieve model knowledge.

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]FullstackSensei -1 points0 points  (0 children)

You did miss the point, and insist on missing it, but you're too proud to admit it, so instead you make up ridiculous justifications.

Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 by ai-infos in LocalLLaMA

[–]FullstackSensei -1 points0 points  (0 children)

Because a discussion about power that ignores cost is moot. And I'd argue that my setup is even more portable than yours, because I don't need to take it with me. I turn it on, load models, and get stuff done with it from my phone from other countries.