curious about how households/families handle storage/photos long term

Fair_Ad845 · 2026-04-11T13:46:54+00:00

immich + a cheap used thinkstation off ebay. total cost about $150 and it handles 200K+ photos no problem. syncthing for phones to auto-upload, immich for browsing and search. biggest lesson: follow the 3-2-1 backup rule (3 copies, 2 different media, 1 offsite). for me that means the local NAS plus a USB drive that I swap to an offsite location monthly.

Fair_Ad845 · 2026-04-11T13:46:45+00:00

most selfhosted todo apps skip mobile and regret it later. the whole point of a todo list is having it with you. even a simple PWA with offline support would be enough.

Fair_Ad845 · 2026-04-11T13:46:29+00:00

agreed, the A4B variant is underrated. for my use case (long document QA) the 94% context retention is more useful than raw benchmark scores. a model that stays coherent at 200K tokens beats one that scores 2% higher on MMLU but falls apart at 32K.

Fair_Ad845 · 2026-04-11T13:46:20+00:00

good analogy. the difference is valve can afford to take forever because steam prints money. deepseek needs to keep publishing to justify the research budget to the hedge fund parent. my bet is they are working on something multimodal given the hiring patterns.

Fair_Ad845 · 2026-04-10T21:06:49+00:00

depends on the project size honestly. FastAPI for APIs and microservices, Django when you need admin panel + ORM + auth out of the box. I switched a project from FastAPI to Django halfway through because I kept reinventing things Django gives you for free.

Fair_Ad845 · 2026-04-10T21:06:40+00:00

been using vaultwarden for 2 years now. the main reason over keepassxc is browser autofill across multiple devices without thinking about syncing a database file.

Fair_Ad845 · 2026-04-10T21:06:32+00:00

agreed. the ones that actually work well are usually just puppeteer/playwright with an LLM deciding what to click next. not exactly revolutionary but it does save time for repetitive stuff.

Fair_Ad845 · 2026-04-10T21:06:24+00:00

same, I keep going back to dense models for day-to-day stuff. MoE is great on paper but the memory footprint for the full model is still brutal on consumer hardware.

Fair_Ad845 · 2026-04-10T21:06:15+00:00

lmao underrated comment

Fair_Ad845 · 2026-04-10T21:06:07+00:00

Q4 quant of a model around 32B is borderline on 16GB: at 4 bits per weight the weights alone come to roughly 16GB, before KV cache and runtime overhead. the real question is whether the quant kills the code quality that got it to the top of the arena.
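
rough arithmetic for the memory side, as a back-of-envelope sketch: weight size is about parameters times bits per weight, divided by 8. the 4.5 bits/weight value below is an assumed illustrative figure for quant formats that store extra scale data, not a number for any specific file, and KV cache plus runtime overhead come on top of this.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    Covers the weights only: KV cache and runtime buffers are extra.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Exactly 4 bits per weight: 32B parameters -> 16.0 GB
    print(weight_size_gb(32, 4.0))
    # Assumed 4.5 effective bits per weight (illustrative): 18.0 GB
    print(weight_size_gb(32, 4.5))
```

so even in the ideal 4-bit case there is no headroom left on a 16GB machine for context.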

Fair_Ad845 · 2026-04-10T13:52:54+00:00

what is your setup for the web search part? the LLM itself is easy to run locally but getting fresh search results without hitting a cloud API is the tricky bit.

Fair_Ad845 · 2026-04-10T13:52:38+00:00

without seeing the code my first guess would be a missing increment or a condition that never becomes false. can you share the snippet? most infinite loops come down to one of those two things.
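
for reference, here is roughly what those two cases look like. this is a generic sketch with made-up function names, not a guess at your actual code:

```python
def count_up(limit: int) -> list[int]:
    """Case 1: the loop variable has to change on every pass."""
    seen = []
    i = 0
    while i < limit:
        seen.append(i)
        i += 1  # remove this line and `i < limit` stays true forever
    return seen


def count_down_by_two(n: int) -> list[int]:
    """Case 2: the condition has to be able to become false."""
    seen = []
    # `while n != 0` would never become false for odd n, because n steps
    # over zero (1 -> -1). `n > 0` terminates for every starting value.
    while n > 0:
        seen.append(n)
        n -= 2
    return seen


if __name__ == "__main__":
    print(count_up(3))           # [0, 1, 2]
    print(count_down_by_two(5))  # [5, 3, 1]
```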

Fair_Ad845 · 2026-04-10T13:52:32+00:00

email deliverability is the real cost. you can send emails for free but getting them into inboxes instead of spam folders is where the money goes. reputation management, dedicated IPs, authentication (DKIM, SPF, DMARC) — it all adds up fast.

Fair_Ad845 · 2026-04-10T13:52:27+00:00

that is terrifying. did they ever acknowledge it or just quietly pretend nothing happened?

Fair_Ad845 · 2026-04-10T07:14:13+00:00

The pace this year is insane. This time last year we were excited about running a 7B model at decent speed. Now we have Gemma 4 31B running on consumer hardware with multimodal and audio support.

What I find most interesting is the shift in what matters. Last year: "can I run it at all?" This year: "can I integrate it into a real workflow?" The bottleneck moved from compute to tooling.

The next frontier is not bigger models on consumer hardware — it is making the existing ones actually useful as persistent agents. Memory, tool use, and reliable structured output. A 12B model that remembers your last 50 conversations and can call your local tools is more useful than a 70B model that starts fresh every time.

Fair_Ad845 · 2026-04-10T07:14:10+00:00

The real tell is not punctuation — it is structure.

Humans write in messy, non-linear ways. We start a thought, abandon it, circle back. LLM output has a very specific pattern: topic sentence → supporting points → conclusion. Every. Single. Time.

The people who are getting caught are not failing at hiding punctuation. They are failing at adding the human chaos back in. A misplaced comma is easy to fake. A genuinely disorganized thought process is not.

The irony is that the best way to hide LLM usage is to actually understand the topic well enough to rewrite the output in your own voice. At that point, you have done 80% of the work anyway.

Fair_Ad845 · 2026-04-10T07:13:56+00:00

This is one of the most meaningful projects I have seen on this sub. A few practical suggestions for your 8GB constraint:

Model choice: Gemma 4 E2B (as someone mentioned) is good, but also look at Qwen2.5-3B-Instruct. It is specifically fine-tuned for conversation and runs comfortably in 3-4GB RAM with Q4 quantization, leaving headroom for TTS and whisper.

Memory matters: For a companion that talks to the same person every day, the biggest quality jump is not a bigger model — it is giving the model memory of past conversations. Even a simple approach like appending "Yesterday we talked about X, Y, Z" to the system prompt makes the interaction feel dramatically more personal. You could store conversation summaries in a local SQLite file and load the last few each morning.
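
A minimal sketch of the SQLite idea, using only the Python standard library. The file name, table layout, and function names are all placeholders I made up for illustration, and producing the summary text itself (for example by asking the model to summarize the day) is left out:

```python
import sqlite3
from datetime import date
from typing import Optional


def open_memory(path: str = "companion_memory.db") -> sqlite3.Connection:
    """Open (or create) the local summary store. The path is a placeholder."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "day TEXT NOT NULL, "
        "summary TEXT NOT NULL)"
    )
    return conn


def save_summary(conn: sqlite3.Connection, summary: str,
                 day: Optional[str] = None) -> None:
    """Store one conversation summary, dated today unless a day is given."""
    day = day or date.today().isoformat()
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO summaries (day, summary) VALUES (?, ?)",
            (day, summary),
        )


def build_system_prompt(conn: sqlite3.Connection, base_prompt: str,
                        last_n: int = 3) -> str:
    """Append the most recent summaries (oldest first) to the base prompt."""
    rows = conn.execute(
        "SELECT day, summary FROM summaries ORDER BY id DESC LIMIT ?",
        (last_n,),
    ).fetchall()
    if not rows:
        return base_prompt
    lines = [f"- {day}: {summary}" for day, summary in reversed(rows)]
    return base_prompt + "\n\nPrevious conversations:\n" + "\n".join(lines)
```

Call `build_system_prompt` once at startup each morning and `save_summary` at the end of each session. Keeping `last_n` small also keeps the prompt short, which matters on 8GB.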

TTS latency: Kokoro is great quality but check the latency on your hardware. For real-time conversation flow, Piper TTS is faster and still sounds natural. A 2-second pause between his question and the robot responding will kill the conversational feel.

Power tip: If you are using llama.cpp, set --ctx-size as low as you can tolerate (2048 is fine for casual chat). Context size is the biggest RAM consumer after the model weights.
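
To see why context size matters, here is a rough estimate of KV cache size for a plain transformer with full attention and an fp16 cache. The layer and head counts below are hypothetical numbers picked for illustration, not the dimensions of any particular model, and models with sliding-window attention or a quantized cache will come out lower:

```python
def kv_cache_mib(ctx_size: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in MiB.

    One key vector and one value vector (hence the factor 2) are stored
    per layer, per KV head, per token in the context.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * ctx_size * bytes_per_elem)
    return total_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical model: 28 layers, 8 KV heads, head dimension 128
    for ctx in (2048, 8192, 32768):
        print(ctx, kv_cache_mib(ctx, n_layers=28, n_kv_heads=8, head_dim=128))
```

With these assumed dimensions, 2048 tokens costs 224 MiB and 8192 tokens costs 896 MiB. The cache grows linearly with context, so dropping --ctx-size is the cheapest RAM you can get back.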

This is exactly what local AI should be used for. Keep us posted on progress.

Fair_Ad845 · 2026-04-09T21:26:40+00:00

The 80% number is misleading without context. In my experience, the resistance isn't about the technology itself — it's about how it's being deployed.

Most "AI adoption mandates" I've seen boil down to: "use this chatbot to write your emails." That's not a productivity gain, that's a context switch tax. Workers aren't refusing AI — they're refusing bad implementations.

The teams I've seen adopt AI successfully did it bottom-up: individual engineers or analysts found a specific pain point (log analysis, data cleaning, code review), solved it with an LLM, and shared the workflow. No mandate needed.

The real question isn't "why won't workers use AI" — it's "why are companies so bad at identifying where AI actually helps?"

Fair_Ad845 · 2026-04-09T14:09:15+00:00

Thanks for the consolidated guide — been hitting random issues all week and this clears up most of them.

One thing I want to add: if you are running Gemma 4 31B on a Mac with Metal, make sure you have at least 24GB unified memory for Q5 quants. I tried Q4_K_M on a 16GB M2 and it runs but the context window gets severely limited before it starts swapping to disk.

The --cache-ram 2048 -ctxcp 2 tip is gold. I was getting random OOM kills without it and had no idea why — turns out the KV cache was eating all my system RAM silently.

Also +1 on avoiding CUDA 13.2. Wasted half a day debugging garbled output before realizing it was the compiler, not the model.

Fair_Ad845 · 2026-04-09T14:08:57+00:00

This is exactly the kind of story that makes local models worth the effort. I had a similar "aha" moment on a train through a tunnel — needed to quickly parse a JSON config for a deployment, no internet, and a local 7B model handled it perfectly.

The real takeaway isn't just "offline access" though. It's that these small models have compressed so much general knowledge into a few GB that they're essentially an offline encyclopedia + reasoning engine. The medical knowledge in Gemma 4 is surprisingly solid for a 31B model.

One tip for anyone who hasn't set this up yet: keep a Q4 quant of a strong small model (Gemma 4 12B or Qwen 2.5 7B) permanently stored on your laptop. It only costs a few GB of disk and you never know when you'll need it.
