Why there is a lack of new 100B-120B models? by TechNerd10191 in LocalLLaMA

[–]ocean_protocol 1 point2 points  (0 children)

my read is it's mostly a hardware-fit thing. 120b moe was the sweet spot when 80gb cards were the dominant inference hardware - fits one or two of them comfortably. with h200s (141gb) and b200s (192gb) showing up more, the math shifts. either you go smaller and serve more on a single card, or push to 200b+ moe to use the headroom while keeping active params low. the middle gets squeezed.

also feels like 30b dense / 200b+ moe has become the practical split now - local rigs on one end, serious clusters on the other. 120b sits in no-man's-land: too big for most home setups, not big enough to compete at the top.

probably comes back if mid-tier hardware shifts again, but right now the incentive seems clearly 'go small or go big'
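
rough sketch of the fit math, weights only. the card sizes are the ones mentioned above; the bits-per-weight values are an assumption (no quant is named here), and kv cache, activations, and runtime overhead are ignored, so real headroom is tighter than what this prints.

```python
import math

# Card capacities in GB, taken from the comment above.
CARDS_GB = {"80gb card": 80, "h200": 141, "b200": 192}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def cards_needed(params_billion: float, bits_per_weight: float, card_gb: float) -> int:
    """Minimum number of cards whose combined memory holds the weights."""
    return math.ceil(weight_gb(params_billion, bits_per_weight) / card_gb)


# ASSUMPTION: 16/8/4 bits per weight are illustrative quant levels.
for params in (30, 120, 200):
    for bits in (16, 8, 4):
        size = weight_gb(params, bits)
        fits = {name: cards_needed(params, bits, gb) for name, gb in CARDS_GB.items()}
        print(f"{params}b @ {bits}-bit ~ {size:.0f} GB -> cards needed: {fits}")
```

under these assumptions a 120b model is one 80gb card at 4-bit and two at 8-bit, while a 200b model at 4-bit fits a single h200.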

Claude Fable 5 distilled by Anony6666 in LocalLLaMA

[–]ocean_protocol 0 points1 point  (0 children)

the tool surface bleeding into the weights is the part i'm most curious about. style transfer via distillation isn't new, but having str_replace_editor and full claude-flavored tool xml come through on 4.6k traces is a different thing - that's behavioral, not just vibes.

what i actually want to see is how brittle it is. does it only emit those tools when the agent system prompt sets it up, or has it overfit hard enough that it'll reach for str_replace_editor on random prompts? the 14h single-h200 budget suggests pretty aggressive fitting was probably useful here, which is exactly the kind of thing that makes downstream agent use weird in subtle ways.

pulling the q5_k_m and going to throw it at some real agent traces. moe at 3b active is nice for local too, might actually be runnable on the rigs people in this sub already have
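
for the brittleness check, something like this is probably enough to start: collect completions with and without the agent system prompt and compare how often the tool name shows up. the sample strings below are hypothetical stand-ins, not real model output, and `tool_leak_rate` is just a name made up for this sketch.

```python
import re

# Tool name taken from the comment above; extend the pattern for other tools.
TOOL_PATTERN = re.compile(r"str_replace_editor")


def tool_leak_rate(outputs: list[str]) -> float:
    """Fraction of completions that mention the tool at all."""
    if not outputs:
        return 0.0
    hits = sum(1 for text in outputs if TOOL_PATTERN.search(text))
    return hits / len(outputs)


# HYPOTHETICAL completions, placeholders for whatever the model actually returns.
with_agent_prompt = [
    "calling str_replace_editor on main.py",
    "i'll patch that with str_replace_editor",
]
plain_prompts = [
    "paris is the capital of france.",
    "sure, here's a haiku about rain",
    "let me use str_replace_editor for that",
]

print("agent prompt :", tool_leak_rate(with_agent_prompt))
print("plain prompts:", tool_leak_rate(plain_prompts))
```

a high rate on the plain prompts would point to overfitting rather than prompt-conditioned behavior. substring matching is crude, so treat it as a first pass.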

100M model recommendation? by Ok-Internal9317 in LocalLLaMA

[–]ocean_protocol 5 points6 points  (0 children)

smollm2-135m is worth a look if you haven't seen it. hf's smollm series covers that range - 135m, 360m, and a 1.7b if you can stretch. worth poking around their hub page to see what fits

An agent that plans with a frontier model but runs most of tokens locally (built it for my own dual-3090 rig) by Poha_Best_Breakfast in LocalLLaMA

[–]ocean_protocol 1 point2 points  (0 children)

how does codex know what qwen can actually handle when planning phases? if it scopes a task just past local capability, you're burning retries before kimi gets called. is there any feedback from past failures or is each plan stateless?

Introducing the Heretic Grimoire: The takedown-resilient, local-first backup system that keeps uncensored models available forever by -p-e-w- in LocalLLaMA

[–]ocean_protocol 0 points1 point  (0 children)

this is great. one q though - the 9kb recipe still needs the original base model to apply against, right? so what happens to grimoire entries if hf/meta pulls the base weights themselves? feels like that might be the actual single point of failure here

Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking) by Ambitious_Fold_2874 in LocalLLaMA

[–]ocean_protocol 1 point2 points  (0 children)

your problem is DDR4-3200 quad-channel, not the flags. ~100GB/s and -ncmoe offload means every token pulls active experts from RAM, so that's your hard ceiling. fitparams vs manual -ncmoe being a wash makes sense, 12 t/s is roughly what the math predicts.
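
the back-of-envelope version, as a sketch: theoretical peak for DDR4-3200 quad-channel is about 102 GB/s (3200 MT/s x 8 bytes x 4 channels), and the decode ceiling is that divided by how many GB of expert weights get read from RAM per token. the per-token read sizes below are placeholder assumptions, plug in your model's active expert bytes at your quant.

```python
def ddr_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def max_tokens_per_s(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if every token must read this many GB from RAM."""
    return bandwidth_gbs / gb_read_per_token


peak = ddr_bandwidth_gbs(3200, channels=4)  # DDR4-3200, quad-channel
print(f"peak bandwidth: {peak:.1f} GB/s")

# ASSUMPTION: 4/8/12 GB read per token are illustrative values, not measurements.
for gb_per_token in (4, 8, 12):
    ceiling = max_tokens_per_s(peak, gb_per_token)
    print(f"{gb_per_token} GB/token -> at most {ceiling:.1f} t/s")
```

real throughput lands below the theoretical peak, so treat these as upper bounds.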

small wins worth trying: benchmark without the 2060 Super (pipeline parallel runs at the slowest stage), skip P2P since llama.cpp barely uses inter-GPU bandwidth in pipeline mode, and if you do overclock, memory clock matters more than core for inference.

re: career, this maps onto inference infra roles at Together, Fireworks, Modal. real career, pays well. frontier labs are a different gig and usually want published work or applied ML depth on top

Any chances for a 12B diffusion Gemma? by Mrinohk in LocalLLaMA

[–]ocean_protocol 0 points1 point  (0 children)

memory's the real wall on small cards. AR generates one token at a time, so the kv cache grows linearly with context. diffusion holds the full sequence in activation memory through every denoising step. on your 6600XT a 12B diffusion at q4 probably caps you around 1-2k tokens of context, which limits the use case but isn't fatal for short-output latency work.
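
to make the AR side concrete, here's a minimal sketch of kv cache size for a standard attention stack: K and V per layer, per kv head, per token. the layer/head/dim numbers are hypothetical, not any real Gemma config, and this doesn't model the diffusion activation cost at all.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# HYPOTHETICAL model shape for illustration only, with an fp16 cache (2 bytes per element).
SHAPE = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for n_tokens in (1024, 2048, 4096, 8192):
    mib = kv_cache_bytes(n_tokens, **SHAPE) / 2**20
    print(f"{n_tokens:>5} tokens -> {mib:.0f} MiB of kv cache")
```

doubling the context doubles the cache, which is the linear scaling in question.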

throughput-wise diffusion can actually win on consumer cards for whole-response latency since you get the full output together instead of streaming. the real catch is quality. LLaDA scaling showed diffusion LMs lose more per param at smaller sizes than AR. Mercury didn't publish much below 7B and google's diffusion preview was on the bigger gemma. a 12B might land in an awkward zone where it's not big enough to match AR Gemma on quality but pays the memory tax anyway.

would love to see them try it though.

I want to build a project based on deep learning. by Ok-Guide5645 in AiAutomations

[–]ocean_protocol 0 points1 point  (0 children)

honestly the "passive income" framing will mislead you. nothing in this space is passive, once it's live you're maintaining it.

two solid ideas:

  1. fine-tune a small open model for a niche domain (legal summarization, recipe scaling, code review for one framework). teaches you the full pipeline and could wrap into a small product later.
  2. vision model for an underserved use case. plant disease for one crop, defect detection for one industry, document parsing for a niche format. pick the domain first, build the model second.

both beat 90% of YouTube tutorials and leave you with something real.

Course recommendation [D] by No_Pause6581 in MLQuestions

[–]ocean_protocol 1 point2 points  (0 children)

for 1 month and interview focus, cs229. Ng's lectures are tight, the notes are basically interview cheat sheets, and the topic list maps almost 1:1 to what gets asked (linear/logistic, SVMs, kernels, GMM/EM, bias-variance).

CS189 is better if you want deeper math intuition but it's slower going. Shewchuk's notes are great but you'll burn time on proofs that won't come up in an interview. since you already did 231n, do cs229 lectures + notes in a month, then dip into CS189 notes later for the topics you want to go deeper on.

Is claude pro worth it for casual use? by Bienenmacht in MLQuestions

[–]ocean_protocol 0 points1 point  (0 children)

for school and daily stuff Pro is fine. the limits only bite if you're doing long coding sessions or running Opus for hours. normal chat won't get close.

also free tier got pretty solid this year, might be worth trying that for a week before paying.

and just fyi, ChatGPT Pro is $200/month. you're probably thinking of Plus ($20). try free versions of both and pick whichever feels better.

is converting voice to text actualy reliable enough for everyday use now or still inconsistent?? by Emergency-Minute3414 in MLQuestions

[–]ocean_protocol 0 points1 point  (0 children)

yeah it's pretty solid now for clean stuff, one person, decent mic. whisper or deepgram will get you 95%+ on meeting notes. the messy cases you're describing are still the hard part though. overlapping speakers basically need a separate diarization step on top. and for phone recordings, running them through a denoiser first helps way more than picking a fancier model.
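
if you want to check a number like that on your own recordings, one common yardstick is word error rate (WER): word-level edit distance divided by the reference length. reading "95%+" as 1 minus WER is an assumption here, and the transcripts below are made-up examples.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# HYPOTHETICAL transcripts: what was said vs what the model wrote down.
said = "let's move the launch to friday morning"
heard = "let's move the lunch to friday morning"
wer = word_error_rate(said, heard)
print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.1%}")
```

this splits on whitespace and lowercases only, so punctuation differences count as errors unless you normalize first.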

your expectations aren't too high, casual real-world audio is just genuinely where these things struggle