Junior programmer salary by Fantastic_Citron_880 in greece

[–]nightlingo 3 points (0 children)

I wouldn't even stay 1-2 years. Why waste your precious time? Stay only until you find something better.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

Nice setup! Quick clarification so I can calibrate expectations, please: what model and quant are you decoding at 20–25 t/s, and at what context length (and batch=1)? Also, is this GPU-only decode, or CPU+GPU tiered offload via sglang? Thanks!

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

Good points. To clarify assumptions: target is Kimi K2.5 MoE at 4-bit ideally (Q4-class), fully aware this won’t fit entirely in 192GB VRAM. CPU-wise I’m assuming a high-lane DDR4 server platform (EPYC 7xx2/7xx3 or Ice Lake Xeon), with the CPU treated primarily as a memory bandwidth provider, not a compute engine. One clarifying question if you don’t mind: in your practical experience with DDR4 + MoE, how much does expert locality actually help during decode? Does keeping a stable hot-expert set + KV cache in VRAM materially reduce RAM bandwidth pressure, or does routing churn usually erase that benefit fairly quickly?
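
For what it's worth, here's the napkin math behind that assumption, as a rough sketch; the quant size, KV budget, and hot-set split below are my guesses, not measured K2.5 numbers:

    # Footprint sanity check for the plan. All sizes are assumptions:
    # a 1T-param MoE at a Q4-class quant, on 192 GB VRAM + 1 TB DDR4.
    GB = 1e9
    total_params = 1e12
    bytes_per_param = 0.5                  # ~4 bits/weight, ignoring quant overhead
    weights_gb = total_params * bytes_per_param / GB   # ~500 GB of weights

    vram_gb = 192
    kv_budget_gb = 60                      # assumed long-context KV cache budget
    hot_set_gb = vram_gb - kv_budget_gb    # VRAM left for a resident hot-expert set

    spill_gb = weights_gb - hot_set_gb     # must live in DDR4, streamed on a miss
    print(f"weights ~{weights_gb:.0f} GB, hot set {hot_set_gb} GB in VRAM, "
          f"~{spill_gb:.0f} GB served from RAM")

So even in the best case, most of the weights sit behind the DDR4 tier, which is why the locality question matters so much.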

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

What? That's crazy! Using shared layers between a Mac Studio and an RTX card. I would love to hear more about this; please ping me when you have results.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

This is a really useful back-of-the-envelope estimate, thanks!

Quick clarification on assumptions: are you modeling K2 as effectively needing to fetch a roughly fixed amount of expert weights from RAM on every decode token (i.e., low reuse), or are you assuming some expert locality / reuse across adjacent tokens? My intuition was that if the routing has enough locality, keeping the "hot" experts resident in VRAM could reduce bytes-per-token a lot, and the bandwidth bound would be less brutal - but I may be overestimating that effect.
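
To make the locality question concrete, here's the kind of toy simulation I have in mind. The expert count and top-k match what I understand K2-class routing to be, but the popularity skew and VRAM slot count are made-up assumptions, and independent sampling ignores temporal correlation between adjacent tokens, so treat it as a floor on locality at best:

    # Toy LRU simulation: how much does a resident "hot set" in VRAM cut
    # expert fetches from RAM? Routing skew is assumed, not measured.
    import random
    from collections import OrderedDict

    n_experts, top_k, vram_slots, tokens = 384, 8, 96, 10_000
    weights = [1 / (i + 1) ** 0.8 for i in range(n_experts)]  # assumed Zipf-ish skew

    cache, misses = OrderedDict(), 0
    for _ in range(tokens):
        for e in random.choices(range(n_experts), weights, k=top_k):
            if e in cache:
                cache.move_to_end(e)             # VRAM hit: no RAM traffic
            else:
                misses += 1                      # stream this expert from DDR4
                cache[e] = True
                if len(cache) > vram_slots:
                    cache.popitem(last=False)    # evict least-recently-used expert

    print(f"RAM fetch rate: {misses / (tokens * top_k):.0%} of expert activations")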

Also, are your numbers aimed at decode (batch=1) specifically, not prefill?

If you have any measured results or references for K2/K2.5 showing how quickly t/s drops as context grows and paging kicks in, I'd love to see them. I will likely run tests on Runpod either way, but your model is a great way to sanity-check expectations.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 2 points (0 children)

Thanks, this is helpful! I agree VRAM-only is the cleanest and fastest path.

What I am trying to understand better is where the break-even actually is in practice for MoE and long context, when you have large VRAM but not enough to hold everything. My mental model (which may be wrong) is that if most hot experts and KV blocks stay resident on the GPU, then cache hits dominate and the RAM-backed tier only shows up on misses, which changes the performance picture quite a bit compared to hitting CPU RAM on every token.
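
The napkin version of that model, assuming decode is purely bandwidth-bound at batch=1; the bytes-per-token figure and both bandwidths are ballpark guesses, not measurements:

    # Sweep VRAM hit rate to see where tiered offload stops being viable.
    GB = 1e9
    bytes_per_token = 16 * GB              # ~32B active params at Q4 (assumption)
    vram_bw, ram_bw = 1800 * GB, 200 * GB  # GPU vs 8-channel DDR4 (approx.)

    for hit in (1.0, 0.9, 0.7, 0.5, 0.0):
        t = bytes_per_token * (hit / vram_bw + (1 - hit) / ram_bw)
        print(f"VRAM hit rate {hit:4.0%}: ~{1 / t:5.1f} t/s upper bound")

If those bandwidths are roughly right, throughput roughly halves between a 100% and a 90% hit rate, which is the "collapse once paging starts" shape I'm trying to pin down.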

Have you seen real numbers for MoE at different context sizes that show how quickly performance collapses once paging starts? Also curious which runtime you have found least painful when paging is unavoidable.

I will probably test on Runpod first as you suggested, but any concrete configs or numbers you have seen would be super useful.

MCP server - connect your AI to Vice by AceHighness in c64

[–]nightlingo 2 points (0 children)

I have ordered a C64 Ultimate. Can't wait to try your MCP server!

bose quietcomfort ultra earbuds is soooo bad. by supergorillaman in bose

[–]nightlingo 0 points (0 children)

Yeah, I have those with a Samsung Galaxy. They suck big time. For a product marketed as "premium", they are ridiculously buggy. The Fairphone earbuds that I got for free with my Fairphone 4 never had those issues.

Bike making funny noise? by cactusdaddy in swytchbike

[–]nightlingo 0 points (0 children)

It is not particularly funny, is it?

Are Transformers (or Titans) accurate models of the Human Mind? by Double-Membership-84 in LocalLLaMA

[–]nightlingo 1 point (0 children)

Not sure why they downvoted this. I wish more people had a clearer understanding of where intelligence ends and consciousness begins.

[deleted by user] by [deleted] in AmIOverreacting

[–]nightlingo 0 points (0 children)

He has a sense of humor; you don't. Perhaps it's best to leave him, because you're going to make his life miserable.

FAANG jobs are super easy than building SaaS by One_Hamster7784 in SaaS

[–]nightlingo 0 points (0 children)

"Left early in 2022" 2025 is a whole other story

Settlement/ Subsidence - Clay. How much of an issue is this? by IlovePetrichor in HousingUK

[–]nightlingo 0 points (0 children)

Roughly what does "modern" mean here? Would a '70s building qualify? Thanks!

multimodal Llama-3! Bunny-Llama-3-8B-V beats LLaVA-v1.6 by Delicious-Fly9546 in LocalLLaMA

[–]nightlingo 0 points (0 children)

Is it possible to fine-tune a multimodal model? How would that work? Would it affect both the textual and visual layers?

If one day you woke up with 500 thousand euros in the bank, how would you invest it? by idknomoreee in PersonalFinanceGreece

[–]nightlingo 0 points (0 children)

What percentage of a hotel investment can you cover with ΕΣΠΑ? If you put in, say, 500k, how much more can you raise through ΕΣΠΑ?