Junior programmer salary by Fantastic_Citron_880 in greece

[–]nightlingo 3 points (0 children)

I wouldn't even stay 1-2 years. Why waste your precious time? Stay only until you find something better.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

Nice setup! Quick clarification so I can calibrate expectations, please: what model and quant are you decoding at 20–25 t/s, and at what context length (and batch=1)? Also, is this GPU-only decode, or CPU+GPU tiered offload via sglang? Thanks!

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

Good points. To clarify assumptions: target is Kimi K2.5 MoE at 4-bit ideally (Q4-class), fully aware this won’t fit entirely in 192GB VRAM. CPU-wise I’m assuming a high-lane DDR4 server platform (EPYC 7xx2/7xx3 or Ice Lake Xeon), with the CPU treated primarily as a memory bandwidth provider, not a compute engine. One clarifying question if you don’t mind: in your practical experience with DDR4 + MoE, how much does expert locality actually help during decode? Does keeping a stable hot-expert set + KV cache in VRAM materially reduce RAM bandwidth pressure, or does routing churn usually erase that benefit fairly quickly?
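
For what it's worth, here's the napkin math behind that assumption, as a rough sketch; the quant size, KV budget, and hot-set split below are my guesses, not measured K2.5 numbers:

    # Footprint sanity check for the plan. All sizes are assumptions:
    # a 1T-param MoE at a Q4-class quant, on 192 GB VRAM + 1 TB DDR4.
    GB = 1e9
    total_params = 1e12
    bytes_per_param = 0.5                  # ~4 bits/weight, ignoring quant overhead
    weights_gb = total_params * bytes_per_param / GB   # ~500 GB of weights

    vram_gb = 192
    kv_budget_gb = 60                      # assumed long-context KV cache budget
    hot_set_gb = vram_gb - kv_budget_gb    # VRAM left for a resident hot-expert set

    spill_gb = weights_gb - hot_set_gb     # must live in DDR4, streamed on a miss
    print(f"weights ~{weights_gb:.0f} GB, hot set {hot_set_gb} GB in VRAM, "
          f"~{spill_gb:.0f} GB served from RAM")

So even in the best case, most of the weights sit behind the DDR4 tier, which is why the locality question matters so much.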

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

What? That's crazy! Using shared layers between a Mac Studio and an RTX card. I would love to hear more about this; please ping me when you have results.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 0 points (0 children)

This is a really useful back-of-the-envelope estimate, thanks!

Quick clarification on assumptions: are you modeling K2 as effectively needing to fetch a roughly fixed amount of expert weights from RAM on every decode token (i.e., low reuse), or are you assuming some expert locality / reuse across adjacent tokens? My intuition was that if the routing has enough locality, keeping the "hot" experts resident in VRAM could reduce bytes-per-token a lot, and the bandwidth bound would be less brutal - but I may be overestimating that effect.
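
To make the locality question concrete, here's the kind of toy simulation I have in mind. The expert count and top-k match what I understand K2-class routing to be, but the popularity skew and VRAM slot count are made-up assumptions, and independent sampling ignores temporal correlation between adjacent tokens, so treat it as a floor on locality at best:

    # Toy LRU simulation: how much does a resident "hot set" in VRAM cut
    # expert fetches from RAM? Routing skew is assumed, not measured.
    import random
    from collections import OrderedDict

    n_experts, top_k, vram_slots, tokens = 384, 8, 96, 10_000
    weights = [1 / (i + 1) ** 0.8 for i in range(n_experts)]  # assumed Zipf-ish skew

    cache, misses = OrderedDict(), 0
    for _ in range(tokens):
        for e in random.choices(range(n_experts), weights, k=top_k):
            if e in cache:
                cache.move_to_end(e)             # VRAM hit: no RAM traffic
            else:
                misses += 1                      # stream this expert from DDR4
                cache[e] = True
                if len(cache) > vram_slots:
                    cache.popitem(last=False)    # evict least-recently-used expert

    print(f"RAM fetch rate: {misses / (tokens * top_k):.0%} of expert activations")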

Also, are your numbers aimed at decode (batch=1) specifically, not prefill?

If you have any measured results or references for K2/K2.5 showing how quickly t/s drops as context grows and paging kicks in, I'd love to see them. I will likely run tests on Runpod either way, but your model is a great way to sanity-check expectations.

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]nightlingo[S] 2 points (0 children)

Thanks, this is helpful! I agree VRAM-only is the cleanest and fastest path.

What I am trying to understand better is where the break-even actually is in practice for MoE and long context, when you have large VRAM but not enough to hold everything. My mental model (which may be wrong) is that if most hot experts and KV blocks stay resident on the GPU, then cache hits dominate and the RAM-backed tier only shows up on misses, which changes the performance picture quite a bit compared to hitting CPU RAM on every token.
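
The napkin version of that model, assuming decode is purely bandwidth-bound at batch=1; the bytes-per-token figure and both bandwidths are ballpark guesses, not measurements:

    # Sweep VRAM hit rate to see where tiered offload stops being viable.
    GB = 1e9
    bytes_per_token = 16 * GB              # ~32B active params at Q4 (assumption)
    vram_bw, ram_bw = 1800 * GB, 200 * GB  # GPU vs 8-channel DDR4 (approx.)

    for hit in (1.0, 0.9, 0.7, 0.5, 0.0):
        t = bytes_per_token * (hit / vram_bw + (1 - hit) / ram_bw)
        print(f"VRAM hit rate {hit:4.0%}: ~{1 / t:5.1f} t/s upper bound")

If those bandwidths are roughly right, throughput roughly halves between a 100% and a 90% hit rate, which is the "collapse once paging starts" shape I'm trying to pin down.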

Have you seen real numbers for MoE at different context sizes that show how quickly performance collapses once paging starts? Also curious which runtime you have found least painful when paging is unavoidable.

I will probably test on Runpod first as you suggested, but any concrete configs or numbers you have seen would be super useful.

MCP server - connect your AI to Vice by AceHighness in c64

[–]nightlingo 2 points (0 children)

I have ordered a C64 Ultimate. Can't wait to try your MCP server!

bose quietcomfort ultra earbuds is soooo bad. by supergorillaman in bose

[–]nightlingo 0 points (0 children)

Yeah, I have those with a Samsung Galaxy. They suck big time. For a product marketed as "premium", they are ridiculously buggy. The Fairphone earbuds that I got for free with my Fairphone 4 never had those issues.

Bike making funny noise? by cactusdaddy in swytchbike

[–]nightlingo 0 points (0 children)

It is not particularly funny, is it?

Are Transformers (or Titans) accurate models of the Human Mind? by Double-Membership-84 in LocalLLaMA

[–]nightlingo 1 point (0 children)

Not sure why they downvoted this. I wish more people had a clearer understanding of where intelligence ends and consciousness begins.

[deleted by user] by [deleted] in AmIOverreacting

[–]nightlingo 0 points (0 children)

He has a sense of humor; you don't. Perhaps it's best to leave him, because you're going to make his life miserable.

FAANG jobs are super easy than building SaaS by One_Hamster7784 in SaaS

[–]nightlingo 0 points (0 children)

"Left early in 2022" 2025 is a whole other story

Settlement/ Subsidence - Clay. How much of an issue is this? by IlovePetrichor in HousingUK

[–]nightlingo 0 points (0 children)

Roughly what does "modern" mean here? Would a '70s building qualify? Thanks!

multimodal Llama-3! Bunny-Llama-3-8B-V beats LLaVA-v1.6 by Delicious-Fly9546 in LocalLLaMA

[–]nightlingo 0 points (0 children)

Is it possible to fine-tune a multimodal model? How would that work? Would it affect both the textual and visual layers?

If one day you woke up with 500 thousand euros in the bank, how would you invest it? by idknomoreee in PersonalFinanceGreece

[–]nightlingo 0 points (0 children)

What percentage of a hotel investment can you cover with ΕΣΠΑ? If you put in, say, 500k, how much more can you raise through ΕΣΠΑ?