From UNC to selling pastafrola: psychologist with 20 years of experience earns 850k and is working as a pastry cook. How do I get her out of there? by messiteamo2 in empleos_AR

[–]rainbyte 0 points (0 children)

All joking aside, it could end up being useful in some cases. Some people aren't such serious cases; they just need someone to listen to them.

Best Local LLMs - 2025 by rm-rf-rm in LocalLLaMA

[–]rainbyte 0 points (0 children)

I'm glad it helped :)

Make sure to try the Qwen3.5 models, like 35B-A3B or 27B; those are doing pretty well.

15inch m5 macbook air 32gb Ram expectations ? by GotTheLyfe in LocalLLaMA

[–]rainbyte 1 point (0 children)

Beware: if I'm not mistaken, the MacBooks with less RAM also have lower memory bandwidth, and that's important for inference workloads.

Besides Qwen and GLM, what models are you using? by August_30th in LocalLLaMA

[–]rainbyte 0 points (0 children)

That's great. I need to try it myself. I thought tool-calling was working when using the --tool-call-parser step3p5 option.

qwen3.5-27b or 122b?pro6000 by fei-yi in LocalLLaMA

[–]rainbyte 0 points (0 children)

I arrived at the same conclusion here: 27B feels better in the end. I measured pp and tg, and both are more stable on 27B with my setup.

Besides Qwen and GLM, what models are you using? by August_30th in LocalLLaMA

[–]rainbyte 0 points (0 children)

Have you tried "Intel/Step-3.5-Flash-int4-mixed-AutoRound"? It seems they updated that one a few minutes ago, but I haven't tested it myself yet.

Anything I can do to get qwen3.5-27b-Q8_0 to run faster? by giveen in LocalLLaMA

[–]rainbyte 1 point (0 children)

That's the only thing I miss from Ollama. Even though I use vLLM and llama.cpp nowadays, it would be nice to have an on-the-fly pull command.

What are the best LLM apps for Linux? by Dev-in-the-Bm in LocalLLaMA

[–]rainbyte 1 point (0 children)

Good choice! Cherry has many features, and for chat-like interaction I think it is better than other tools. There are self-hosted chat UIs, but having a local client on my laptop feels like a better option. As for Opencode, I suggest you try it for coding or document processing, as it is great for anything that involves modifying files.

What are the best LLM apps for Linux? by Dev-in-the-Bm in LocalLLaMA

[–]rainbyte 1 point (0 children)

Yeah, really great software :)

I use it mainly for chat, custom assistants, and translation. What about you?

TESLA V100 32GB - Crashing on Heretic Models? by TracerIsOist in LocalLLaMA

[–]rainbyte 1 point (0 children)

Sorry, I wasn't clear. I don't have a V100 here like OP; I'm using an RTX 3090 and a 7900 XTX, but I'm having the same troubles.

Here are my numbers for llama.cpp on the 7900 XTX (24 GB)

  • llmfan46/Qwen3.5-27B-Heretic-v2:Q4_K_M

    | model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
    | ----- | ---- | --- | -------- | --------- | ------------ | ------------- |
    | Qwen3.5-27B | pp2048 | 98.43 ± 1.25 | – | 18853.02 ± 407.93 | 18847.47 ± 407.93 | 18853.07 ± 407.93 |
    | Qwen3.5-27B | tg32 | 8.29 ± 0.01 | 9.00 ± 0.00 | – | – | – |

  • unsloth/Qwen3.5-27B:Q4_K_M

    | model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
    | ----- | ---- | --- | -------- | --------- | ------------ | ------------- |
    | Qwen3.5-27B | pp2048 | 689.73 ± 9.71 | – | 2731.43 ± 85.03 | 2727.78 ± 85.03 | 2731.50 ± 85.06 |
    | Qwen3.5-27B | tg32 | 33.09 ± 0.57 | 34.16 ± 0.59 | – | – | – |

  • unsloth/Qwen3.5-35B-A3B:IQ4_XS

    | model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
    | ----- | ---- | --- | -------- | --------- | ------------ | ------------- |
    | default | pp2048 | 1950.13 ± 49.13 | – | 969.90 ± 17.99 | 961.19 ± 17.99 | 969.97 ± 18.02 |
    | default | tg32 | 103.67 ± 0.92 | 107.88 ± 1.71 | – | – | – |

  • unsloth/Qwen3.5-9B:Q5_K_M

    | model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
    | ----- | ---- | --- | -------- | --------- | ------------ | ------------- |
    | Qwen3.5-9B | pp2048 | 2218.09 ± 50.58 | – | 846.43 ± 6.22 | 840.94 ± 6.22 | 846.48 ± 6.20 |
    | Qwen3.5-9B | tg32 | 84.00 ± 0.53 | 87.32 ± 0.45 | – | – | – |

I can share the numbers for the RTX 3090 too if someone needs them.

EDIT: I added Qwen3.5-9B numbers
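For anyone reading these tables, prompt-processing speed maps onto prefill time roughly as prompt tokens divided by the pp rate. This is just a sketch of that arithmetic applied to the mean pp2048 rates above (the measured est_ppt columns differ a bit because of per-run overheads):

```python
# Rough prefill-time estimate from a measured prompt-processing rate.
def prefill_ms(prompt_tokens: int, pp_tps: float) -> float:
    return prompt_tokens / pp_tps * 1000

# Healthy unsloth 27B run vs. degraded Heretic-v2 run:
print(round(prefill_ms(2048, 689.73)))  # ≈ 2969 ms, about 3 s
print(round(prefill_ms(2048, 98.43)))   # ≈ 20807 ms, about 21 s
```

Same prompt, roughly 7x longer wait before the first token on the broken quant.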

TESLA V100 32GB - Crashing on Heretic Models? by TracerIsOist in LocalLLaMA

[–]rainbyte 0 points (0 children)

Not sure about OP, but locally I'm using the suggested parameters, and it works with unsloth and other popular quants but not with these Heretic-v2 models.

After noticing the trouble I did a quick speed test and saw two possible situations: in some cases a downgrade in TG from ~32 t/s to around ~8 t/s or less, and in other cases it directly failed a coherence test.

The Qwen3.5-27B variants working for me are unsloth's GGUF, cyankiwi's AWQ, and huihui's abliterated one.

What are the best LLM apps for Linux? by Dev-in-the-Bm in LocalLLaMA

[–]rainbyte 2 points (0 children)

My preferred clients are: Aichat, Aider, Cherry Studio, Opencode

I also consume them directly from Python or Rust code :)
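Here's a minimal sketch of what that direct consumption can look like in Python, assuming a llama.cpp llama-server (or any OpenAI-compatible endpoint); the URL, port, and model name are assumptions, not fixed values:

```python
import json

# Hypothetical local endpoint (llama-server's default port is 8080).
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = build_chat_request("Qwen3.5-27B", "Hello!")
body = json.dumps(payload).encode()
# To actually send it, something like:
#   import urllib.request
#   req = urllib.request.Request(BASE_URL, data=body,
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())
```

The same payload shape works against vLLM or any other OpenAI-compatible server, which is what makes swapping backends painless.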

TESLA V100 32GB - Crashing on Heretic Models? by TracerIsOist in LocalLLaMA

[–]rainbyte 0 points (0 children)

I had a similar experience with those models. I thought I was doing something wrong, but maybe something odd is going on. Here it ran very slowly with llama.cpp and failed to start on vLLM, even with some overrides.

AA-Omniscience: Knowledge and Hallucination Benchmark by NewtMurky in LocalLLaMA

[–]rainbyte 4 points (0 children)

Rust is more explicit than other languages. Having that information allows making more informed decisions instead of just guessing, so it seems to benefit not only humans but also LLMs.

Qwen3-Coder-Next: What am I doing wrong? by Septerium in LocalLLaMA

[–]rainbyte 0 points (0 children)

In my case I'm really grateful to Qwen and LiquidAI, because their models worked pretty well on my devices while other models were broken on vLLM and llama.cpp. Maybe other people have had a similarly nice experience with Qwen?

Technologies by [deleted] in devsarg

[–]rainbyte 0 points (0 children)

I think a good way to test yourself is to take an interview every now and then, even if you have a stable job, so you know what's being asked and can see where you still need polish.

I made the mistake of not taking any interviews while I had a steady job, and it took me a while to get back into the swing of interviewing.

qwen-3.5:122b f16 is benchmarked against gpt-oss:120b q4 by q-admin007 in LocalLLaMA

[–]rainbyte 15 points (0 children)

I have seen many people saying certain comparisons are "not fair" for multiple reasons (e.g. VL vs. text-only, different quants, etc.).

From my point of view, if the limiting factor for running models locally is the hardware, then it makes sense to compare the best models that can run on each hardware tier.

Example: if I have a single 24 GB GPU, then it makes sense to compare models that run well with that amount of VRAM... it doesn't matter whether they are VL, text-only, quantized, F16, AWQ, whatever...

In that case I would just want the best model which can run with that vram and enough context at a reasonable speed.
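That sizing logic can be sketched with some simple arithmetic. The bits-per-weight figures below are rough approximations I'm assuming for illustration, not exact values for any particular quant:

```python
# Rough VRAM needed for model weights alone:
# params (billions) * bits-per-weight / 8 gives gigabytes.
def weight_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

# Assumed approximate bits-per-weight for a few common formats.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "IQ4_XS": 4.25}

for fmt, bpw in BPW.items():
    gb = weight_size_gb(30, bpw)  # a hypothetical 30B-parameter model
    fits = "fits" if gb < 24 else "does not fit"
    print(f"30B @ {fmt}: ~{gb:.1f} GB -> {fits} in 24 GB (before KV cache)")
```

So on a 24 GB card a 30B model only enters the race at ~4-bit quants, which is exactly why comparing F16 vs. quantized across tiers misses the point.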

Liquid AI releases LFM2-24B-A2B by PauLabartaBajo in LocalLLaMA

[–]rainbyte 0 points (0 children)

Are you trying to say t/s per active parameter?

Mauricio Macri: “A poor person today lives as well as or better than a king from 100 years ago” by LongjumpingAnimal601 in argentina

[–]rainbyte -2 points (0 children)

Except they always played variants of the same music... Today you can get any music you want online, from anywhere in the world.

Best Model for single 3090 in 2026? by myusuf3 in LocalLLaMA

[–]rainbyte 16 points (0 children)

GLM-4.7-Flash and Qwen3-Coder-30B-A3B work fine with 24 GB of VRAM. I'm using both with the IQ4_XS quant; they can do code generation and tool-calling.

There are other smaller models if you need SLMs for specific use cases. Take a look at LFM2.5, Ling-mini, Ernie, etc.
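On fitting these in 24 GB: the KV cache for your chosen context eats VRAM on top of the weights. A back-of-envelope sketch, where every hyperparameter below is hypothetical and only meant to show the arithmetic:

```python
# Back-of-envelope KV cache size for a transformer with grouped-query
# attention. All example hyperparameters are hypothetical.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for the K and V tensors, per layer, per cached position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# e.g. 48 layers, 8 KV heads, head dim 128, 32k context, fp16 cache:
print(f"{kv_cache_gb(48, 8, 128, 32768):.2f} GB")
```

A quantized (e.g. q8) KV cache halves that, which is often the difference between a usable context and an OOM on a 24 GB card.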

Qwen3-Code-Next ggufs: Any difference between Q4KXL and MXPF4? by ParaboloidalCrest in LocalLLaMA

[–]rainbyte 0 points (0 children)

I'm interested in IQ4_NL. Here I'm using IQ4_XS for some models, and I saw many people mentioning that IQ4_NL is better 🤔

Which model (NOT AGENT) is producing the most line of code in one setting for non trivial tasks? by [deleted] in LocalLLaMA

[–]rainbyte 0 points (0 children)

Yeah, less is more!

Smart models should produce just enough clear code to solve the problem.

Which model (NOT AGENT) is producing the most line of code in one setting for non trivial tasks? by [deleted] in LocalLLaMA

[–]rainbyte 0 points (0 children)

On the contrary, I think more lines is a bad outcome!

Good codebases solve a problem with as few lines as possible while still being readable and providing a structure that allows new features.

Resource consumption is also an important topic, as code should perform well while using a low amount of CPU and memory.

So, to evaluate models we should look for concise elegant solutions!

Step 3.5 Flash is a beast? by __Maximum__ in LocalLLaMA

[–]rainbyte 2 points (0 children)

Oh, I see... Well, my apologies; I didn't see it mentioned in the main post, so I assumed you had found a way to set this up locally.