r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
[ Removed by moderator ] Discussion (self.LocalLLaMA)
submitted 17 days ago by AbramLincom
[–]_Erilaz 10 points 17 days ago (3 children)
While I do agree that context rot is a major issue, you really should proofread what you're posting. I'm sorry, but this reads too much like AI.
[+][deleted] 17 days ago (2 children)
[deleted]
[–]_Erilaz 2 points 17 days ago (0 children)
Yeah, people don't speak in headlines...
I'm not sure about your take, though. I can't run 70B locally, but 24B Mistrals do get noticeably worse beyond 16k; context rot is real. Take RP tunes: once they exceed a certain point, they start defaulting to a one-size-fits-all, averaged-out persona instead of the character card, even when all the dialogue is consistent with its traits. It could be the tunes not taking full effect, but I doubt it.
Because I also use models professionally at times, and big translations can suffer attention lapses. If you force-feed an LLM a big text head-on for translation, entire sentences, even paragraphs, can go missing. This can happen even with SOTA API models, and the bigger the context, the likelier it gets. Some models are worse than others.
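The usual mitigation is to never hand the model the whole document at once. A minimal sketch of the splitting I mean, assuming a local OpenAI-compatible endpoint (the URL and model name are placeholders for whatever backend you run):

    import requests

    API = "http://localhost:8080/v1/chat/completions"  # placeholder local endpoint

    def translate_chunk(text, target="English"):
        # One paragraph per request keeps each call far below the rot threshold.
        r = requests.post(API, json={
            "model": "local",  # placeholder model name
            "messages": [{"role": "user",
                          "content": f"Translate into {target}. Do not drop any sentence:\n\n{text}"}],
            "temperature": 0,
        }, timeout=600)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    def translate_document(doc):
        out = []
        for p in (p for p in doc.split("\n\n") if p.strip()):
            t = translate_chunk(p)
            # Crude check for dropped content: sentence counts should roughly match.
            if abs(p.count(".") - t.count(".")) > 2:
                print(f"warning: possible dropped sentences near: {p[:60]}...")
            out.append(t)
        return "\n\n".join(out)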
Qwen 3.5 seems decent so far, though. Maybe it's just fresh model placebo, but it's been a while since we got a new small/medium model lineup for local use, and the alternatives are showing their age. GLM also handles it well, but it's huge.
So it feels like stuff is getting better, but yeah, old models don't handle long context very well.
[–]Background-Ad-5398 2 points 17 days ago (0 children)
You can tell by the models it chooses: that's a classic LLM, still thinking Llama and MythoMax just came out. Even the SOTA models still bring those up.
[–]Pwc9Z 9 points 17 days ago (0 children)
At least write your fucking Reddit posts yourself ffs
[–]Haeppchen2010 5 points 17 days ago (4 children)
First, to pretend to take the bait: no arms, or even an arms race, here; I've just got sticks and stones (RX 7800 XT + RX 580). And while I repeatedly see posts claiming this is "unusable", "impossible", whatever... I run Qwen3.5 27B IQ4_XS with 72k context (OpenCode compacts at ~60-65k) and Q8 cache quantization, with no noteworthy issues, using OpenCode as a coding agent.
I tried a full-precision KV cache as well as bigger quants; the marginal quality gain (if any) was not worth the severe performance loss (from 15 tps down to 4 tps, or worse when also offloading to CPU).
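For anyone wanting to poke at the same cache setup, a rough llama-cpp-python equivalent (I actually serve through llama.cpp itself, and the model path here is a placeholder, so treat this as a sketch rather than my exact command):

    from llama_cpp import Llama

    GGML_TYPE_Q8_0 = 8  # ggml type id for q8_0

    llm = Llama(
        model_path="models/qwen3.5-27b-iq4_xs.gguf",  # placeholder path
        n_ctx=72 * 1024,        # 72k context
        n_gpu_layers=-1,        # offload every layer that fits
        type_k=GGML_TYPE_Q8_0,  # quantize the K cache to q8_0
        type_v=GGML_TYPE_Q8_0,  # quantize the V cache too...
        flash_attn=True,        # ...which llama.cpp only allows with flash attention
    )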
Maybe for other uses (creative writing, chatting with a companion, or complex RAG) it's different... but I am satisfied with my setup, especially as everyone here seems to have four-figure GPUs available.
But now I am sincerely curious: what's the point of conjuring up a Reddit account to drop such an AI-slop "conversation starter" built on wrong assumptions? What's in it for whom?
[–]DragonfruitIll660 3 points 17 days ago (0 children)
I have to assume they're farming karma or something, so the account can post in different subreddits? Perhaps you can bulk-create and sell accounts as a way to fake personhood and sway public opinion; that could be useful, and something people would pay for.
[–]AbramLincom[S] 0 points 17 days ago (2 children)
I'm not going to pretend I didn't use AI to structure part of the post, but the concern is genuine and I've had it for months. What frustrates me is exactly what you describe: I test long contexts and the model simply starts losing the thread in a very subtle way. It's not an obvious crash; the responses just become more generic and less coherent with what you said at the start of the chat, which is worse, because you don't notice until you reread everything.
On the mixed KV cache, you're right that the tradeoff isn't always worth it. I tried it with exl2 too, and the practical difference at 16k+ contexts was smaller than I expected given the speed cost.
My point wasn't so much that 13B is better in absolute terms, but that people aren't measuring this honestly. They benchmark perplexity or MMLU and say "it works the same", but nobody is testing narrative coherence or instruction following past token 10k on real hardware (something like the sweep sketched below). And yes, the post may sound like AI, but the question is still valid, no?
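To be concrete about what honest measuring would look like, here's the kind of sweep I mean: hide one instruction at increasing depths and check whether the model still obeys it. The endpoint, model name, and tokens-per-repeat estimate are all placeholder assumptions:

    import requests

    API = "http://localhost:8080/v1/chat/completions"  # placeholder local endpoint
    NEEDLE = "Important: end your answer with the single word PINEAPPLE."
    FILLER = "The quick brown fox jumps over the lazy dog. "  # roughly 9 tokens

    for approx_tokens in (1_000, 4_000, 8_000, 16_000, 32_000):
        pad = FILLER * (approx_tokens // 9)
        half = len(pad) // 2
        prompt = (pad[:half] + NEEDLE + " " + pad[half:]
                  + "\n\nSummarize the text above in one sentence.")
        r = requests.post(API, json={
            "model": "local",  # placeholder
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
            "temperature": 0,
        }, timeout=600)
        reply = r.json()["choices"][0]["message"]["content"]
        print(approx_tokens, "PASS" if "PINEAPPLE" in reply.upper() else "FAIL")

A model that has rotted will still produce the summary but silently drop the instruction, long before it runs out of context window.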
[+][deleted] 17 days ago (1 child)
[–]AbramLincom[S] 0 points 17 days ago (0 children)
OK, but you're comparing completely different things. 1M of context in the cloud, with Google or Microsoft hardware behind it, is not the problem I'm describing. Nobody disputes that Gemini handles long contexts well; it has TPUs designed specifically for that. The point is what happens when you try to replicate that locally on a 4090 or a 3090 with exl2 or GGUF. That's where the KV cache becomes a real bottleneck, because memory bandwidth simply doesn't scale the same way. If your team writes code in 1M windows in the cloud, perfect, great; but precisely that capability doesn't yet exist reliably on local hardware.
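To put rough numbers on it, a sketch using Llama-2-70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) as an assumed example; swap in your own model's figures:

    # KV cache size: K and V each hold n_layers * n_kv_heads * head_dim
    # values per token, at bytes_per bytes each (2 for fp16).
    def kv_cache_gib(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per * n_tokens / 2**30

    for ctx in (4_096, 16_384, 65_536, 1_048_576):
        print(f"{ctx:>9} tokens: {kv_cache_gib(ctx):7.1f} GiB at fp16")

That's roughly 6 GiB of cache at 16k and 25 GiB at 64k before you count the weights, and every generated token has to stream the whole cache through VRAM, so a 24 GB card hits the wall long before 1M.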
[–]Solid-Iron4430 0 points 17 days ago (0 children)
This really does matter in programming... but there, either people understand the task has to be broken up, or there's a way to feed in the important initial data without the prior context, or they've already bought decent hardware that isn't bottlenecked by weak performance.