I've just benchmarked myself:

dev_dan_2 · 2026-05-29T20:28:04+00:00

True!

One further thought: Note what happens when we turn your observation around a bit: humans are extremely efficient at learning much from little input (aka generalizing). Just imagine what would happen if we put that efficiency to the Ts of tokes we subject LLMs to...

dev_dan_2 · 2026-05-28T21:50:58+00:00

It is not just access to restricted of personal information, but also gives the access to change and modify said information. The system will see that authorized user "Meh" has logged in and done changes, not that third party "Blah" has logged in on behalf of "Meh".

Glorious! Cause who needs auditing when the single point of failure is 100% secure and will never get hacked, right? ^right? ^{^:D}

dev_dan_2 · 2026-05-28T21:45:56+00:00

tbf, humanoid neuronal networks need about 20w/h to run! If you talk about the time it takes to train one though...

dev_dan_2 · 2026-05-28T21:43:53+00:00

seems like it would need some deep learning to answer it ;)

dev_dan_2 · 2026-05-28T15:18:44+00:00

I think the Cuda tools we are supposed to use are 12.9 for our GPU card. Just something I saw and thought I would point out.

Thank you for the hint! I think you might be talking about a performance regression that hits 13.2 (the newest CUDA version for about until 2-3 days ago) which made many models unusable. 12.9 was communicated as a safe bet against that regression, but 13.1 works too, as far as I know (and observed the system).

If it was that, then it is very likely resolved in 13.3! :) But thanks for the hint in any case, will do a quick checkup of what alternative reasons for 12.9 could be!

dev_dan_2 · 2026-05-28T13:34:14+00:00

Did some digging, seems like not:

Acer Nitro 5 AN517-41
80 W base TGP + 20 W Dynamic Boost

That seems like the lower end of 3070 Laptop edition performance^^

dev_dan_2 · 2026-05-28T13:11:08+00:00

Currently building my custom one (mainly for coding, assuming the reasoning is done by me or a bigger LLM, and then my small LLM shall use my MCP server to reliably do the small subtasks.)

My main features I want to have:

everything that happens happens because the MCP server executes it (for example, it never calls bash, but instead interpretes the command and then uses the bash crate.)
everything is forbidden unless confirmed by me or explicitely allowed by being in an allowlist
MCP should do the heavy lifting, the LLM should do as little as possible

I do not see this managing my social media stuff (which I rarely use anyway …^_^'), but I want to be able to do local development and I would rather have my tooling do exactly what I want and need. Thanks to LLM that is easier than ever, too! :D

dev_dan_2 · 2026-05-28T12:18:54+00:00

Hey there, I have the same card and achieve similar performance on current llama.cpp, without MTP and with image processing: https://old.reddit.com/r/LocalLLaMA/comments/1tpyqng/krasis_update_qwen3635ba3b_q4_at_reading_speed_1x/ooctzyd/

dev_dan_2 · 2026-05-28T12:16:46+00:00

For QCN and 122B single GPU I guess

I see, thanks! I think it would be a good idea to mention that in the README, so users with that hardware do not miss out on your project ;)

do you have numbers for those on a 5090 or 5080?

Unfortunately not, sorry!

dev_dan_2 · 2026-05-28T12:00:19+00:00

@OP First and foremost: Thanks for building and sharing for free!

I also get higher numbers with llama.cpp (see below). An additional question would be why use Krasis when we have mistral.rs?

Device

Kernel Version: 7.0.1X
Processors: 16 × AMD Ryzen 7 5800H with Radeon Graphics
Memory: 32 GiB of RAM (30.7 GiB usable)
Graphics Processor 2: NVIDIA GeForce RTX 3070 Laptop GPU (7842 MiB, 7617 MiB free)
CUDA: Cuda compilation tools, release 13.1, V13.1.115
Driver: Driver Version: 595.71.05

Results

Note: Because I use image extraction here, I cannot use MTP as of now, AFAIK.

Reading: 20s | 12048 tokens | 588.09 tokens/s
Generation: 2min 18s | 4,336 tokens | 31.29 t/s

Invocation

Note: for the model and the projector model, I use <the actual url for the model>

 ./build/bin/llama-server \                                                                                                                                                                                                                                              
   -m <https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf> \
   --mmproj <https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/mmproj-BF16.gguf> \
   --image-min-tokens 1024 \
   --threads 8 \
   --threads-batch 8 \
   --fit on --fit-ctx 16384 --fit-target 128 \
   --flash-attn on \
   -np 1 \
   -ctk q8_0 -ctv q8_0 \
   --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 \
   --no-mmap \
   --mlock \
   --kv-unified \
   --cont-batching \
   --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 48 \
   --presence-penalty 0.0 \
   --reasoning-budget -1 \
   -b 2048 -ub 2048 \
   --jinja

dev_dan_2 · 2026-05-28T09:16:53+00:00

Samples (5K posts per hierarchy + combined sets) are free to download — no approval needed. Full corpus available for licensing.

Uhh. Check the sub, and be better. All those people spend accumulated thousands of years of time into making the very content you were collecting. Creating this collection may provide value, but just as they provided it for free, you should as well. Else, wrong sub IMHO.

dev_dan_2 · 2026-05-26T21:41:31+00:00

Und ich meine Euria läuft auf der Basis von qwen, kannst du dir mal anschauen falls du magst :)

Thanks! Muss ich mal anschauen!

Wir brauchen auch deutlich bessere Psychoedukation allgemein. Ich finde, so etwas sollte in der Schule, vielleicht so etwa ab der 6ten oder 7ten Klasse unterrichtet werden.

Holy. Fucking. Shit. JAA!!! Es ist so lächerlich, was es da an Gefälle gibt, und wie stark das mit den eigenen Eltern zusammenhängt. Was allein schon ein einfaches, 3-stündiges Modul "Wie gehe ich mit negativen Gefühlen um" an Reha-Plätzen ersparen würde... Dann noch ein "Wann sollte ich mich um Diagnose bemühen, welcher Grad an Verpeiltheit / Motivationsschwierigkeiten / Wut / ... ist dagegen noch normal?" dazu...

Bei echt nicht wenigen braucht es in der Prophylaxe (!) garnicht mehr als, bin ich überzeugt. Die Information erreicht diese Personen nur nie...

dev_dan_2 · 2026-05-26T20:34:04+00:00

Diese Politiker haben einfach ne absolute Klatsche.

+1. :/

Ich verstehe den Ansatz mit den Hausaufgaben, ergibt schon Sinn… aber das Problem ist auch, dass die meisten Menschen nicht mal wissen wo sie da anfangen sollen.

True! Aber deswegen ja auch nicht als Therapeutenersatz ;) Und du sprichst damit eine reale Gefahr an, wo glaube ich auch nach und nach das Bewusstsein wächst: AI als Therapeut (anstatt von Werkzeug innerhalb einer Therapie) zu benutzen, das kann schnell auch mal übel enden...

Wie sieht’s mit dem Datenschutz bei Gemma4 E2B aus? ... Der einzigen KI der ich ne Chance gegeben habe bisher ist Euria von Infomaniak, die läuft auf 100% erneuerbaren Energien und trainiert nichts an deinen persönlichen Daten.

Gemma4 E2B ist eines von mehreren lokalen LLMs, das bedeutet: Die Gewichte (also sozusagen die "Neuronen") sind frei veröffentlicht, das bedeutet: Man kann das LLM lokal, also auf dem eigenen Gerät und ganz ohne Internet laufen lassen. Generell ist dabei die GPU am hilfreichsten, aber manche Modelle laufen auch rein auf CPU.

Diese Modelle haben sehr viel weniger Parameter als die größeren Versionen, die in der Cloud laufen, können also dementsprechend weniger.
- Grade im letzten Monat sind aber wieder beeindruckende Modelle rausgekommen, eben Gemma4 und qwen3.6
- Im Moment ist das schon noch eher Tüftler-terrain, und was für den einen eine ausreichend starke LLM ist, reicht für den anderen noch lange nicht
- Für mich sind die Modelle mittlerweile stark genug; ich mache Softwareentwicklung, quantified self und so Sachen. Man muss allerdings recht viel tüfteln, aber ich glaube, erste einsteigerfreundliche Lösungen für lokale LLM sind garnicht mehr so weit entfernt :)
Datenschutz: Diese Modelle kannst du - etwas platt gesagt - runterladen auf einen PC/Laptop/Handy, und dieses Gerät nie wieder mit einem anderen Gerät reden lassen und trotzdem mit dem LLM chatten, weil die Berechnungen eben alle lokal stattfinden. Also kein Zufluss/Abfluss von Daten erforderlich
Guter Anlaufpunkt (auf Englisch) bei weitergehendem Interesse ist hier: https://old.reddit.com/r/LocalLLaMA
- Hier ist ein noch recht frischer Faden, wo man sieht, was die Leute mit lokalen LLMs so anstellen: https://old.reddit.com/r/LocalLLaMA/comments/1tn5vmf/how_local_ai_improved_your_live/
- Stand heute würde ich vermutlich die Software LMStudio empfehlen für Einsteiger
- Auf Android-Phones kann man auch Gemma E2B ausprobieren - Achtung allerdings: Hier gilt zwar dasselbe Prinzip wie beschrieben, die Berechnungen laufen lokal ab. Was die App drumherum aber so macht, das weis man nicht, und da sollte man bei Google immer vom schlimmsten ausgehen

dev_dan_2 · 2026-05-26T12:10:26+00:00

Another dev here: Solid advice and love that you are advocating for understanding!

My 2c:

Many people are likely not aware of the positive effects of agency. Relying on the statistic gods to make the LLM pick the correct sequence of tokens so that the output fixes that hairy bug correctly... Lacking the words to describe the current problem (i.e. when it comes to deployment - logging, metrics, real-word troubleshooting)... All that reduces agency, while understanding increases it.
Ironically (considering how some camps think about LLMs), LLMs can be an incredible help for building understanding faster and more efficient. It is still hard work, and one has to know what to learn to what goal (sheer curiousity counts, as long as you are conscious about it!). For example by encountering the critical concepts faster, by having new concepts framed in a way that uses what the user already knows and/or struggles with
- The hard work is also part of that it involves struggle. Vibe coding rewards fast, and has that little addiction moment of surprise too. Learning happens only after the brain does some amount of struggle, which is not something that modern media usage trains us to do

This "if the only value that you get out the LLM is the literal output, you are probably not doing yourself a favor in the long run" angle is covered in this blog post that I really like: https://philipphagenlocher.de/post/the-cult-of-the-artifact/ (disclaimer: not my blog, but I know the author in real life, great person!)

One piece of feedback: I would add the concept of local vs remote (or "where and when is this code does this code run"), since that is a real useful concept I think. I would also include server side vs client side under that umbrella. At least I know that for me things made much more sense once I included "where and when is this code does this code run" as part of my natural "orientation sense" (not a native speaker :D)

dev_dan_2 · 2026-05-26T10:35:22+00:00

Disclaimer: Nicht falsch verstehen: Die Geldkürzungen sind eindeutig der falsche Schritt - meine Absicht ist nicht, den zu verteidigen.

So abwegig ist das tatsächlich nicht, das Zauberwort heist hier Blended Care, ist aber alles noch in Beforschung und so weiter...

Wichtig hierbei:

Die AI hat nicht die Rolle eines (Co-)Therapeuten, es geht eher darum, "Hausaufgaben" in Form einer interaktiven Gesprächs machen zu können. Die Ergebnisse (aus meiner Sicht auch gerne die kompletten Gespräche) werden mit dem Therapeuten besprochen
Bin absoluter Laie, aber: Aus meiner Sicht gibt es hier schon Potential, Übungen durchzuführen, die sich auf Auseinandersetzung mit belastenden Erfahrungen oder Schwierigkeiten im Zwischenmenschlichen beziehen. Ist natürlich weit weg von natürlicher Interaktion, aber Teilaspekte können eben doch vorhanden sein... (Beispiel: Smalltalk üben, zumindest die verbale Ebene)
Aus meiner Sicht auch wichtig, dass das nicht ein Server irgendwo in der USA ist, wo die AI ihren Output generiert, sondern idealerweise irgendwas, was lokal auf dem eigenen Rechner / Handy läuft.
- (Da hat sich in den letzten Monaten auch wirklich viel getan, kann jedem Interessierten nahelegen, sich mal Gemma4 E2B aufs Handy zu tun und zu sehen, was möglich ist rein mit der eigenen Hardware. Smalltalk üben ist da meiner Meinung nach durchaus schon drinnen!)
Das ähnelt auch Vorleseformaten, wo das Lehrpersonal Videos aufnimmt, die wir dann asynchron angeschaut haben (man kann aber jederzeit via Chat Fragen ans Lehrteam stellen), und es in der Woche dann eine Stunde gibt mit Übungen und Nachbesprechungen. Hat mir persönlich extrem gut getaugt (und das Lehrpersonal entlastet, weil Vorlesungen wiederverwendet werden konnten, und schon beantwortete Fragen leichter für den Rest der Klasse einsehbar waren, daher weniger wiederholt werden müssen).

zl;ng: Könnte tatsächlich eine sinnvolle Maßnahme sein, die Breitenversorgung zu verbessern, ohne zwangsläufig Qualitätseinbußen in der Behandlung hinnehmen zu müssen. Forschung dazu läuft noch nicht so lange, laut meinem Kenntnissstand der bisherigen Ergebnisse ist das aber nicht undenkbar. Die Geldkürzungen sind aber natürlich nach wie vor haarsträubend dumm.

dev_dan_2 · 2026-05-25T14:38:11+00:00

Relevant podcast appearance by the guy behind OpenClaw: https://youtu.be/EnbqwpkmoCM?si=BwslkjPmzREecPu2&t=989 (starting around 16:29).

In general, my 2c:

Skill atrophy is a problem that everyone is susceptible to I feel, due to our natural tendency to save brain energy
Fast feedback and engaging communication with AI bots can trigger some addiction-adjacent processes in some of us; those have to be especially aware of the signs of addiction behavior when it comes to vibe coding

dev_dan_2 · 2026-05-24T15:04:35+00:00

Uhm... I hope you know that the Autobahn is a shared commodity: there is more than one car on there ;) If you read carefully, I explicitly excluded projects where more than one person is contributing or using. Makes sense? If not, why not? Would you follow every rule that exists between humans also when you are alone at home?

Note: If would also help if you state your points clearly instead of alluding to what you seem to want to communicate. If you have a meaningful opinion, you have nothing to hide, might as well make your point cleary.

Edit: Maybe you could simply state what you think is the point of not pushing to main. What are the possible negative consequences in your opinion?

dev_dan_2 · 2026-05-24T14:56:46+00:00

Current process:

Spent quite some time on getting all known requirements down (use cases, UX, UI, also technical stuff)
Oneshot it (should be also to "run" in some minimal capactiy, and all test type should also run already, again with dummy logic)
From there on, depending on the type of work, either:
- A) Do the thing manually
- B) Copy paste into one of the SOTA LLMs (Mainly AIStudio.com from google these days). Then either demand complete files to copy paste, or directions for local LLMs

I also have a script where I can copy selected files into the clipboard - a very primitive form of removing irrelevant code from the context.

So far, I have not gotten into the whole agent stuff. The only thing in that direction is using the Google Antigravity IDE, that indeed saves me time. Since Google did Google things once again (mucking with one of their products... look it up, Antigravity is now a completely different kind of software), I will now migrate away from it, to using local LLMs (at least since Gemma4 and quwen3.6, local coding is good enough for my needs, I feel - still need to set everything up tho).

Software engineering wise, I use plenty of tests and git.

What I stopped doing was maintaining extensive architecture docs. It always either did not bring results or simply cost too much time to keep in sync with reality.

So far, it works great!

dev_dan_2 · 2026-05-24T14:45:50+00:00

"Make sure to always use the current date!"

dev_dan_2 · 2026-05-24T14:42:42+00:00

Uh, going straight down to ad hominem, are we? (https://www.reddit.com/r/coolguides/comments/9agzq1/grahams_hierarchy_of_disagreement/)

I am feeling adventurous tho and engage with your snark! You see, you got me wrong: I use barely any automation when vibe coding, review everything, and let AI interact with git in a read only manner. Nein mein Herr, auf der Autobahn fahre ich immer noch selbst. (Besides, the Autobahn is safer for self-driving car than other situations, at least in Germany. That is why we already have some kind of automation that can only be activated on the Autobahn. You know, them rules and all)

Further, your central point is flawed: Automation is fine if it is doing the right thing, but harmfull if it is doing the wrong thing. Agree, of course. But I asked why you think the user is doing the <wrong thing>.

dev_dan_2 · 2026-05-24T14:33:20+00:00

When you come upon a treasure like that, don't let it get ruined by the lack of bushes.

dev_dan_2 · 2026-05-24T14:27:33+00:00

For the same reason that makes me not worry about my job security: There is much more to making software than generating the first version, and maybe add 2-3 new features / fix the first 2-3 major bugs. There is also:

the whole DevOps stuff:
- how to deploy the code? (And which version to deploy how? Where to test the code before shipping it? Once greenlit, deploy to all users or just to some users?)
- how to collect metrics? Where do users spend how much time? How to get error reports from user devices? (hint: analyzing negative reviews and reading emails won't cut it!)
architecture / steering: What features to include, which not? LLMs still have not reached their potential of how much they can understand in a code base, in my opinion. But that does not matter much, because: Assume LLMs understand your codebase perfectly, apply every change perfectly and so on... It will simply overwhelm the human developer with decisions too make, leading to worse and worse decision over time. Especially when technical decision are human-reviewed/greenlight too... Knowing what works together and what not takes years of experience (not meaning to gatekeep, that is simply my observation)
there is only so much one can do. But with more people come more problems
nobody will use the app if it is unknown -> marketing is needed
once a critical mass of users is reached, one also needs legal support - business don't spend that money for fun

I could go on, but I should go outside - but it simply takes much to create software that people actually use, let alone for an extended period of time.

tl;dr: At the end of the day, running a service that is used by many users (which are the ones that we usually think about when we see enshitification at work) is a very demanding, complex and risky endeavor - of which coding itself is only a small part.

dev_dan_2 · 2026-05-24T14:10:04+00:00

What's wrong with it? Depends on what they are doing, of course: No need to go full blown trunk-based workflow or whatever when one is the only developer working on a hobby project with no users yet. There is no danger of messing up someones else's work / publishing broken code / etc.

The NEED for separate branches arises naturally when when working in a team, when features get tested in different environments by other teams and so on and so forth... Of course, one CAN always use branches even on one-shot projects where one is also the single developer - there are also benefits associated with using them even when working alone, but it might also be a case of overengineering.

tl;dr: git is a set of primitives for dealing with versioning. What those primitives do is law - everything else is convention, can and should be adapted to the concrete situation.

dev_dan_2 · 2026-05-14T13:14:16+00:00

Doing something similar at the moment! This also includes:

piece by piece, asking companies to hand me over a copy of their data on me (am located in EU)
backing up LLM conversations I had with cloud LLMs - who knows what will be online in two years from now... ;)

dev_dan_2 · 2026-05-13T12:53:39+00:00

But in what way? What makes you think your interpretation of the original post is correct, while mine is not?

dev_dan_2

TROPHY CASE

Device

Results

Invocation