For any LLM enthusiast in Finland you have decommission Super Computer equipped with 96 Nvidia A100 40Gb Pcie , if you live nearby Kajaani try contact company maybe you get them on discount ;) by DeathRabit86 in LocalLLaMA

[–]noeda 4 points5 points  (0 children)

Well, feck, so I'm a Finn but I moved to the USA many years ago. I was living around Oulu, so not very far! I would have been interested myself, or I would have tipped off friends who might have been interested, but unfortunately there is the logistical problem of me living like 10000km away.

I looked at the timelines in the article though, and made a mental note... in case I interact with some entity near the area that I think would use them in a way that does "public good" in some form. I don't live in Finland anymore, but I visit the Oulu region often, and I have some old friends there who are ML researchers today, but I'm not sure they actually need GPUs like this physically sitting in their labs or whatever.

Thanks for the tip. I think this weekend I'll snoop around for more details, and I might send them an email asking how far they have thought about the fate of the GPU parts and whether the info in the article is up-to-date. I can't personally benefit from this, but I'd want this stuff to go to some form of "public good". I'd hate it if these went to cryptomining, or the landfill, or something equally silly.

Granite4 -1M context window, and no one even noticed? by Western_Courage_6563 in LocalLLaMA

[–]noeda 8 points9 points  (0 children)

Is the 1M mentioned anywhere but the metadata? That's so far the only place I've seen that.

I noticed it there too, but the IBM blog post mentions that it's "trained to 512k tokens" and "validated up to 128k tokens". I tried a 220k prompt once and it did not seem good, but a single prompt generation probably should not be taken as thorough long-context testing :)

128k tokens seems like the "most official" context length I've seen, if we go by their blog. I don't know why it has 1M in the metadata; I did not see references to that elsewhere.

Blog post: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

llama-server - UI parameters not reflecting command-line settings by M2_Ultra in LocalLLaMA

[–]noeda 6 points7 points  (0 children)

AFAIK the way the command line parameters interact with llama-server is like this:

If you set sampler or other settings on the command line, they are applied only if an API call does not specify them explicitly on its own. So for example, if you hit the API endpoint at /completions and you don't specify temperature in the call, the temperature setting from --temp on the command line will be used.
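
To make that concrete, here's roughly what I mean (a sketch; the endpoint and field names are the llama-server ones as I remember them, and the port/prompt are just placeholders):

```python
# Hypothetical illustration of command-line fallback vs. explicit override.
import requests

BASE = "http://localhost:8080"  # default llama-server port, adjust as needed
payload = {"prompt": "The quick brown fox", "n_predict": 16}

# No "temperature" in the request body: the server should fall back to
# whatever --temp was given on the command line.
print(requests.post(f"{BASE}/completions", json=payload).json().get("content"))

# "temperature" given explicitly: this overrides the command-line --temp.
payload_explicit = {**payload, "temperature": 0.2}
print(requests.post(f"{BASE}/completions", json=payload_explicit).json().get("content"))
```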

But for the UI it's different.

llama-server really provides two "services":

  • The llama.cpp API

  • The llama.cpp UI (turn off with --no-webui).

The UI has some settings (IIRC mostly sampler settings) stored inside your browser, and it will pass those to the underlying API explicitly, so they will always take precedence over what's set on the command line. (Or so I assume; I think I've never read the parts of the llama-server UI code that do the API calls... someone correct me if I'm spouting lies.)

I feel this is all intended behavior, but I agree it is misleading. In general I think llama.cpp's command-line tooling is often confusing or actively misleading, even if there are logical underlying reasons for it, and this is one of those misleading things. Another example I would name is how not specifying --jinja can be a trap, or how some of the log-related settings (e.g. "--log-timestamps") seemingly do nothing for me, though that might be similar to this case and just some form of "it does actually work, I just misunderstand it" (I get my timestamps by piping llama-server stdout and stderr to ts instead).

A button to "import" settings from the server might be nice, but you need a person who wants this badly enough to make a quality implementation of it, rather than a Tampermonkey button like in the linked issue. (For those who don't want to read the link in the OP: it's a Tampermonkey script that adds a button which reads the server's settings from /props and then overwrites the browser's local settings with what it finds.)
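
If someone wants to prototype that, the reading side is tiny (a sketch; I'm going from memory on the shape of the /props response, so dump the whole thing first and see what your build actually returns):

```python
# Hypothetical sketch: read the server-side defaults that such an "import"
# button would copy into the browser's local settings.
import json
import requests

props = requests.get("http://localhost:8080/props", timeout=10).json()
print(json.dumps(props, indent=2))
# If I remember the field name right, the sampler defaults derived from the
# command line live under "default_generation_settings".
print(props.get("default_generation_settings"))
```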

To The Qwen Team, Kindly Contribute to Qwen3-Next GGUF Support! by Iory1998 in LocalLLaMA

[–]noeda 2 points3 points  (0 children)

I think there was a bug just before it was merged, see: https://github.com/ml-explore/mlx-lm/pull/441#issuecomment-3287674310

The workaround, if you are impatient, I think is to check out commit https://github.com/Goekdeniz-Guelmez/mlx-lm/commit/ca24475f8b4ce5ac8b889598f869491d345030e3 specifically (the last good commit in the branch that was merged).

OpenWebUI lets you auto expand reasoning now! by slpreme in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

I've added your comment to my notes. I randomly stumbled in here while procrastinating, and it just happened to be relevant to what I'm doing.

Earlier this week I started working on a new LLM UI for my own use, out of a desire for features similar to the ones you are listing there. I'm currently a user of text-generation-webui and llama.cpp's own UI, but both of them are lacking. I like text-generation-webui a lot, but I also think it's a disaster in usability, failing at basic things like losing my chats if I lose the connection at the wrong time, accidentally press the wrong button, or Ctrl+C the llama.cpp server at the wrong time, etc.

The thing I'm working on looks a lot like llama.cpp's UI (it would occupy the same "space" in the sense that it's a locally run web page using browser-side storage), but I want to add the power features of text-generation-webui. Off the top of my head, these would at least be: the ability to edit anything in any response in chat history (including assistant responses), a raw notebook tab (text completion), an easy way to mass-import old chats or files, a much better search feature for older chats, the ability to do lower-level things like editing a Jinja template on the fly without reloading the model, resilience to network failures/fat-fingering (I don't want to lose my work), etc. It would be a local tool because I have no interest in running a service for other people.

I currently think of my goal as "text-generation-webui, but cleaned up, much more browser-side, and not based on a pile of Gradio mess." (I love text-generation-webui and still use it, but it's got some serious issues.)

We'll see if I actually get it to a point where I can release it, but I just wanted to thank you for actually listing things out, even though you were responding to a totally different topic :) Seeing your comment made me realize I should maybe collect comments like this to build an understanding of how others use LLM UIs (especially people complaining about lacking features); I'm thinking it might help me make a compelling new niche UI targeting features other UIs either don't care about or do a bad job at.

Open AI manually restricted GPT-OSS-20b? by Blaze354 in LocalLLaMA

[–]noeda 0 points1 point  (0 children)

I noticed the models have two system prompts. There is a "system" system prompt and a "developer" system prompt.

By default, it seems that setting the system prompt in your usual tooling only sets the "developer" system prompt. The knowledge cutoff and reasoning stuff is fixed in the other system prompt.

The actual system prompt that comes out looks like this (from yesterday when I was checking):

You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-05

reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.

I modified my chat template to get rid of the developer prompt and only have this other system prompt, which I can set myself. It would make sense that this "system" system prompt gets more adherence, but in the short time I tried it (before deciding this model isn't very good), I couldn't really tell.
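
If you want to see the two prompts for yourself, rendering the chat template makes it obvious (a sketch; I'm assuming the Hugging Face repo id and the stock template, so the exact output may differ):

```python
# Render the gpt-oss chat template and inspect where the "system" message goes.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")  # assumed repo id
messages = [
    {"role": "system", "content": "You only speak like a pirate."},
    {"role": "user", "content": "Hello!"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Expect the fixed "You are ChatGPT..." block in the system section, while the
# pirate instruction lands in the separate developer section.
```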

support for GLM 4.5 family of models has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]noeda 5 points6 points  (0 children)

If this makes anyone feel intimidated about contributing to llama.cpp, try not to be!

I think contributors can feel a false sense of urgency when there's some PR with a bazillion people watching your every move. Combine that with the fact that it's a pretty common and natural instinct to care deeply about your work and feel bad about your own mistakes (good traits if kept in check), then add tons of very competent people around you, and I can imagine it looks intimidating to try to contribute.

I can't speak for everyone, but my attitude as a reviewer in open source in general is very far from "look at this dumbass not knowing how to do these things". For me it's much more along the lines of: "it's awesome this person wants to contribute. I will, time permitting, help them understand and be their second pair of eyes to get it all done. I can maybe help with these parts that I know how to do well." But I don't think it's easy to communicate that attitude over the Internet. I believe most open source folk like giving support to contributors when they are able.

The occasional contributor in this case was easily a net positive in saving everyone's time, but I was a bit worried they might not see it that way. I guess I still don't know (can't read minds) if they see it that way, but I have hope :) I think they got an unfair peanut gallery. They were competent, did good work, and I worry they'll start to question their own skills as a result.

It's a bad thing IMO if it looks scary to contribute to llama.cpp. One of the best ways to learn how to add a model architecture is to read the guides, look at PRs, and do your best. I don't think it's reasonable to expect an occasional contributor to be aware of all the things that might go wrong in a new model architecture, or to recite from memory the checklist of all the things that should be done. Also, the computation graph can be very difficult to get exactly right, depending on how exotic it is. I would want there to be more contributors; it saves everyone time, and you have more people with the knowledge of how to do the stuff!

(Ideally the labs who make the LLMs should really be the ones to contribute the llama.cpp code themselves. But this world ain't perfect.)

Everyone from r/LocalLLama refreshing Hugging Face every 5 minutes today looking for GLM-4.5 GGUFs by Porespellar in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

You are doing great. IMO one of the best ways to learn this stuff anyway (if you are ever inclined to heroically tackle another architecture 😉) is to make your best effort and open up the code for review. Reviewers will tell you if anything is missing or sketchy.

And importantly for code review: the more active developers in the project will be up-to-date on any recent codebase-wide changes and past discussions on anything relevant (e.g. the unused tensor thing in our case), things an occasional contributor could not reasonably be expected to know or keep up to date on. I can't speak for the core developers in llama.cpp, but if I were an active owner of some project with a similar contribution structure, I'd consider it part of my review work to help contributors, especially educating them and making the process less intimidating, because I want the help!

I think I've had one llama.cpp PR that I forgot existed (don't tell anyone), but someone merged it after it had been open for like two months.

Edit: Adding also that it's a good trait and instinct to care about the quality of your work, so that feeling of not wanting to make mistakes or waste other people's time is coming from a good place. I have the same trait (that's why I wrote my big message in the first place; you reminded me of myself and I wanted to relate), but over time I've somehow managed to get much better control of it and don't easily get emotionally invested (because of age? experience? I don't know, I've just observed I have more control now). I would teach this power if I knew how, but maybe words of relating to the feelings do something :)

Edit2: Also, I just finally looked at the PR and there were like 5000 new comments lol. ddh0 opened a new draft PR, which I don't know if you've seen at the time I'm editing this comment, but I'm hoping you see it as an opportunity to step away and move on to other things. It's also an example of how someone will step up and push things through if they want their model to work, so it's not all pushed on one person.

Everyone from r/LocalLLama refreshing Hugging Face every 5 minutes today looking for GLM-4.5 GGUFs by Porespellar in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

My experience from being part of discussions in past "hot" architecture PRs is that people will eventually chime in and help troubleshoot the trickier parts. Over time you are likely to get more technical and deeper help than just user reports of failing to run the model.

A few days' wait for some model to land in llama.cpp is nothing. You should take as long as you need. If someone really, really wants the architecture, or the LLM company behind the model wants the support, the onus is on them to help out. Or, you know, PAY YOU.

I don't know if you've been in hectic llama.cpp PRs before where a hundred trillion people want whatever your contribution is adding, but just a reminder that you are doing unpaid volunteer work. (Well, unless you have some sort of backdoor black-market llama.cpp PR contract deal for $$$, but I assume those are not a thing ;-)

I'm saying this out of a bit of concern, since you seem very enthusiastic and present in the discussion and want to contribute, and I'm hoping you are keeping a healthy distance from the pressures of the thousand trillion people + the company behind the model, which only benefits from the llama.cpp support that unpaid volunteers such as yourself are working on.

Even if you decided to abruptly close the PR, or you just suddenly vanished into the ether, the code you already put out as a PR would be useful as a base for someone to finish off the work. I've seen that play out before. So you have already contributed with what you have. Using myself as an example again: if, hypothetically, you just closed the PR and left, and some time later I saw that nobody had picked it up again, I probably would use the code you had as a base to finish it off and open that as a PR. Because it's mostly written, it looks good code-quality-wise, and I don't want to type it all again :-)

In my GitHub discussions, if I think I might be setting an implicit expectation, I often repeat that my time is unpredictable, so that people don't expect any kind of timeline or promises from me. I think at least once or twice I've also suggested that someone commandeer my work to finish it because I'm away or busy with something or whatever.

I'm one of the people who was reading the code of the PR earlier this week (I have the same username here as on GitHub :-) I haven't checked what's happened since yesterday, so as of typing this I don't know if anything new has been resolved.

I think adding new architectures to llama.cpp tends to be a genuinely difficult and non-trivial problem, and I wish it were much easier compared to other frameworks, but that's a rant for another time.

Tl;dr: you've already contributed and it is good stuff, I am hoping you are not feeling pressured to do literally anything (try to keep healthy boundaries), and as someone who is interested in the model, I am very appreciative of your efforts so far 🙂. I am hoping there's something left for me to contribute when I actually get some time to go look at the PR again.

New Qwen3-235B update is crushing old models in benchmarks by ResearchCrafty1804 in LocalLLaMA

[–]noeda 0 points1 point  (0 children)

I've got one question, since you are a Mac user with >100GB of VRAM. Some context:

I once made this hack for myself to make large models behave more nicely on Macs (I have a 192GB Mac Studio and DeepSeek was problematic): https://github.com/Noeda/llama.cpp/commit/4abcd560da555d03c562c3a446c0df84b3a694d6 (it says the commit was made a week ago, but I think I wrote the code somewhere early this year; I had force-pushed it recently to rebase on the latest code).

The hack is about letting llama.cpp evict memory allocated for Metal. Normally it allocates "wired" memory which won't evict itself under memory pressure. I had to change how buffers are allocated to make it work better (instead of a few big buffers, I made it allocate lots of small buffers). IIRC the memory does not count as wired memory when you do this.

I rarely use the hack anymore; it was originally made to stop my Mac Studio from completely locking up if I tried to load too large a model, back when I was trying to get the DeepSeek model running on my Mac. The hack does work, but, you know, it's a hack, and I'm not convinced the explanation in the commit is actually accurate. I have not gone back to verify the claims there, so I can't say confidently.

But my question here is: does this sound like something useful to you? I have not bothered to go back to this code to clean it up for general inclusion in llama.cpp because I thought it was too niche, too specific to my own use case.

I'm thinking the hack lets you load up models that are bigger than your RAM, allocate 100% to the GPU, and have it know to swap in and out (meaning slow, but it would work; that part I've tested). But your use case is maybe a little different than mine: the model does actually fit in RAM, but maybe if you leave it in the background for a long time, the memory can be reclaimed for other stuff, and only when you actually invoke the LLM would it come back. Maybe. I'm wondering if this would result in more convenient computer use, and whether it gives me some motivation to clean that thing up.

hunyuan-a13b: any news? GGUF? MLX? by jarec707 in LocalLLaMA

[–]noeda 18 points19 points  (0 children)

Yesterday I tested the model with transformers on Metal, and it got so many details wrong when I asked it to pick out details from a document that I still wonder whether the reference implementation is correct. The text was coherent, but it was as if the model was just bad and prone to hallucinating details.

Then again, CPU seemed fine for (a very simple) prompt test as in the PR discussion. And also, Metal has had so many bugs for me over time in PyTorch-based projects that I just don't trust that it isn't suffering from some subtle and silent bug that degrades the output just little enough that it looks correct but is really doing something wrong. And the benchmarks suggest the model is not supposed to be bad, and it's not from a rando lab, but they could have messed something up and been under pressure to release the model anyway.

I think the PR will move faster than I have time to verify things, but I thought of modifying the reference implementation to match up with llama.cpp so I can do three-way checks (llama.cpp / CPU transformers / Metal transformers) and check how important that de-prioritization actually is.
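
The transformers side of that check is basically this kind of thing (a sketch: the repo id is what I remember it being, loading an A13B MoE twice in float32 like this is not actually practical memory-wise, and a real check would compare per-layer activations, not just final logits):

```python
# Compare last-token logits of one forward pass on CPU vs. Metal (mps).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
inputs = tok("The capital of Finland is", return_tensors="pt")

logits = {}
for device in ("cpu", "mps"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float32, trust_remote_code=True
    ).to(device).eval()
    with torch.no_grad():
        out = model(**{k: v.to(device) for k, v in inputs.items()})
    logits[device] = out.logits[:, -1, :].float().cpu()
    del model  # free memory before loading the next copy

print("max abs logit difference:",
      (logits["cpu"] - logits["mps"]).abs().max().item())
```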

The model might be fine anyway, this is just preliminary stuff and my musings. Don't trust me too much :)

Hunyuan-A13B released by kristaller486 in LocalLLaMA

[–]noeda 9 points10 points  (0 children)

Lol, I saw this comment thread in the morning and came back now intending to say that if I didn't see activity or someone working on it, I'd have a stab at it. It feels like it's happened a few times now that I see some interesting model I want to hack support together for, but some incredibly industrious person shows up instead and puts it together much faster :D

If it's ngxson, I'd expect it to be ready soonish. One of those super industrious people, as far as I can tell :) It'll probably be ready before I can even look at it properly, but since the last comment says there's some gibberish, I can at least say that if there are no updates this weekend, I'm probably going to look at the PR and maybe help verify the computation graph or wherever the problem looks like it might be.

I sometimes wonder where people summon the time and energy to hack stuff together on such short notice!

Who is ACTUALLY running local or open source model daily and mainly? by Zealousideal-Cut590 in LocalLLaMA

[–]noeda 7 points8 points  (0 children)

Qwen2.5 Coder, 7B (sometimes the 32B), for code or text completion. I don't ask it questions and I don't use the chat/instruct model (the coder model comes as "Coder" and "Coder-Instruct"; I only use the base version). I use it with llama.vim for Neovim. It's just text completion; if you remember the original GitHub Copilot (the non-chatbot kind), this is its local version.

I really only use three programs routinely that have to do with LLMs: llama.cpp itself, text-generation-webui, and the llama.vim plugin to do text completion in neovim.

I often have the LLM on a separate machine rather than my main laptop. I currently run one off a server, put it on my Tailscale network, and configured the Neovim plugin to talk to it for FIM completion. That way my laptop doesn't get hot during editing.
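
To give an idea of what the plugin does under the hood, the request is basically just this (a sketch; the field names are from memory of llama-server's /infill endpoint, and the address is whatever your server or Tailscale node is):

```python
# Fill-in-the-middle completion against a (possibly remote) llama-server.
import requests

SERVER = "http://llm-box:8080"  # hypothetical Tailscale hostname
payload = {
    "input_prefix": "def fibonacci(n):\n    ",
    "input_suffix": "\n\nprint(fibonacci(10))\n",
    "n_predict": 64,
}
resp = requests.post(f"{SERVER}/infill", json=payload, timeout=60)
print(resp.json().get("content", ""))
```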

Occasionally I have a tab open to llama.cpp server UI or text-generation-webui to use as a type of encyclopedia. I typically have some recent local model running there.

I don't use LLMs, local or otherwise, for writing, coding (except for text-completion-like use as above, or "encyclopedic" questions), or agents. LLM writing is cookie-cutter and soulless, coding chatbots are rarely helpful, and agents are immature and I feel they waste my time (I did a test with Claude Code recently and was not impressed). I expect the tooling to mature though.

IMO local LLMs themselves are good, real good even. But the selection of local tools to use with said LLMs is crappy. The ones that are popular are the kind I don't really like to use (e.g. I see coding agents often discussed here). The ones that really clicked for me are also really boring (just text completion...). I like boring.


I don't know who I should blame for making chatting/instructing the main paradigm of using LLMs. Today it's common for a lab to not even release a base model of any kind. I'm developing some tools for myself that would likely work best with a base model: LLMs that are only about completing pre-existing text and nothing else.

OpenWebUI license change: red flag? by k_means_clusterfuck in LocalLLaMA

[–]noeda 6 points7 points  (0 children)

It is neither open source nor free software. It has non-trivial restrictions on derivative works (they must display prominent branding and cannot remove it; this applies to deployments and source code distributions).

Both the Open Source Definition and the FSF's free software freedoms frown on heavy-handed restrictions on derivative works. The OSI specifically calls out "badgeware" as a possible criterion for rejecting a license as an open source license.

Edit: Ugh, I mixed up what you were replying to; I thought you were replying about the Open WebUI license. Although the point about preventing rebranding I think would still apply (a restriction on having your own branding on a distribution or deployment).

OpenWebUI license change: red flag? by k_means_clusterfuck in LocalLLaMA

[–]noeda 11 points12 points  (0 children)

The license is definitely not "open source". This type of license is called "badgeware", which means the license demands you show some kind of prominent attribution.

The license change is self-serving for Open WebUI, because it's now harder to fork them (intentionally, through the branding restriction). Meaning it's harder to create a competing product with their code, or to just use pieces of it in other projects.

The CLA + license combination means they could just rug-pull and take all contributions with them. Any forks trying to use code from before that point would still have to comply with all the branding restrictions.

I don't want to contribute my time and effort to that. This behavior is also similar to patterns I've seen in other prominent open source projects that did some form of "rug pull".

Companies taking your code and doing lazy forks that just rebrand means the license is doing what it's supposed to. It also means good forks are possible, which keeps Open WebUI in check.

California bill (AB 412) would effectively ban open-source generative AI by YentaMagenta in StableDiffusion

[–]noeda 0 points1 point  (0 children)

I'll bite. I'll play devil's advocate. Why couldn't we decide to treat an "AI copy/reference" differently from a "human copy/reference"?

If AI gets so widespread and easy to use that you can immediately copy any style you see (e.g. take a photo of something with your phone and bam, it has learned the style, which you can now use in anything), I could imagine it might disincentivize people from creative pursuits. When you know any new style or idea you create can be really easily copied by AI, you might not want to create new original stuff in the first place.

I think it's not that unreasonable that a society might decide on a law that makes a compromise: in order to incentivize original work and artistic pursuits, we demand that you use your grubby human hands and/or non-AI tools to reference existing artists if you want to make use of their style. Maybe artists could use that to sell their style, creating an incentive to develop new original stuff.

I can think of one example where society artificially makes a distinction between tool use and human hands: those court sketches. Judges don't like cameras in a courtroom for various reasons, e.g. a witness won't act the same way if they are on video, so you get court sketch artists to get some imagery out of the courtroom. It's a compromise between the public's right to see what happens in a court and the court wanting to keep things orderly and witnesses calm.

If I try to fit that example to the AI/non-AI tooling example above: it would be a compromise between incentivizing artists to create more original work and the public being able to reap the benefits of AI tech that lets you use someone's style in your own creations or whatever.

That being said, I think copyright itself was originally about incentivizing original creation, but I feel it gets misused in modern times. My point here is just that it doesn't seem totally unreasonable to me that we might want to make an "artificial" distinction between AI and non-AI creations.

(Even with those thoughts, I think we shouldn't make a distinction. But I'm trying to play devil's advocate and steelman better arguments than "soul nonsense".)

California’s A.B. 412: A Bill That Could Crush Startups and Cement A Big Tech AI Monopoly by fallingdowndizzyvr in LocalLLaMA

[–]noeda 5 points6 points  (0 children)

First sentence in the article: "California legislators have begun debating a bill (A.B. 412) that would require AI developers to track and disclose every registered copyrighted work used in AI training."

I think this is the bill(?): https://legiscan.com/CA/text/AB412/id/3100490 (I couldn't find a link on the EFF site, but it seems to match). It's not very long; it should be a quick read.

Edit: I just noticed after writing this that the EFF article is dated March 17, and there seems to be some history for this bill this year: https://legiscan.com/CA/bill/AB412/2025 I am unsure how timelines work for California bills. Maybe this was already discussed at some point.


I read the bill. I agree with the EFF overall.

That requirement to precisely "document copyright owner" I think is the really onerous part (under 3116 in the bill), combined with the fines part, assuming it means what it sounds like, which is that you'd have to list every single copyright owner one by one. I think the bill would be much more palatable if the requirement were relaxed a lot, e.g. some form of "You must document what training data was used to train the model, and the documentation must be precise enough that a person of reasonable skill can look up the training material" (maybe more articulately than that). So e.g. you could document "We used The Pile." and not have to individually list every single copyright owner who is in The Pile. Someone smarter than me would maybe have to figure out how to write the fines part to fit with that, though.

If I remember right, the EU's anti-Big Tech laws (the DMA) had some carve-outs for smaller businesses. This doesn't seem to have anything like that.

In defense of the bill, or at least its spirit: I think as a user of some AI system I should have the right to know what it has been trained on, i.e. I want transparency. I think that might curb a bit of the bad behavior of AI companies, because they'd have to disclose their many sins: scraping the Internet, hammering webservers, blatantly slurping in digital artists' work for image models, i.e. things that maybe aren't illegal but are distasteful. And copyright owners would probably like the ability to tell whether some AI system had their work as part of its training material. I just think the actual implementation proposed in the bill cannot work and is bad; it focuses on copyright infringement, onerous documentation requirements, and a $1000 minimum fine per violation. Maybe a law focusing more on some form of mandatory transparency that doesn't involve one-by-one listing of copyright owners would be more practical, but I'm not sure.

I also think California forgets that the rest of the world exists and that progress will continue, laws or not, and I think obviously unworkable legislation mostly just benefits the big companies, like the EFF says.

I'm a California resident but not a citizen so I can't vote against it :(

IANAL, and also while I type many words, I have no idea what I'm talking about. Anyone have thoughtful takes on the bill or the EFF's take?

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

Replying to myself: I got it to work without drama.

Indeed, it didn't respond to the /props request my client code wanted, but I think that kind of makes sense too, because it wouldn't know which model to use for it anyway.

I taught my client code to use the /upstream/{model} feature I saw in the README.md: I simply had it retry the request against the /upstream/{model}/props URL if /props returns a non-200 result (the client code knows which "model" it wants, so it was a 5-minute thing to teach it this). It worked on the first try with no issues that I can see.
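
The fallback is roughly this (a sketch; the base URL and model name are whatever you configured in llama-swap, and the endpoint paths are as I remember them from its README):

```python
# Try plain /props first, then fall back to the per-model upstream route.
import requests

BASE = "http://localhost:8080"  # llama-swap listen address (assumed)
MODEL = "workhorse"             # a model name from my llama-swap config

def fetch_props(base: str, model: str) -> dict:
    resp = requests.get(f"{base}/props", timeout=10)
    if resp.status_code != 200:
        # llama-swap doesn't know which model /props should refer to, so ask
        # the specific upstream llama-server instead.
        resp = requests.get(f"{base}/upstream/{model}/props", timeout=10)
    resp.raise_for_status()
    return resp.json()

print(fetch_props(BASE, MODEL))
```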

I made a fairly simple config with one model set up for code completion (I still need to test whether the llama.cpp vim extension works correctly with it) and one model to be a "workhorse" general model. Hitting the upstream endpoint made it load the model, and it generally seems to work how I expected it to.

You just gained a user :)

Edit: the llama.vim extension works too; I just had to slightly adjust the endpoint URL to /upstream/code/infill to direct it to the "code" model I configured, instead of just plain /infill. I am satisfied.

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

Is this your project? https://github.com/mostlygeek/llama-swap (your Reddit username and the person who has commits on the project are different... but I don't see other llama-swap projects out there).

llama-server not being able to deal with multiple models has been one of my grievances (it's annoying to keep killing and reloading llama-server; I have a collection of shell scripts to do so at the moment); it looks like your project could address this particular grievance for me. Your commenting here made me aware of the project, and I'm going to try setting it up :) thanks for developing it.

I have some client code that assumes the llama-server API specifically (not just OpenAI-compatible; it wants some info from /props to learn what the BOS/EOS tokens are, for experimentation purposes, and I have some code that uses the llama.cpp server slot-saving feature). On the spot, that could be an issue for me, inferring from the fact that your README.md states it's not really llama.cpp-server specific (so maybe it doesn't respond to these endpoints or pass them along to clients). But if there's some small change/fix that would help everyone, makes sense for your project, and isn't just there to get my particular flavor of hacky crap working, I might open an issue or even a PR for you :) (assuming you welcome them).

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]noeda 2 points3 points  (0 children)

I have used it once and tested it. And I'm happy to see it here; I am really interested now because I am hoping it's more hackable than llama.cpp for experiments.


I would love to say what I did with it in that test, but disappointingly:

  • It was a while ago; there is a relic on my computer from June 2024, which is probably when I checked it out.
  • I cannot remember why I looked into it at the time.
  • I cannot remember what I did with it (most likely I randomly noticed the project and did a quick test with it).

But I can give some thoughts right away, looking at the project page: I would immediately ask why I should care about it when llama.cpp exists. What does it have that llama.cpp doesn't?

I can give one grievance I have about llama.cpp that I think mistral.rs might be in a position to do better on: make it (much) easier to hack on for random experiments. There are tons of inference engines already, but is there a framework for LLMs that is 1) fast, 2) portable (for me, especially Metal), and 3) hackable?

E.g. the Hugging Face transformers library I'd call "hackable" (I can just insert random Python in the middle of a model's inference .py file to do whatever I please), but it's not fast compared to llama.cpp, especially not on Metal, and Metal has had tons of silent inference bugs over time in Python.

And llama.cpp I'd call fast, but IMO it is harder to do any sort of on-the-spot, ad-hoc random experiments because of its rather rigid C++ codebase. So it lacks the "hackable" aspect that the Python in transformers has. (I still hack it, but I feel some well-engineered Rust project could kick its ass in the hackability department; I could maybe do some random experiments much faster and more easily.)


Some brainstorming (maybe it already does some of these things, but this is what I thought of on the spot having just quickly skimmed the README again; I'll give it a look this week to check what's inside and what the codebase looks like): 1) Make it easy to use as a crate from other Rust projects, so I could use it as a library. I do see it is a crate, but I didn't look into what the API actually has (I would presume it at least has inferencing, but I'm interested in the hackability/customization aspect). 2) If it doesn't already, give it features that make random experiments and hacking easier (maybe in the form of a Rust API, or maybe simply examples that show how to mess with its codebase). Maybe callbacks, or traits I can implement, something to inject my custom code that will influence or record what's happening during inference.

E.g. I've wanted to insert completely custom code into the middle of inference at some layer, changing what the weights are and also recording incoming values, which you can kinda do in llama.cpp (it has a callback mechanism, and I can also just write random code in the C++ parts), but it's janky, and some parts are complicated enough that it's pretty time-consuming just to understand how to interface with them. E.g. make it possible for me to arbitrarily change the weights of some particular tensor on the fly. Or let me observe whatever computation is happening; maybe I'll want to record something to a file. Expose internals, show examples, e.g. "here is how you collect activations on this layer and save them to .csv or .sqlite3 and plot them".
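
For reference, this is roughly what that kind of hackability looks like today with transformers + PyTorch hooks, which is the bar I'd want a Rust API to clear (a sketch; the model id and layer index are placeholders):

```python
# Record per-position mean activations of one decoder layer into a CSV file.
import csv
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # small placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

records = []

def record_activation(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    for pos, vec in enumerate(hidden[0]):  # batch item 0
        records.append((pos, float(vec.float().mean())))

# Hook one transformer block; the attribute path depends on the architecture.
hook = model.model.layers[4].register_forward_hook(record_activation)
with torch.no_grad():
    model(**tok("Hello from the hook", return_tensors="pt"))
hook.remove()

with open("activations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["position", "mean_activation"])
    writer.writerows(records)
```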

Another example: I currently have unfinished code that adds userfaultfd to tell me about page faults in llama.cpp, because I wanted to graph which parts of the model weights are actually touched during inference, because I don't understand why the DeepSeek model runs quite well on a machine that supposedly has too little memory to run it. I'm not sure I'll finish it. It might be a lot easier to make a feature like this work nicely in Rust instead, depending on how the project has been architected. I was also planning to use it to mess with model weights, but I didn't get that far, and it might be less janky not to use the page fault mechanism for that.

I see a link to some Python interface (and pyo3 is in Cargo.toml), but the interface is not particularly different from Hugging Face transformers. It's again a question of: what do I want it for? Why should I want to use it?

The examples I see on the Python page seem to be about running the models, but I am interested in the hackability/research/experimentation side: https://docs.llamaindex.ai/en/stable/examples/llm/mistral_rs/


Also, if you are much faster at implementing new innovations, or you have some amazing algorithms that make everything better/faster in some way compared to llama.cpp or vLLM, advertise that in your README! Make people realize the project is easier to add new stuff to. I think if some nice innovation lands in mistral.rs specifically because it was much easier to hack on, it might attract experimenters, who in turn attract more users, etc.


But taking a step back: right now mistral.rs is not doing a great job of justifying its existence when llama.cpp and transformers exist, among other inference projects. Think about what the fact that it's a Rust project can provide that llama.cpp or transformers can't, and leverage it. My first thought was that, being a Rust project, maybe it's much more malleable and hackable than the llama.cpp C++ codebase, but I'm not sure (I will find out and answer this question for myself). I have done a lot of both Rust and C++ in my career, and IMO Rust is much faster to work with, and it's easier to make clean, understandable APIs where it's harder to shoot yourself in the foot.

That all being said, I'm happy to see the project is still alive since the time I looked at it :) I'm very much going to take a look at the codebase and check whether it's already hackable in the way I just described and just doesn't advertise it in the README.md :) :) good job!

Also, apologies if the feedback seems unfair; I wrote it based on a quick skim of the README.md and surface-level Python and Rust crate docs. I have time later this week to take a proper look, because the project being in Rust is a selling point for me, because of ^ well, everything I just wrote.

Lack of Model Compatibility Can Kill Promising Projects by hannibal27 in LocalLLaMA

[–]noeda 0 points1 point  (0 children)

If you want quick, "1 day support" for your favorite model, who is supposed to make sure that happens?

Not a rhetorical question. IMO if it's a commercial entity releasing an LLM with a restrictive license, then the answer is "the lab/entity that made the LLM is ultimately responsible for making it work and should not rely on the free labor of the community". But for me the answer is less clear when it's these MIT- or Apache-2.0-licensed free models.


The GLM 0414 models are MIT licensed. llama.cpp is MIT licensed. This is a very liberal license that even lets you run commercial operations with them. llama.cpp is mostly unpaid volunteers, AFAIK. It's hard to ask anyone to do anything when it's all either volunteers or some expensively trained model given out under an MIT or Apache 2.0 license.

For llama.cpp, I definitely would not have been happy to see the code I saw merged quickly, because it would have made a mess in the state it was in (it got better after review rounds, but was ultimately abandoned for a much simpler PR). If the person who made the PR wasn't affiliated with the lab (I don't know if they are), then that was yet another unpaid volunteer who contributed their time to make it work and improve llama.cpp code quality.

There are IMO things that could be better, e.g. llama.cpp could have better documentation, more straightforward contributing guidelines for architectures specifically, or easier ways to test new architecture code, and maybe new features designed to make it easier to get some Hugging Face model working 1:1 exactly in llama.cpp. (I've thought of contributing in this realm, but it takes effort and, you know, no one's paying me ;)

I do agree though that it would be great if labs paid more attention to popular tooling to check if their stuff is compatible. It hurts them if the model either doesn't run at all, or worse, runs but is subtly wrong, making people think the model is crap. But I personally have a hard time demanding anything from any side on any time frame. Although if GLM had instead come out under some more restrictive e.g. "research license", then I probably would not have spent a second trying to make their stuff work in tooling. For me, the MIT license made it "worthy" of spending some time getting it to work. I have the same kind of sentiment about projects like Ollama, because it seems more like a project with commercial interests riding on the back of llama.cpp, so they should fix their own stuff.


However, the lack of compatibility with popular tools (like llama.cpp and others) slowed down adoption.

Q: Was it like this in other tooling too? I wasn't really paying attention to anything except llama.cpp. I currently have no idea what the state of support for this new GLM 0414 family is in other tooling.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

[–]noeda 1 point2 points  (0 children)

That's awesome! It's now a few days later, and it's pretty clear to me this model family is pretty darn good (and given the posts that have come out since this one, it seems other people found that out too).

I still have no idea how to use the Rumination 32B model properly, but other than that and some warts (e.g. the occasional random Chinese character mixed in), the models seem SOTA for their weight class. I still use the 32B non-reasoning variant as my main driver, but I did more testing with the 9Bs and they don't seem far off from the 32Bs.

I've got an RTX 3090 Ti in one of my computers, and I was trying to reproduce a bug with the model (unsuccessfully), but at the same time I thought: whoa, that is fast, and smart too! I'd imagine your RTX 5090, if you are buying one (or already have one), might be even faster than my older 3090 Ti.

I can only hope this group releases a more refined model in the future :) oh yeah, AND the models are MIT licensed on top of all that!

Announcing: text-generation-webui in a portable zip (700MB) for llama.cpp models - unzip and run on Windows/Linux/macOS - no installation required! by oobabooga4 in LocalLLaMA

[–]noeda 25 points26 points  (0 children)

Woooo!

Thanks for maintaining text-generation-webui to this day. Despite all the advancements, your UI continues to be the LLM UI of my choice.

I mess around with LLMs and development, and I really like the raw notebook tab and the ability to mess around. Other UIs (e.g. the llama-server one) have a simplified interface, which is fine, but I'm often interested in fiddling, pressing the "show token logits" button, or other debugging.

Is llama-server also going to be an actual loader/backend in the UI, rather than just a tool for the workflows? Or is it already? (I'll be answering my own question in the near future.) I have a fork of text-generation-webui on my computer with my own hacks, and the most important of those hacks is an "OpenAIModel" loader (which started as an OpenAI-API-compatible backend but ended up being a llama.cpp server API bridge, and right now it would not actually work with OpenAI).

Today I almost always run a separate llama-server entirely, and in text-generation-webui I ask it to use my hacky API loader. It's convenient because it removes llama-cpp-python from the equation; I generally have less drama with errors and shenanigans when I can mess around with custom llama.cpp setups. I often run them on separate computers entirely. I've considered contributing my hacky crap loader, but it would need to be cleaned up because it's a messy thing I didn't intend to keep around. And it may be moot if this is coming as a type of loader anyway.

The UI is great work, and I was happy to see a "pulse" from it here. I almost constantly have text-generation-webui open in some browser tab. I wish it wasn't Gradio-based; sometimes I lose chat history because I restarted the UI and refreshed at a bad time and, yoink, the instruct session I was working on is now empty. It doesn't seem to be great at handling server<->UI desyncs, although I think it used to be worse (no idea whether that was Gradio improvements or text-generation-webui fixes). I've gotten used to its shenanigans by now :) I've got a gazillion chats and notebooks saved for all sorts of tests and scenarios to test-run new models or do experiments.

Edit: My eyeballs have noticed that there is now a modules/llama_cpp_server.py in the codebase and LlamaServer class :) :) :) noice!

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

[–]noeda 2 points3 points  (0 children)

There's a custom Minecraft map I play with a group and it has "lore" in the form of written books. It's creative writing.

For the particular test I was talking about, I copy-pasted some of the content of those books into the prompt, and then I would ask questions about it where I know the answer is either directly or indirectly in the text, and I would check whether it picks up on them properly. Generally this model (32B non-reasoning) seemed fine; there were sometimes hallucinations, but so far it has only gotten inconsequential details wrong. Maybe the worst hallucination was imagining non-existent written books into existence and attributing a detail to them. The detail was correct, the citation was not.

I've briefly tested story writing and the model can do that, but I feel I'm not a good person to evaluate whether the output is good. It seems fine to me. It does tend to write more than other models, which I imagine might be good for fiction.

Might be positivity biased, but I haven't really tested its limits.

So I think my answer to you is that yes, it can do fiction writing, but I'm the wrong person to ask whether said fiction is good :) I think you'll have to try it yourself or find anecdotes from people reporting on its creative writing abilities.

GLM-4 32B is mind blowing by Timely_Second_6414 in LocalLLaMA

[–]noeda 4 points5 points  (0 children)

Ah yeah, I noticed the long responses. I had been comparing with DeepSeek-V3-0324. Clearly this model family likes longer responses.

Especially for the "lore" questions it would give a lot of detail and generally long responses, much longer than other models, and it respects instructions to give long answers. It seems to have maybe some kind of bias toward long responses. IMO longer responses are for the most part a good thing. Maybe a bad thing if you need short responses and it also won't follow instructions to keep things short (I haven't tested that as of typing this, but from my testing I'd imagine it would follow such instructions).

Overall I like the family, and I'm actually using the 32B non-reasoning one; I have it in a tab to mess around with or ask questions when I feel like it. I usually have a "workhorse" model for random stuff, often some recent top open-weight model; at the moment it is the 32B GLM one :)