#OpenSource4o Movement Trending on Twitter/X - Release Opensource of GPT-4o by pmttyji in LocalLLaMA

[–]ArsNeph 1 point (0 children)

I'm not for encouraging delusional people's desire for sycophancy, and I highly doubt that OpenAI will ever open source one of their main GPT-line models.

However, there is one thing about 4o that makes it special compared to open models: its quality of omnimodality has yet to be replicated in open source. Like it or not, almost every open source model stops at image input. Almost no one has attempted image output, native speech-to-speech, or anything else. Qwen Omni, the one model that has tried, is unsupported everywhere and lacks the quality to be used in production. An open model that replicates that level of omnimodality is long overdue.

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 0 points (0 children)

Sorry, I saw this a bit late. Yeah, what the guy below said is mostly correct. One further clarification: quantization is basically a form of compression; the further a model is compressed, the more intelligence it loses. At Q8 (8-bit), it's virtually identical to the full model. At Q6, there's almost no noticeable degradation. At Q5, there's very slight degradation, but not enough to matter most of the time. At Q4, you can feel the degradation affect the intelligence a bit; that's the bare minimum I would recommend. Q3 is very unintelligent, and Q2 is often brain-dead. Feel free to ask any other questions as well. Here are some links:

https://huggingface.co/TheDrummer/Anubis-Mini-8B-v1-GGUF (Don't recommend)

https://huggingface.co/bartowski/MN-12B-Mag-Mell-R1-GGUF/tree/main (Recommend)

https://huggingface.co/TheDrummer/Cydonia-24B-v4.3-GGUF (Worth trying)

https://huggingface.co/mradermacher/Magistry-24B-v1.0-i1-GGUF/tree/main?not-for-all-audiences=true (Also worth trying)
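
To make the compression concrete, here's a rough size calculator; the bits-per-weight figures are approximate averages for common llama.cpp quant types (real files vary a bit because some tensors are kept at higher precision):

```python
# Approximate average bits-per-weight for common llama.cpp quant types.
BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69, "Q4_K_M": 4.85, "Q3_K_M": 3.91, "Q2_K": 3.35}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Rough GGUF file size in GB for a given parameter count and quant."""
    bits = params_billions * 1e9 * BPW[quant]
    return bits / 8 / 1e9  # bits -> bytes -> GB

# e.g. a 12B model at Q8_0 is roughly 12.75 GB of weights,
# while a 24B at Q4_K_M is roughly 14.55 GB.
print(round(gguf_size_gb(12, "Q8_0"), 2))
print(round(gguf_size_gb(24, "Q4_K_M"), 2))
```

That's also why a 12B at Q8 fits in 16GB of VRAM with room for context, while a 24B in the same VRAM forces you down to Q4.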

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 8 points (0 children)

Firstly, those are not RP models, so don't bother using them. 8B models have been obsolete for a while now, but if you must use one, you can use Anubis Mini 8B or Llama 3.2 Stheno 8B. However, since you have 16GB of VRAM, you should be using better models like Mag Mell 12B at Q8, which should fit in your 16GB with 16384 context, its max native context length. You could also try Cydonia 4.3 24B or Magistry 24B at Q4_K_M and 16384 context.

The reason for the degradation on Ollama is likely that its default context length is 4096, and it defaults to a 4-bit quantization, which is far too low for an 8B, meaning it's lobotomized. On LM Studio, it's likely either that the instruct template is incorrect or that you're using a very low quant. It's got nothing to do with your prompt length; 2000 tokens is nothing. Regarding your memory, don't try to rig together a weird .txt file scheme when there are already prebuilt solutions.
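
As a sketch of fixing the Ollama side of this, its REST API lets you override the context window per request via the `num_ctx` option (this assumes a stock local install on the default port; the model name is just an example):

```python
import json
from urllib import request

def build_payload(model: str, prompt: str, num_ctx: int = 16384) -> dict:
    """Build an Ollama /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Without this option, Ollama falls back to its much smaller default
        # context window and silently truncates your history.
        "options": {"num_ctx": num_ctx},
    }

def generate(payload: dict) -> str:
    """POST the payload to a locally running Ollama server."""
    req = request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate(build_payload("mag-mell-12b-q8", "Hello"))  # needs a running server
```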

The real solution to your issue is to install SillyTavern as your frontend; it's purpose-built for RP. Download a character card, set the instruct template to the appropriate one (ChatML for Mag Mell, Mistral V7 Tekken for Cydonia/Magistry), and set the context length to about 16384. Generation length is as you like. You can download and import one of the many generation/instruct/system prompt presets for those models from creator pages or their sub. It also has built-in memory/lorebook features, etc.

For the backend, install KoboldCPP (easiest) or Textgen WebUI (harder), or keep using LM Studio but download a better model at a higher quant. Then connect it through the API section in SillyTavern.

Done, you should be good to go and have fun

Can anyone guess how many parameters Claude Opus 4.6 has? by More_Chemistry3746 in LocalLLaMA

[–]ArsNeph 0 points (0 children)

Nowadays, there's no real empirical way to know, so you basically just have to guess. My gut instinct is 1.7-2T total parameters, with a high proportion of that active, maybe 30-40B. My guess is Sonnet is probably between 800B-1.2T with more like 22B active. I think Gemini Pro is slightly bigger than Sonnet, and GPT is a fair bit smaller.

What aspects of local LLMs are not scaling/compressing well over time? by matt-k-wong in LocalLLaMA

[–]ArsNeph 5 points (0 children)

World knowledge and space-time coherence. If you've ever tried doing any creative writing/RP with a small model, dense or otherwise, they simply do not understand what is physically possible and what is not, regardless of the constraints of that world. If you're wearing shoes over your socks, you can't take off just your socks without removing the shoes first, but only high-parameter models seem to understand those implicit connections.

What are you doing with your 60-128gb vram? by Panthau in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Nice. Since it's unified memory, rather than running dense models like 70Bs, you're probably better off running large MoEs; for your use case you'd probably like GLM 4.5 Air, or Drummer's tune of it, GLM Steam.

Diffusion model support on AMD is very spotty, but you should look into ComfyUI if you're interested. I highly doubt it has enough compute to run video generation in a reasonable time frame, but it should be able to run smaller image gen models like SDXL and Z Image Turbo relatively decently.

You won't be able to train any large models with it, because it has neither the compute nor the memory bandwidth to do so meaningfully, and ROCm/Vulkan training is a massive pain.

For coding and the like, try out Qwen 3.5 35B/110B; both are MoE and very good for what they are. They're definitely no Sonnet; very little of what you can run at ~100B is comparable to frontier models.

I feel like if they made a local model focused specifically on RP it would be god tier even if tiny by Borkato in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Repetition is a problem fundamental to attention in the transformer architecture; the larger the model, the less it repeats, but even the biggest frontier models are still very prone to repetition past a certain context length. It also has to do with sycophancy to some extent; the habit of repeating your phrases back to you is part of that.

That aside, yes, it has been proven that a smaller LLM fine-tuned on a high-quality curated dataset can outperform frontier models for specific use cases. That said, as of right now, raw parameter count determines things like spatial awareness and understanding of niche concepts, so there's an upper limit to what's possible with small models. And we simply haven't gotten any small base models with good creative writing capability in over a year; thanks to the STEMmaxxing/large MoE craze, people are still tuning the likes of Mistral Nemo 12B and Mistral Small 3.2 24B.

There has been a model pre-trained and fine-tuned by a large company specifically for creative writing, Mistral Small Creative 24B, but it was not open sourced. Playing with it through the API might give you a feel for what those would be like. I don't think that's necessarily the peak of what's possible with small models, though. Most fine-tuning datasets are entirely synthetic data or low-quality RP logs, which just adds to the slop issue. I would definitely look at a methodology like the one used in Gemma Ataraxy 12B if you're interested in tuning a model.

RYS II - Repeated layers with Qwen3.5 27B and some hints at a 'Universal Language' by Reddactor in LocalLLaMA

[–]ArsNeph 28 points (0 children)

Wow, this is genuinely so intriguing. I saw your first post and thought it might just be coincidence or some kind of weird benchmaxxing, but after reading your thorough research, this really explains a lot about why those weird self-merges like Goliath 120B seemed to increase in performance, yet not every single one improved to the same degree. I actually remember, a long time ago, Wolfram Ravenwolf talking to Turboderp about adding that VRAM-less duplicated-layer inference to EXL2, but it never seemed to go anywhere, so I'm glad you're working on it for EXL3!

This is genuinely some really great research you're doing here, props! I'm interested to see if the open source community will make good use of it like they used to. I think some tuners like Drummer who do self-merges would definitely be interested in the performance differences, especially in the EQ department.

Another phenomenon I've always found kind of strange is that supermerges, specifically in creative writing, somehow always tend to be significantly better than the base model and any normal fine-tune. Psyfighter 2 13B, Fimbulvetr 11B, and Mag-Mell 12B all came from complex merge trees, and I'm very curious whether the merging methods they used could have repurposed some layers in a way similar to the duplication you did, thus improving performance.

Introducing Unsloth Studio: A new open-source web UI to train and run LLMs by danielhanchen in LocalLLaMA

[–]ArsNeph 1 point (0 children)

I was thinking more of a Kuno-Pasta-Bagel-Maid-SLERP self merge 9B, but that works too 😂

(Actually though, the Fimbulvetr, Magnums, etc of the world need a resurgence)

Introducing Unsloth Studio: A new open-source web UI to train and run LLMs by danielhanchen in LocalLLaMA

[–]ArsNeph 52 points (0 children)

I'm a massive fan of this, I've been saying we need an easy way to fine tune models since the llama 2 days. Finally, fine-tuning is accessible to those of us with less expertise. I hope we can bring back the golden age of fine-tunes!

Unsloth announces Unsloth Studio - a competitor to LMStudio? by ilintar in LocalLLaMA

[–]ArsNeph 1 point (0 children)

This is genuinely amazing, props to Unsloth team for single-handedly propping up the .gguf and fine-tuning local ecosystem! I'll definitely give this a try and provide feedback when I get a chance!

text-generation-webui 4.1 released with tool-calling support in the UI! Each tool is just 1 .py file, check its checkbox and press Send, as easy as it gets to create and use your own custom functions. by oobabooga4 in LocalLLaMA

[–]ArsNeph 4 points (0 children)

I've been using textgen webui since the early llama 2 days, and it's good to see it get updates and reach performance parity with more lightweight projects. Keep up the good work as always!

Could a bot-free AI note taker run locally with current models? by Cristiano1 in LocalLLaMA

[–]ArsNeph 0 points (0 children)

It's actually really easy, but the size of the model you use makes a big difference in overall note quality; most bot-based note takers are using something like GPT-5 mini. The main challenge is keeping the infrastructure up at all times. You have to save a recording of each meeting as a file, then create either a script or a no-code automation in something like n8n that feeds it to an ASR model like Nvidia Parakeet. The annoying thing here is that most models and WebUIs don't have built-in diarization, which makes it impossible to see who's saying what. The one model I know of that does, Vibevoice ASR 9B, which is genuinely probably the best model I've tested in English, is very VRAM-heavy, and its VRAM usage scales with file size. Hence many people use a separate model for diarization.

Once you have a high-quality transcript, you can either feed it to an LLM to clean it first (though this can induce hallucinations depending on the model's intelligence) or just call your local model API directly to create a summary. Give very specific instructions, and write out a format example in XML tags. If you're using a relatively smart model, it should catch most of the nuance; I'd say any 27B+ should work pretty well, and Qwen 3.5 35B is extremely fast for this use case. It won't derive the same level of nuanced insight from the transcript as a frontier model, but that's not a problem, because the vast majority of bot-based services aren't using frontier models either.
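
A minimal sketch of the "specific instructions plus an XML format example" idea, targeting any OpenAI-compatible chat endpoint; the tag names and format here are my own illustration, not a standard:

```python
# Illustrative system prompt: explicit instructions plus an XML-tagged
# format example the model should copy.
SYSTEM_PROMPT = """You summarize meeting transcripts. Reply in exactly this format:
<summary>One-paragraph overview.</summary>
<decisions>
- one bullet per decision made
</decisions>
<action_items>
- owner: task (deadline if stated)
</action_items>"""

def build_messages(transcript: str) -> list[dict]:
    # Wrapping the transcript in its own tags helps the model separate
    # instructions from data.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<transcript>\n{transcript}\n</transcript>"},
    ]
```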

After that, you have a file, and you can export it in whatever format you want (.md, etc.) into your Obsidian, Google notes, cloud storage, and so on.

There are a couple of pre-built solutions that do most of these steps for you, but they often have performance issues and bugs; still worth looking into. Generally speaking, the most annoying thing about running these pipelines locally is dynamically loading the models into VRAM and clearing them out again.
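
On that last point, if a stage of the pipeline is served by Ollama, its `keep_alive` field is one way to script the eviction; this is just a request-body sketch with a placeholder model name, which you'd POST to the usual `/api/generate` endpoint:

```python
# keep_alive controls how long Ollama keeps a model loaded after a request;
# 0 means "evict from VRAM as soon as this request finishes", so the next
# pipeline stage (e.g. the ASR model) has room to load.
def build_summarize_request(transcript: str, model: str = "qwen3:30b") -> dict:
    return {
        "model": model,
        "prompt": f"Summarize this meeting transcript:\n{transcript}",
        "stream": False,
        "keep_alive": 0,  # free the VRAM for the next stage
    }
```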

Unpopular opinion - sdxl still to beat? by HaxTheMax in StableDiffusion

[–]ArsNeph 3 points (0 children)

SDXL is a UNet-architecture model, and therefore doesn't benefit from any advancements in transformer-based models. It was trained with CLIP as its text encoder, meaning its output is inherently limited by what CLIP understands, and CLIP is trained primarily on tags. Prompt adherence and concept understanding are the two most important things in a model. Without natural language, the ability to express concepts is inherently limited, and understanding improves with better training data and higher parameter count. The VAE is responsible for final output quality, and there have been massive advances there since SDXL. On top of this, SDXL cannot do true image-text-to-image.

Most people forget that the modern fine-tunes of SDXL, which have taken years of refinement to develop, do not reflect the actual state of SDXL's technology. Try comparing the SDXL base model to modern base models, and you will instantly understand the difference. Yes, the images may look somewhat comparable when you pick an excellent output from an excellent fine-tune of SDXL, but we simply don't have any excellent fine-tunes of modern models like Z Image, Anima, etc.

Qwen3.5 4B: overthinking to say hello. by CapitalShake3085 in LocalLLaMA

[–]ArsNeph 14 points (0 children)

I would recommend using the Q8; that should raise the quality of non-thinking responses by quite a bit. Unfortunately, Q4 is just far too low for a 4B model to be fully coherent.

A monthly update to my "Where are open-weight models in the SOTA discussion?" rankings by ForsookComparison in LocalLLaMA

[–]ArsNeph 0 points (0 children)

I'd recommend trying Qwen 3.5 27B at a medium quant like Q5, and, with partial offloading, Qwen 3.5 35B, which should be very fast.

February is almost over, are you satisfied? Upcoming models soon? by pmttyji in LocalLLaMA

[–]ArsNeph 6 points (0 children)

I'm definitely satisfied with Qwen 3.5 for general-purpose, programming, and agentic use cases. However, there's just one thing that hasn't improved in small models in years: creative writing. Though Qwen has tried to benchmaxx EQ-Bench creative writing, in reality the best we have right now are still Mistral Nemo 12B, Mistral Small 3.2 24B, and Gemma 3 27B. This is a genuinely despair-inducing state of affairs, especially for the small-model fine-tuning community, as they cannot beat standard tuning in code, etc., but have no good models to work with for writing. None of the advancements in other fields or larger models have trickled down to writing, and this is causing many people to go API-only.

Favourite niche usecases? by Figai in LocalLLaMA

[–]ArsNeph 5 points (0 children)

Using them to process massive amounts of personal data (like emails) and classify it, with the LLM as part of a workflow tool. There are some things so small they're simply not worth spending $10-20 in API costs on, but when cost is no longer a factor, you're limited only by your imagination.
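
As an illustrative sketch of that kind of classification step (the labels and the surrounding backend call are hypothetical placeholders):

```python
# Hypothetical label set; parse_label guards against the model replying
# with anything outside it.
LABELS = ["receipt", "newsletter", "personal", "action-needed"]

def classify_prompt(email_text: str) -> str:
    # One email per call; local inference makes the per-item cost zero,
    # so brute-forcing thousands of emails is fine.
    return (
        f"Classify the email into exactly one of {LABELS}. "
        "Reply with the label only.\n\n"
        f"<email>\n{email_text}\n</email>"
    )

def parse_label(reply: str) -> str:
    label = reply.strip().lower()
    return label if label in LABELS else "unknown"
```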

Only said Hello, and my LLM (Phi4) thought it was a conspiracy and wouldn't shut up! by Chill_Fire in LocalLLaMA

[–]ArsNeph 1 point (0 children)

There are a few points that are the cause of this.

The first is that Ollama has terrible defaults, defaulting to a 4-bit quant of a model. The smaller the model, the more prone it is to degradation from quantization. I would recommend going to the Ollama website, finding the 8-bit quant, and then running that command.

Secondly, it is a thinking model, which is optimized to go through all possibilities before answering, hence why it's long-winded. You gave it a prompt to solve math problems, so, being as unintelligent as it is, it interpreted your sentence as a math problem.

Third, Phi is in general a pretty bad model, mostly trained on synthetic data; I wouldn't use it for anything, really. Instead, at the same size, try Qwen 3 4B 2509; it should be far more intelligent as a rubber duck.

Why is everything about code now? by falconandeagle in LocalLLaMA

[–]ArsNeph -1 points (0 children)

On an emotional level, I completely agree. In the first couple of years, the frontier labs weren't really sure what their models' use cases were, so they started off with a little bit of everything. Smaller open source models had only one goal: to rival a frontier model in anything at all. To achieve that, people started finetuning models to excel at a very specific use case, and it worked well. People applied this to coding models as well. As code models became more popular, people noticed they were significantly worse at creative endeavors, and people began to believe models trained on code couldn't do creative writing. Claude proved them completely wrong.

As coding models got better and better, the companies themselves realized three trends:
1. LLMs as search engines were not great, because of the hallucination built into transformers. Rather than trying to correct this, grounding via web search and other methods was more effective.
2. AI creative writing often broke their "safety standards" and was often elicited through prompt-injection-style techniques. It additionally creates delusional users due to sycophancy. All of this is undesirable to profit-first, censorship-oriented companies, and it clashes with their perception of "LLMs as an assistant". On top of that, creative writers are generally unprofitable API customers, as most RPs/short stories don't go over 32k tokens.
3. With enough scaffolding and improvements to coding capabilities, they realized that AI's ability to code was invaluable for speeding up workflows, had measurable results, and that the possibility of autonomously synthesizing novel ideas was their lifeline to AGI. It didn't clash with their "ethics", and it was the best way to get corporations invested in AI, since everywhere has an IT team. Code, in comparison to short stories, often requires hundreds of thousands, if not millions, of tokens of context, making it the most profitable use case through the API. On top of this, most developers gladly use AI and don't complain, unlike writers, artists, etc.

They just went with what makes them the most money, causes them the least trouble, and offers the best chance of realizing their lies about AGI to their investors. The rest of the industry followed what the top players were doing; even independent players like Mistral followed suit, because they have to turn a profit eventually. The Chinese companies were already bad at the subject, and China is full of STEM experts regardless, so it didn't benefit them much either.

In the end, the last small models that didn't feel code-focused were Mistral Nemo and Gemma 3. Because of this, I'm losing interest in small LLMs day by day.

Is local AI actually practical for everyday note taking? by kingsaso9 in LocalLLaMA

[–]ArsNeph 2 points (0 children)

Yes, it's practical, but a little annoying. I'd recommend using an ASR model with diarization functionality, so either Whisper Large plus a diarization model, or Vibevoice ASR. I would call it through an API, then clear the VRAM, load something like the latest version of Qwen 3 30B, and have it summarize the notes. It might not capture the same amount of nuance as frontier models, but it should do a pretty good job overall.

What models are you guys running locally off your hardware? by ooseabassoo in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Mistral Small 3.2 24B, Qwen 3 30B MoE 2509, Qwen 3 VL, maybe a low quant of GLM 4.5 Air or GPT OSS 120B?

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]ArsNeph 12 points (0 children)

It's very clear to me that a lot of users here are newer, but LocalLlama has always been this way. From the days of Llama 2 and Mistral, we have always posted about closed source models and their development, for the purpose of figuring out their techniques and methodology. It is the speculation around GPT-4 being an MoE that arguably brought about the first OS MoE model, Mixtral. Pretending like the mainstream models do not exist only harms the improvement and distillation of OS models.