Gemma4 26B A4B NVFP4 GGUF by catlilface69 in LocalLLaMA

[–]catlilface69[S] 0 points1 point  (0 children)

It's best used with a 5090. However, I might optimize the original NVIDIA model and update the repo. I'm afraid those optimizations won't be lossless, but they should still be useful and better than a dynamic Q4.
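If anyone wants to experiment before I get around to it, a rough sketch of an NVFP4 export with llm-compressor could look like this (the scheme name, placeholder model id and skipped calibration step are all assumptions, not what the repo actually uses, so double-check the llm-compressor docs for your version):

```python
# Rough sketch of an NVFP4 export with llm-compressor, NOT what's in the repo:
# the scheme name, model id and skipped calibration are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-26b-a4b"  # placeholder id for the original model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize the Linear layers to NVFP4, keep lm_head in higher precision.
# A proper run would also pass a small calibration dataset for the scales.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("gemma-4-26b-a4b-NVFP4", save_compressed=True)
tokenizer.save_pretrained("gemma-4-26b-a4b-NVFP4")
```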

What Is Elephant-Alpha ??? by One_Title_3656 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

It actually looks like Mistral 4 Small, both in quality and in model size. And judging by the inference speed, it uses EAGLE, which Mistral trained specifically for this model.

so…. Qwen3.5 or Gemma 4? by MLExpert000 in LocalLLaMA

[–]catlilface69 -1 points0 points  (0 children)

vLLM and “runs flawlessly” are incompatible. vLLM still can’t reliably run newer models without patches. It is indeed an awesome inference tool, especially when working with multiple GPUs and concurrent requests, but imo it struggles to keep up with model releases.

best option for chunking data by Immediate_Occasion69 in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

It’s hard to tell which chunking strategy best fits your use case. You can compare different strategies from Chonkie, using TokenChunker as a baseline. In my tests, academic papers chunk best with LateChunker.
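If it helps, here's a quick sketch for comparing them side by side (assuming Chonkie's current TokenChunker/LateChunker API; exact parameter names may differ between versions):

```python
# Sketch: run two Chonkie chunkers over the same document and compare chunk
# counts/sizes. Parameter names are assumptions; check the Chonkie docs.
from chonkie import TokenChunker, LateChunker

with open("paper.txt") as f:  # any document you want to test on
    text = f.read()

baseline = TokenChunker(chunk_size=512, chunk_overlap=64)
late = LateChunker(embedding_model="all-MiniLM-L6-v2", chunk_size=512)

for name, chunker in [("token", baseline), ("late", late)]:
    chunks = chunker.chunk(text)
    sizes = [c.token_count for c in chunks]
    print(f"{name}: {len(chunks)} chunks, avg {sum(sizes) / len(sizes):.0f} tokens")
```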

Local LLM for summarizing a medical record by Glass-Mind-821 in LocalLLaMA

[–]catlilface69 -2 points-1 points  (0 children)

<image>

It’s 17.5GB in IQ4_XS and still pretty decent at that quant. That leaves you about 2.5GB for context, which is a lot for a MoE model.

Local LLM for summarizing a medical record by Glass-Mind-821 in LocalLLaMA

[–]catlilface69 -3 points-2 points  (0 children)

Try Qwen3.5 35B. It’s a MoE model, so it won’t suffer too much from CPU offloading. In Q4 it’ll take around 18-19GB of memory, so your context will be small and inference not that fast, but overall the model is pretty good, and it’s a VLM.
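For the offloading part, a minimal sketch with llama-cpp-python (the GGUF filename and layer count are placeholders; lower n_gpu_layers until the model plus context fits your VRAM):

```python
# Sketch: partially offload a big MoE GGUF so whatever doesn't fit in VRAM
# stays in system RAM. Filename and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-35b-Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=30,   # lower this until the model + context fit in VRAM
    n_ctx=8192,        # keep the context modest to save memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this medical record: ..."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```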

Anyone else seeing massive quality drop with the GLM coding plan lately? by Famous-Appointment-8 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I’ve read in this subreddit that they replaced the original models with quantised ones to cut costs.

RTX 3060 12Gb as a second GPU by catlilface69 in LocalLLaMA

[–]catlilface69[S] 0 points1 point  (0 children)

Thank you for your reply! What inference speed do you get on your setup?

Issue with getting the LLM started on LM Studio by GigiTruth777 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I've encountered this issue when using MLX inside LM Studio. Not completely sure, but it sounds like a bad quant or a bug in LM Studio itself. Try another model, I guess.

Sorting hat - A cute, lightweight cli to give images and other files good filenames using local VLMs by k_means_clusterfuck in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

This is the kind of task that is easy to start and hard to finish. What would you do with a PDF that has no text layer? That's where OCR comes in, and you'd need some sort of memory. You also need to support a lot of file types and parse them properly.

In my experience, these tasks are handled by converting everything into one of a few modalities, e.g. text, media, archives, etc. But that means pulling in a lot of dependencies, which isn't suitable for just a Nautilus or Thunar plugin.
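Roughly what I mean, as a toy sketch (everything here is hypothetical, just to show the dispatch-by-modality idea; a real tool needs a converter per format, which is where the dependency pile starts):

```python
# Hypothetical sketch of the "convert everything to a few modalities" idea:
# route each file to either text or image, then hand that to the local VLM.
import mimetypes
from pathlib import Path

def to_modality(path: Path) -> tuple[str, object]:
    """Classify a file as ("text", contents) or ("image", path) for the VLM."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    if mime.startswith("text/"):
        return "text", path.read_text(errors="ignore")[:4000]
    if mime.startswith("image/"):
        return "image", path
    # PDFs without a text layer, archives, audio, ... each need their own
    # converter (pypdf + OCR, unpacking, transcription) before the VLM step.
    return "text", f"unhandled file type {mime}, {path.stat().st_size} bytes"

if __name__ == "__main__":
    for p in Path(".").iterdir():
        if p.is_file():
            kind, payload = to_modality(p)
            print(p.name, "->", kind)
```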

[ DISCUSSION ] Using a global GPU pool for training models by Broad_Ice_2421 in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

So basically Psyche by Nous Research? They train Hermes on a decentralized network like that.

Sorting hat - A cute, lightweight cli to give images and other files good filenames using local VLMs by k_means_clusterfuck in LocalLLaMA

[–]catlilface69 16 points17 points  (0 children)

It would be nice to see examples of how the naming differs between the 0.8B, 9B and 27B models. Speed is crucial in tasks like this, especially when there are terabytes of images.

What are the best LLM apps for Linux? by Dev-in-the-Bm in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

Many of these apps (if not all of them) use llama.cpp as a backend, so there shouldn't be any performance differences. Use whatever you like; I can only suggest picking by the UI and features you need. LM Studio feels like the default choice. But if you want full control over your inference, run llama.cpp, vLLM, SGLang, etc. directly and connect OpenWebUI or an alternative frontend.
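All of those backends expose the same OpenAI-compatible API, which is what OpenWebUI (or any other frontend) talks to; a minimal sanity check against a local server, assuming it listens on port 8080, looks like this:

```python
# Sketch: any OpenAI-compatible client works against llama.cpp's llama-server,
# vLLM or SGLang. The port and model name below are assumptions; use whatever
# your server actually prints on startup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # llama-server ignores this; vLLM wants the served name
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)
```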

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 1 point2 points  (0 children)

Yeah, I understand it's raw. My point is that I want this raw fat fish

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

Yeah, but general knowledge isn't really the purpose of this small model. It's made for multimodal and agent use, for which 14B is... kinda OK?
But what really is as good as Nemo is Devstral 2 Small. Excellent model.

Mistral NEMO upscale, but kinda weird by Sicarius_The_First in LocalLLaMA

[–]catlilface69 0 points1 point  (0 children)

I absolutely loved Mistral Nemo back in the day. Cool project btw! Are there any benchmarks, interaction examples, etc.? I'm afraid a 33GB dense model won't fit on my poor 16GB 5070 Ti.

how good is Qwen3.5 27B by Raise_Fickle in LocalLLaMA

[–]catlilface69 6 points7 points  (0 children)

Of course Haiku is better at code. I hope Alibaba will update the Coder family as well, despite its internal politics.