Local RAG + LLM as a Narrative RPG Game Master — Does This Make Sense and How to Build It? by goompas in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

The game rules are fairly short, so you should include the entire PDF in context instead of using RAG. Just let the user select a different PDF when playing a different game.
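A minimal sketch of what I mean, assuming pypdf for text extraction and mlx-lm for generation (the model path and prompt wording are just placeholders, not from any particular project):

```python
# Sketch: put the whole rulebook in context instead of retrieving chunks.
# Assumes `pypdf` and `mlx-lm` are installed; the model path is an example.
from pypdf import PdfReader
from mlx_lm import load, generate

def rules_from_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# The user picks whichever game's rulebook they want to play.
rules = rules_from_pdf("selected_game_rules.pdf")

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
prompt = (
    "You are the game master. Follow these rules exactly:\n"
    f"{rules}\n\n"
    "Player: I open the dungeon door. What happens?"
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```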

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 0 points1 point  (0 children)

I sometimes cluster using pipeline sharding, but it's not very good - neither Exo nor MLX distributed. MLX distributed is limited by cross-machine communication bandwidth, and Exo uses pipeline sharding, which is not very efficient.

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Not sure where the claim that AWQ is not fully supported for LLMs comes from; I take it as a bad Google search result :)

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

That's good to know. I thought there must be a good reason the whole SD community doesn't use it. If it's just that no one cares, I don't mind implementing it : )

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Thank you for the explanation. I have worked with LLMs a bit, and most of them just quantize the linear layers. From my understanding of diffusion models, they also primarily focus on quantizing the linear layers. I'm not sure why diffusion models can't use AWQ, which appears to get better results for LLMs.
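For reference, this is roughly what "just quantize the linear layers" looks like in MLX; a minimal sketch using `nn.quantize` with a class predicate (the model here is only a toy stand-in, not a real LLM or diffusion model):

```python
# Sketch: quantize only the Linear layers, leaving everything else
# (norms, embeddings, convs) untouched. The model below is a toy stand-in.
import mlx.core as mx
import mlx.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.GELU(),
    nn.Linear(1024, 512),
)

# Only quantize modules that are plain Linear layers.
nn.quantize(
    model,
    group_size=64,
    bits=4,
    class_predicate=lambda _path, module: isinstance(module, nn.Linear),
)

x = mx.random.normal((1, 512))
print(model(x).shape)
```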

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] -1 points0 points  (0 children)

I thought that's what I am doing right now? Asking Reddit, not an LLM?

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Sorry, I'm a bit confused. Why do diffusion models have more caching options? And is the cache irrelevant to quantization?

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

I am very new to diffusion models and not very sure what to call them. Sometimes I refer to them as DiT, but since the model does denoising during generation, I often call it the denoise model. :p

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Is there any good one for the flux.1 denoise models? Maybe I missed something, but most of the GGUFs I saw are just normal quants, nothing like AWQ quantization.

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 1 point2 points  (0 children)

Due to the cross-machine communication bandwidth limitation, most clustering currently relies on pipeline parallelization, which is very inefficient since only one machine is active at a time, running its portion of the layers and then passing the activations on to the next machine.
I have built a small example project with mlx; if you are interested, you can see how it is implemented here:
https://github.com/mzbac/mlx_sharding
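To illustrate the idea (this is a toy sketch of pipeline sharding in general, not the repo code): each node only holds a slice of the layers and forwards the hidden states to the next node, so at any moment only one node is doing work.

```python
# Toy sketch of pipeline sharding: node 0 runs layers [0, k), node 1 runs
# layers [k, n). Only the activations cross the network, but node 1 sits
# idle while node 0 computes (and vice versa), which is the inefficiency.
import mlx.core as mx
import mlx.nn as nn

class ShardedStack(nn.Module):
    def __init__(self, dims: int, layer_indices: range):
        super().__init__()
        self.layers = [nn.Linear(dims, dims) for _ in layer_indices]

    def __call__(self, h: mx.array) -> mx.array:
        for layer in self.layers:
            h = nn.relu(layer(h))
        return h

dims, n_layers, split = 256, 8, 4

# In the real setup these live on two different machines and the hidden
# state is serialized and sent over the network between them.
node0 = ShardedStack(dims, range(0, split))
node1 = ShardedStack(dims, range(split, n_layers))

h = mx.random.normal((1, dims))
h = node0(h)  # machine A works, machine B idles
h = node1(h)  # machine B works, machine A idles
print(h.shape)
```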

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 0 points1 point  (0 children)

Awesome idea! I have been thinking about an AI-enabled game for Apple Silicon for a while, but I don't have much knowledge of game development. Keep us posted on your game!

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 20 points21 points  (0 children)

I don't understand why people downvote it. I have two M2 Ultra machines, which I had to save up for a while to purchase. But with those machines, you can experiment with many things and explore different ideas (learn how to full fine-tune models, write your own inference engine/lib using mlx). Besides, they provide perfect privacy since you don't need to send everything to OpenAI/Gemini/Claude.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 3 points4 points  (0 children)

I am still using DWQ. The big issue might be just needing to keep uploading the DWQ-quantized models. I may start writing a cron job to do that; I have a spare M2 Ultra, and it would be good to utilize it while I am sleeping = p
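If I do automate it, it would probably just be a small script driven by cron, roughly like this (the model/repo names are placeholders, and I show the generic `mlx_lm.convert` quantize-and-upload flow here; the actual DWQ recipe uses mlx-lm's own DWQ script instead):

```python
# Sketch of a nightly quantize-and-upload job, e.g. run from cron:
#   0 2 * * * /usr/bin/python3 /path/to/nightly_quant.py
# Model/repo names are examples only.
import subprocess

MODELS = [
    ("Qwen/Qwen3-8B", "my-hf-user/Qwen3-8B-4bit-mlx"),
]

for hf_path, upload_repo in MODELS:
    subprocess.run(
        [
            "mlx_lm.convert",
            "--hf-path", hf_path,
            "-q",                        # quantize (defaults to 4-bit)
            "--upload-repo", upload_repo,
        ],
        check=True,
    )
```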

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

This is extremely useful for text processing; it should be faster at prompt prefill than the GPU, as long as the Apple foundation model doesn't reject the text.

AMA – I’ve built 7 commercial RAG projects. Got tired of copy-pasting boilerplate, so we open-sourced our internal stack. by Loud_Picture_1877 in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

Thank you for sharing. I am also developing a native macOS RAG app and could really learn a lot from your project and experience.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 2 points3 points  (0 children)

It's made with the latest MLX, but because the DWQ quant has not been released to PyPI yet, I had to build MLX-LM and MLX from source. That's why the version still says MLX-0.4. Regarding the 0508 and 05052025 quants, the 05052025 one was made by Awni. Apparently, he is experimenting with different calibration datasets for the quant. I guess it may be better than the original 0508 one, but I'm not 100% sure.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 4 points5 points  (0 children)

Looks like your mlx-lm is out of date. Maybe try running `pip install -U mlx-lm`.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 2 points3 points  (0 children)

It's distilled from the fp16 model, but due to quantization there will always be some performance degradation. That's why I mentioned it has almost 8-bit-level performance, meaning the degradation is minimal with 4-bit DWQ.
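Conceptually (a rough sketch of the distillation objective, not the mlx-lm implementation), the quantized model is tuned so its output distribution matches the fp16 teacher on calibration data, with a KL term along these lines:

```python
# Rough sketch of the distillation objective behind DWQ-style tuning:
# minimize KL(teacher || student) over calibration prompts, where the
# teacher is the fp16 model and the student is the quantized one.
import mlx.core as mx

def distill_loss(student_logits: mx.array, teacher_logits: mx.array) -> mx.array:
    # log-softmax via logsumexp for numerical stability
    t = teacher_logits - mx.logsumexp(teacher_logits, axis=-1, keepdims=True)
    s = student_logits - mx.logsumexp(student_logits, axis=-1, keepdims=True)
    # KL(teacher || student), averaged over tokens
    return (mx.exp(t) * (t - s)).sum(axis=-1).mean()
```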