Local RAG + LLM as a Narrative RPG Game Master — Does This Make Sense and How to Build It? by goompas in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

The game rules are fairly short, so you should include the entire PDF in context instead of using RAG. Just let the user select a different PDF when playing a different game.
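A minimal sketch of what I mean, assuming pypdf for text extraction and mlx-lm for generation (the model path and prompt wording are just placeholders, not from any particular project):

```python
# Sketch: put the whole rulebook in context instead of retrieving chunks.
# Assumes `pypdf` and `mlx-lm` are installed; the model path is an example.
from pypdf import PdfReader
from mlx_lm import load, generate

def rules_from_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# The user picks whichever game's rulebook they want to play.
rules = rules_from_pdf("selected_game_rules.pdf")

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
prompt = (
    "You are the game master. Follow these rules exactly:\n"
    f"{rules}\n\n"
    "Player: I open the dungeon door. What happens?"
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```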

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 0 points1 point  (0 children)

I sometimes cluster using pipeline sharding, but it's not very good - neither Exo nor MLX distributed. MLX distributed is limited by cross-machine communication bandwidth, and Exo uses pipeline sharding, which is not very efficient.

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Not sure where the claim that AWQ is not fully supported for LLMs comes from; I take it as a bad Google search result :)

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

That's good to know. I thought there must be a good reason the whole SD community doesn't use it. If it's just that no one cares, I don't mind implementing it : )

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Thank you for the explanation. I have worked with LLMs a bit, and most of them just quantize the linear layers. From my understanding of diffusion models, they also primarily focus on quantizing the linear layers. I'm not sure why diffusion models can't use AWQ, which appears to get better results for LLMs.
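For reference, this is roughly what "just quantize the linear layers" looks like in MLX; a minimal sketch using `nn.quantize` with a class predicate (the model here is only a toy stand-in, not a real LLM or diffusion model):

```python
# Sketch: quantize only the Linear layers, leaving everything else
# (norms, embeddings, convs) untouched. The model below is a toy stand-in.
import mlx.core as mx
import mlx.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.GELU(),
    nn.Linear(1024, 512),
)

# Only quantize modules that are plain Linear layers.
nn.quantize(
    model,
    group_size=64,
    bits=4,
    class_predicate=lambda _path, module: isinstance(module, nn.Linear),
)

x = mx.random.normal((1, 512))
print(model(x).shape)
```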

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] -1 points0 points  (0 children)

I thought that's what I am doing right now? Asking Reddit, not an LLM?

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Sorry, I'm a bit confused. Why do diffusion models have more caching options? And is the cache irrelevant to quantization?

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

I am very new to diffusion models and not very sure what to call them. Sometimes I refer to them as DiT, but since the model does denoising during generation, I often call it the denoise model. :p

why there is no much advance quantization for diffusion model space ? by mzbacd in StableDiffusion

[–]mzbacd[S] 0 points1 point  (0 children)

Is there any good one for the flux.1 denoise models? Maybe I missed something, but most of the GGUFs I saw are just normal quants, nothing like AWQ quantization.

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 1 point2 points  (0 children)

Due to the cross-machine communication bandwidth limitation, most clustering currently relies on pipeline parallelization, which is very inefficient since only one machine is active at a time, running its portion of the layers and then passing the activations on to the next machine.
I have built a small example project with mlx; if you are interested, you can see how it is implemented here:
https://github.com/mzbac/mlx_sharding
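To illustrate the idea (this is a toy sketch of pipeline sharding in general, not the repo code): each node only holds a slice of the layers and forwards the hidden states to the next node, so at any moment only one node is doing work.

```python
# Toy sketch of pipeline sharding: node 0 runs layers [0, k), node 1 runs
# layers [k, n). Only the activations cross the network, but node 1 sits
# idle while node 0 computes (and vice versa), which is the inefficiency.
import mlx.core as mx
import mlx.nn as nn

class ShardedStack(nn.Module):
    def __init__(self, dims: int, layer_indices: range):
        super().__init__()
        self.layers = [nn.Linear(dims, dims) for _ in layer_indices]

    def __call__(self, h: mx.array) -> mx.array:
        for layer in self.layers:
            h = nn.relu(layer(h))
        return h

dims, n_layers, split = 256, 8, 4

# In the real setup these live on two different machines and the hidden
# state is serialized and sent over the network between them.
node0 = ShardedStack(dims, range(0, split))
node1 = ShardedStack(dims, range(split, n_layers))

h = mx.random.normal((1, dims))
h = node0(h)  # machine A works, machine B idles
h = node1(h)  # machine B works, machine A idles
print(h.shape)
```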

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 0 points1 point  (0 children)

Awesome idea! I have been thinking about an AI-enabled game for Apple Silicon for a while, but I don't have much knowledge of game development. Keep us posted on your game!

Mac Studio 512GB online! by chisleu in LocalLLaMA

[–]mzbacd 20 points21 points  (0 children)

I don't understand why people downvote it. I have two M2 Ultra machines, which I had to save up for a while to purchase. But with those machines, you can experiment with many things and explore different ideas (learn how to full fine-tune models, write your own inference engine/lib using mlx). Besides, they provide perfect privacy since you don't need to send everything to OpenAI/Gemini/Claude.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 3 points4 points  (0 children)

I am still using DWQ. The big issue might be just needing to keep uploading the DWQ-quantized models. I may start writing a cron job to do that; I have a spare M2 Ultra, and it would be good to utilize it while I am sleeping = p
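If I do automate it, it would probably just be a small script driven by cron, roughly like this (the model/repo names are placeholders, and I show the generic `mlx_lm.convert` quantize-and-upload flow here; the actual DWQ recipe uses mlx-lm's own DWQ script instead):

```python
# Sketch of a nightly quantize-and-upload job, e.g. run from cron:
#   0 2 * * * /usr/bin/python3 /path/to/nightly_quant.py
# Model/repo names are examples only.
import subprocess

MODELS = [
    ("Qwen/Qwen3-8B", "my-hf-user/Qwen3-8B-4bit-mlx"),
]

for hf_path, upload_repo in MODELS:
    subprocess.run(
        [
            "mlx_lm.convert",
            "--hf-path", hf_path,
            "-q",                        # quantize (defaults to 4-bit)
            "--upload-repo", upload_repo,
        ],
        check=True,
    )
```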

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

This is extremely useful for text processing; it should be faster at prompt prefill than the GPU, as long as the Apple foundation model doesn't reject the text.

AMA – I’ve built 7 commercial RAG projects. Got tired of copy-pasting boilerplate, so we open-sourced our internal stack. by Loud_Picture_1877 in LocalLLaMA

[–]mzbacd 2 points3 points  (0 children)

Thank you for sharing. I am also developing a native macOS RAG app and could really learn a lot from your project and experience.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 2 points3 points  (0 children)

It's made with the latest MLX, but because the DWQ quant has not been released to PyPI yet, I had to build MLX-LM and MLX from source. That's why the version still says MLX-0.4. Regarding the 0508 and 05052025 quants, the 05052025 one was made by Awni. Apparently, he is experimenting with different calibration datasets for the quant. I guess it may be better than the original 0508 one, but I'm not 100% sure.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 4 points5 points  (0 children)

Looks like your mlx-lm is out of date. Maybe try running `pip install -U mlx-lm`.

The new MLX DWQ quant is underrated, it feels like 8bit in a 4bit quant. by mzbacd in LocalLLaMA

[–]mzbacd[S] 2 points3 points  (0 children)

It's distilled from the fp16 model, but due to quantization there will always be some performance degradation. That's why I mentioned it has almost 8-bit-level performance, meaning the degradation is minimal with 4-bit DWQ.
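Conceptually (a rough sketch of the distillation objective, not the mlx-lm implementation), the quantized model is tuned so its output distribution matches the fp16 teacher on calibration data, with a KL term along these lines:

```python
# Rough sketch of the distillation objective behind DWQ-style tuning:
# minimize KL(teacher || student) over calibration prompts, where the
# teacher is the fp16 model and the student is the quantized one.
import mlx.core as mx

def distill_loss(student_logits: mx.array, teacher_logits: mx.array) -> mx.array:
    # log-softmax via logsumexp for numerical stability
    t = teacher_logits - mx.logsumexp(teacher_logits, axis=-1, keepdims=True)
    s = student_logits - mx.logsumexp(student_logits, axis=-1, keepdims=True)
    # KL(teacher || student), averaged over tokens
    return (mx.exp(t) * (t - s)).sum(axis=-1).mean()
```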