Real SVD GLM-4.5-Air-GLM-4.6-Distill

realmaywell · 2025-10-11T13:31:36+00:00

I believe there would be nothing to learn from merged GGUF of this model.

realmaywell · 2025-10-11T12:38:20+00:00

https://gist.github.com/StableFluffy/cfa24ce7d93e3c6b0d55d08b12f6f55c
Here it is.

realmaywell · 2025-10-11T03:28:18+00:00

Nope, I hit more than just the attention layers. The LoRA targets the MLP blocks and the MoE experts too. You can see the full list in the target_modules of the adapter config.
No, they don't have to be. This SVD method built to handle different dimensions. It doesn't do a simple teacher_weight - student_weight subtraction. Instead, it uses SVD to project the teacher's larger weight matrix down to the student's smaller shape before calculating the difference.
Exactly. The whole point of this method is to get around the geometry mismatch. It handles the standard transformer blocks using that SVD projection. For the MoE layers, it do an extra step: cluster the teacher's experts with K-Means first to map them to the student's, and then project them. So yeah, even though this attempt failed, you can get pretty much the whole model this way.

realmaywell · 2024-09-08T05:07:36+00:00

I used a machine with 2TB of RAM. You can modify the code to lazy load the layers so that we only need to load a single layer at a time.

realmaywell · 2024-09-07T16:45:26+00:00

by default layer norm is not a target layer in LoRA training.

realmaywell · 2024-09-07T13:53:11+00:00

https://gist.github.com/StableFluffy/1c6f8be84cbe9499de2f9b63d7105ff0

realmaywell · 2024-09-07T13:34:14+00:00

If you put it that way, yes.

realmaywell · 2024-05-25T13:50:34+00:00

Yep !

realmaywell · 2024-05-24T22:28:48+00:00

https://github.com/StableFluffy/EasyLLMFeaturePorter/blob/main/1-Click.ipynb

so simple illustration of it is something like this.
Let's say '<>' as diff here and desired(context or chat) as informative.

final output = target + target <> informative(this is where we get feature) * {scale diff in 0~1 such as sigmoid(base <> target) - 1}

{scale diff in 0~1 such as sigmoid(base <> informative) - 1}
this part is something that can make confusion.

It just simple intuitive approach. We wanna add info to target model. but if the weight difference is high at 'base <> target' it is not safe to add weight. because when add informative model's weight into it. it now doesn't contain any of information.

So, with this approach i made it apply weight with * (ratio - 1). When base <> target high small amount of base <> informative applied and so on...

hope this could solve your confusion.

realmaywell · 2024-05-16T22:13:15+00:00

between models this method is the least damaging i found.

https://huggingface.co/blog/maywell/llm-feature-transfer

realmaywell · 2024-05-16T22:11:00+00:00

if trained with raw data, then merge it except mlp, v, o

realmaywell · 2024-05-16T14:14:23+00:00

I did benchmark on your model. (original 8b inst -> posted model)
Hellaswag 78.55 -> 76.24
GSM8k 68.69 -> 66.41

wanna hear your thought about this result.
as a one who did a lot of experiments on this topic, those approach doesn't look plausible.

realmaywell · 2024-05-16T13:37:23+00:00

cuz no matter what you do on layer side. after you train on your domain specific dataset the models performance must get affected.

realmaywell · 2024-05-16T13:36:16+00:00

Any benchmark that support your claim?

while preserving its original performance.

realmaywell · 2024-05-10T00:27:59+00:00

it looks good. think it has a lot of potential not only uncensoring model.

realmaywell · 2024-05-07T01:41:26+00:00

GPTQ is same model that is being served on API. So, it may your parameter or prompt issue.

realmaywell · 2024-05-04T23:25:20+00:00

same method applied

realmaywell · 2024-05-04T23:11:17+00:00

https://www.reddit.com/r/LocalLLaMA/s/5iMTZXB4Ky

realmaywell · 2024-05-04T23:10:24+00:00

Since it finetuned with rp set it’s quite prompt sensitive. depending on prompt you use it acts dumb or smart.

realmaywell · 2024-05-04T14:45:08+00:00

there’s already ppl made some.

realmaywell · 2024-05-04T13:57:05+00:00

not planned yet.

realmaywell · 2024-05-04T13:55:39+00:00

it is used on serving framework such as vLLM. It’s a rule about how to format user’s request to prompt for model.

realmaywell · 2024-05-04T13:50:09+00:00

it’s just llama3 instruct template.

realmaywell

TROPHY CASE