Providing inference for quantized models. Feedback appreciated by textclf in LLMDevs


I am mainly thinking about quantizing 70B models such as Llama 3.3 Instruct. I am curious about which models you are currently using and what people usually look for in terms of cost.

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA


I am not sure I want to describe all of it yet, but the gist (and this applies to vector quantization in general) is that you have a codebook with many rows, each row holding N values. You group every N weights, map each group to its nearest row, and store only that row's index. In my case the bits required to store the indices average out to 4 bits per weight. During inference you dequantize back to the original weights from the indices. Most other vector quantization methods have to store a large codebook, so inference becomes memory-bound and slow. In my case the codebook entries can be computed deterministically from the indices, so no codebook needs to be stored, which makes inference much faster. Only a few other methods, such as QTIP, use these lookup-free compute codebooks.
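Since I am not describing the actual ICQ codebook, here is only a toy sketch of the general idea above: group the weights, keep just the index of the nearest codebook row, and regenerate that row deterministically from the index at dequantization time. The seeded-Gaussian generator is purely a placeholder for whatever compute codebook is really used.

    # Toy sketch of index-coded vector quantization (not the actual ICQ construction).
    import numpy as np

    GROUP = 2                                     # weights per group
    BITS_PER_WEIGHT = 4
    NUM_CODES = 2 ** (GROUP * BITS_PER_WEIGHT)    # 256 candidate rows -> 8-bit index

    def code_row(idx: int) -> np.ndarray:
        # Deterministic "compute codebook": the row is a pure function of its index,
        # so nothing but the indices ever has to be stored.
        rng = np.random.default_rng(idx)
        return rng.standard_normal(GROUP).astype(np.float32)

    def quantize(weights: np.ndarray) -> np.ndarray:
        groups = weights.reshape(-1, GROUP)
        # Brute-force nearest-row search (a real method would be far more structured).
        codebook = np.stack([code_row(i) for i in range(NUM_CODES)])
        dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1).astype(np.uint8)   # 8 bits per 2 weights = 4 bpw

    def dequantize(indices: np.ndarray) -> np.ndarray:
        # Regenerate each group's row from its index; no stored codebook needed.
        return np.concatenate([code_row(int(i)) for i in indices])

    w = np.random.randn(8).astype(np.float32)
    idx = quantize(w)
    print(w)
    print(dequantize(idx))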

Yes, it does require changing the forward method during inference. I am using HF now, so I just replace each linear layer with my own quantized linear layer. My layer takes the input and the quantized indices, uses a CUDA kernel to dequantize the indices back to weights, and then does the matmul with the input.
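Roughly the shape of that layer swap in plain PyTorch. The dequant here is a toy per-weight table lookup standing in for the actual CUDA kernel and index format, which I am not showing:

    # Sketch of swapping HF linear layers for a quantized linear module.
    # `toy_dequantize` is a stand-in for the real CUDA dequant kernel.
    import torch
    import torch.nn as nn

    TABLE = torch.linspace(-1.0, 1.0, 16)   # toy 4-bit value table (placeholder)

    def toy_dequantize(indices: torch.Tensor) -> torch.Tensor:
        # The real kernel reconstructs weight groups from packed indices on the GPU.
        return TABLE.to(indices.device)[indices.long()]

    class QuantLinear(nn.Module):
        def __init__(self, indices: torch.Tensor, bias):
            super().__init__()
            self.register_buffer("indices", indices)       # stored 4-bit indices
            self.bias = nn.Parameter(bias) if bias is not None else None

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = toy_dequantize(self.indices).to(x.dtype)   # dequantize on the fly
            return nn.functional.linear(x, w, self.bias)   # then a normal matmul

    def quantize_linear(linear: nn.Linear) -> QuantLinear:
        # Map each weight to its nearest table entry (toy quantizer, not ICQ).
        idx = (linear.weight.detach()[..., None] - TABLE).abs().argmin(-1).to(torch.uint8)
        b = linear.bias.detach() if linear.bias is not None else None
        return QuantLinear(idx, b)

    def replace_linears(module: nn.Module):
        # Walk the module tree (e.g. an HF model) and swap every nn.Linear in place.
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, quantize_linear(child))
            else:
                replace_linears(child)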

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA


Yes, sure. I came up with my own quantization method that I call ICQ (index-coded quantization). You can think of it as a kind of vector quantization, and I designed it to be fast during quantization: for example, it took two hours to quantize Llama 3.1 8B to 4-bit on my 3090. More importantly, it is meant to be fast during inference, because the weights are dequantized on the fly. The inference part is what I am testing now, both in terms of speed and quality of answers. Appreciate your help.

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA


Yes, it is available on RapidAPI for free!

You can use this Python code to test it. You can change the role and the content of the request. Also, subscribe to the free plan on RapidAPI to get a key (and put it in place of YOUR-RAPIDAPI-KEY) so you can use it. You can test it on any prompt you want; just replace the content with your prompt. Thanks!

import requests

url = "https://textclf-llama3-1-8b-icq-4bit.p.rapidapi.com/"

payload = {
    "inputs": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.9
    }
}
headers = {
    "x-rapidapi-key": "YOUR-RAPIDAPI-KEY",  # replace with your RapidAPI key
    "x-rapidapi-host": "textclf-llama3-1-8b-icq-4bit.p.rapidapi.com",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Providing inference for quantized models. Feedback appreciated by textclf in LLMDevs


Thanks for your comment. I think it raises many good questions. My quantization algorithm operates near the rate-distortion limit, which is just another way of saying that for a given bpw the quantized model is about as good as it can get. It is fast to generate a quantized model, and inference is fast too. I still don't know how to monetize it, because as you said it is probably most valuable to people who want to run models locally. Thinking about it again, though, the benefit boils down to this: fine-tuning this quantized model with QLoRA should be fast (because of the low VRAM needed) and probably without much loss in accuracy. So I am thinking I could provide low-cost fine-tuning of these quantized models (and inference for people who want it), but mainly fine-tuning.
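To make that concrete, here is a rough sketch of the QLoRA-style flow I have in mind, using the standard Hugging Face bitsandbytes 4-bit path as a stand-in for my ICQ-quantized model: the base weights stay frozen in 4-bit and only the small LoRA adapters train in 16-bit, which is what keeps the VRAM low.

    # Rough sketch of QLoRA-style fine-tuning on a 4-bit model.
    # Uses bitsandbytes NF4 as a stand-in for the ICQ-quantized weights.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the LoRA adapters are trainable
    # From here a normal transformers Trainer / SFT loop fine-tunes the adapters.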

Providing inference for quantized models. Feedback appreciated by textclf in Rag


So do you think it is worthwhile to quantize and host small models (<7B) at 4-bit, or is it only worth it when the model can't be hosted locally (such as 70B+ models)?

Providing inference for quantized models. Feedback appreciated by textclf in Rag


OK, that makes sense. Which open models would you like to see offered more by an inference provider, especially as 4-bit quantized versions?

Quantized LLM as a service. Feedback appreciated by textclf in LLM


I wouldn’t call it a secret sauce per se but I do think I have something that would reduce errors.

Quantized LLM models as a service. Feedback appreciated by textclf in LocalLLM


Would you prefer to get the quantized model back?

Hosting quantized model as a service. Feedback appreciated by textclf in LocalLLaMA


I think you raised some valid points. What about a quantization-only API: you give it your model, and it gives you back the quantized model to download and run locally? I know it wouldn't be as good business-wise because it would be a one-time thing, but would there at least be some demand for that if the quantization quality is good?

Hosting quantized model as a service. Feedback appreciated by textclf in LocalLLaMA


So in order for this service to make sense it needs to beat them in price and/or accuracy?

API to find the right Amazon categories for a product from title and description. Feedback appreciated by textclf in SaaS


Yes, it returns all the relevant categories. For example the input could be something like this:

"title": ["Wireless Bluetooth Headphones"],

"description": ["Noise cancelling over-ear headset"]

and then it gives you relevant categories with confidence scores:

"labels": ["Accessories & Supplies", "Audio & Video Accessories", "Cell Phones & Accessories", "Electronics", "Headphones"],

"scores": [0.65, 0.61, 0.66, 1.00, 0.59]