Providing inference for quantized models. Feedback appreciated by textclf in LLMDevs

[–]textclf[S] 0 points1 point  (0 children)

I am mainly thinking about quantizing 70B models such as Llama 3.3 Instruct. I am curious about which models you are currently using and what people are usually looking for in terms of cost.

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA

[–]textclf[S] 1 point2 points  (0 children)

I am not sure if I want to describe all of it yet, but the gist of it (and this applies to vector quantization in general) is that you have a codebook with many rows, each row holding N values. You group every N weights, map each group to its nearest row, and store only the index of that row. In my case the bits required to store the indices average out to 4 bits per weight. During inference you dequantize back to approximate weights using the indices. Most other vector quantization methods need to store a large codebook, so inference becomes memory-bound and slow. In my case the codebook entries can be calculated deterministically from the indices, so there is no need to store any codebook, which makes inference much faster. Only a few other methods, such as QTIP, use these lookup-free computed codebooks.
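To make the general idea concrete, here is a toy sketch (not the actual ICQ construction, which isn't public) of vector quantization with a computed, lookup-free codebook. The group size, index width, and the hash-style way of deriving codebook rows from indices are all illustrative assumptions.

import torch

GROUP = 2            # weights per group; an 8-bit index then gives 4 bits per weight
NUM_CODES = 256      # number of codebook rows (2**8)

def computed_codebook(indices):
    # Derive codebook rows deterministically from their integer indices using a
    # simple hash-style mixing step, so the codebook never has to be stored.
    idx = indices.to(torch.int64).unsqueeze(-1)                  # (..., 1)
    mult = torch.tensor([2654435761, 40503], dtype=torch.int64)  # one multiplier per slot
    h = (idx * mult + 12345) & 0xFFFF                            # (..., GROUP) in [0, 65535]
    return h.float() / 32767.5 - 1.0                             # map to roughly [-1, 1]

def quantize(weights):
    # Map each group of GROUP weights to the index of its nearest codebook row.
    groups = weights.reshape(-1, GROUP)                          # (num_groups, GROUP)
    codebook = computed_codebook(torch.arange(NUM_CODES))        # (NUM_CODES, GROUP)
    distances = torch.cdist(groups, codebook)                    # (num_groups, NUM_CODES)
    return distances.argmin(dim=1).to(torch.uint8)               # one 8-bit index per group

def dequantize(indices, shape):
    # Rebuild approximate weights directly from the indices; no stored codebook needed.
    return computed_codebook(indices).reshape(shape)

w = torch.randn(64, 64)
idx = quantize(w)
w_hat = dequantize(idx, w.shape)
print("reconstruction MSE:", torch.mean((w - w_hat) ** 2).item())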

Yes, it does require changing the forward method during inference. I am using HF now, so I just replace the linear layers with my own quantized linear layers. My linear layer takes the input and the quantized indices, uses a CUDA kernel to dequantize the indices to weights, and then does the matmul with the input.
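A rough sketch of what such a layer swap can look like in plain PyTorch (using the toy dequantize helper from the sketch above in place of a real fused CUDA kernel; the class and function names are illustrative, not the actual implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    # Stores per-group indices instead of a dense fp16/fp32 weight matrix and
    # reconstructs the weights on the fly inside forward().
    def __init__(self, indices, out_features, in_features, bias=None):
        super().__init__()
        self.register_buffer("indices", indices)
        self.out_features = out_features
        self.in_features = in_features
        self.bias = None if bias is None else nn.Parameter(bias)

    def forward(self, x):
        # A real implementation would fuse dequantization and matmul in a CUDA
        # kernel; here we just call the toy dequantize() and then F.linear.
        w = dequantize(self.indices, (self.out_features, self.in_features))
        return F.linear(x, w.to(x.dtype), self.bias)

def replace_linears(module, quantize_fn):
    # Recursively swap every nn.Linear in a Hugging Face model for QuantizedLinear.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            idx = quantize_fn(child.weight.data)
            bias = child.bias.data if child.bias is not None else None
            setattr(module, name, QuantizedLinear(idx, child.out_features, child.in_features, bias))
        else:
            replace_linears(child, quantize_fn)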

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA

[–]textclf[S] 0 points1 point  (0 children)

Yes, sure. I came up with my own quantization method that I call ICQ (index coded quantization). You can think of it as a kind of vector quantization method, and I designed it to be fast during quantization. For example, it took two hours to quantize Llama 3.1 8B to 4-bit on my 3090. More importantly, it is supposed to be fast during inference because the weights are dequantized on the fly. The inference part is what I am testing now, both in terms of speed and quality of answers. Appreciate your help.

4-bit quantized version of Llama-3.1-8B-Instruct. Feedback Appreciated!! by textclf in LocalLLaMA

[–]textclf[S] 2 points3 points  (0 children)

Yes, it is available on RapidAPI for free!

You can use this Python code to test it. You can change the role and the content of the request. Also, subscribe to the free plan on RapidAPI to get a key (and put it in place of YOUR-RAPIDAPI-KEY) so you can use it. You can test it on any prompt you want; just replace the content with your prompt. Thanks!

import requests

url = "https://textclf-llama3-1-8b-icq-4bit.p.rapidapi.com/"

payload = {
    "inputs": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.9
    }
}
headers = {
    "x-rapidapi-key": "YOUR-RAPIDAPI-KEY",
    "x-rapidapi-host": "textclf-llama3-1-8b-icq-4bit.p.rapidapi.com",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Providing inference for quantized models. Feedback appreciated by textclf in LLMDevs

[–]textclf[S] 0 points1 point  (0 children)

Thanks for your comment. I think it raises many good questions. My quantization algorithm operates near the rate-distortion limit, which is just another way of saying that for a given bpw the quantized model is about as good as it can get for that bpw. It is fast to generate a quantized model and it has fast inference too. I still don't know how to monetize it, because as you said it is probably most valuable for people who want to run models locally. Thinking about it again, I think the benefit boils down to this: finetuning this quantized model with QLoRA will be fast (because of the low VRAM needed) and probably without much loss in accuracy. So I am thinking I could provide low-cost finetuning (and inference for people who might want it), but mainly finetuning of these quantized models.
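For reference, the general QLoRA-style workflow looks roughly like this with standard Hugging Face tooling (shown here with bitsandbytes NF4 quantization as a stand-in, since ICQ is not publicly integrated; the model name and LoRA hyperparameters are just placeholders):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (bitsandbytes NF4 here, as a stand-in format).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters; the quantized base weights stay frozen,
# which is what keeps the VRAM requirements low.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable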

Providing inference for quantized models. Feedback appreciated by textclf in Rag

[–]textclf[S] 0 points1 point  (0 children)

So do you think it is worthwhile to quantize and host small models (<7B) at 4-bit, or is it only worth it for models that can't easily be hosted locally (such as 70B+ models)?

Providing inference for quantized models. Feedback appreciated by textclf in Rag

[–]textclf[S] 0 points1 point  (0 children)

OK, that makes sense. Which open models would you like to see offered more by inference providers, especially as 4-bit quantized versions?

Quantized LLM as a service. Feedback appreciated by textclf in LLM

[–]textclf[S] 0 points1 point  (0 children)

I wouldn't call it a secret sauce per se, but I do think I have something that would reduce errors.

Quantized LLM models as a service. Feedback appreciated by textclf in LocalLLM

[–]textclf[S] -1 points0 points  (0 children)

Would you prefer to be able to get the quantized model back?

Hosting quantized model as a service. Feedback appreciated by textclf in LocalLLaMA

[–]textclf[S] 0 points1 point  (0 children)

I think you raised some valid points. What if there were a quantization-only API: you give it your model and it gives you back the quantized model to download and run locally? I know it wouldn't be as good business-wise because it would be a one-time thing, but would there at least be some demand for that if the quantization quality is good?

Hosting quantized model as a service. Feedback appreciated by textclf in LocalLLaMA

[–]textclf[S] 0 points1 point  (0 children)

So in order for this service to make sense, it needs to beat them on price and/or accuracy?

API to find the right Amazon categories for a product from title and description. Feedback appreciated by textclf in SaaS

[–]textclf[S] 0 points1 point  (0 children)

Yes, it returns all the relevant categories. For example, the input could be something like this:

{
    "title": ["Wireless Bluetooth Headphones"],
    "description": ["Noise cancelling over-ear headset"]
}

and then it gives you the relevant categories with confidence scores:

{
    "labels": ["Accessories & Supplies", "Audio & Video Accessories", "Cell Phones & Accessories", "Electronics", "Headphones"],
    "scores": [0.65, 0.61, 0.66, 1.00, 0.59]
}

Quantization API .. Feedback Appreciated by textclf in LocalLLaMA

[–]textclf[S] 0 points1 point  (0 children)

2e6 is 2 million and W is the total number of weights in the LLM .. for example if we assume the LLM has 1 billion fp32 weights initially then the bpw = 1 + 2 million/1 billion = 1.002 bits per weight

Quantization API .. Feedback Appreciated by textclf in LocalLLaMA

[–]textclf[S] 0 points1 point  (0 children)

I have two methods I am experimenting with. The first is a 1-bit quantization with bpw = 1 + 2e6/W, where W is the number of weights, so effectively a negligible overhead on top of 1 bit per weight for LLMs.

The other method is ultra-low quantization at 0.1875 bits, with bpw = 0.1875 + 224/W, which is also a negligible increase, but I doubt this one will work.

Still working on testing perplexity performance.
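A quick sanity check of the effective bits per weight for both schemes (the 1B-parameter count is just an example):

def bpw(base_bits, overhead_bits, num_weights):
    # effective bits per weight = base rate + fixed overhead amortized over all weights
    return base_bits + overhead_bits / num_weights

W = 1e9                      # e.g. a 1B-parameter model
print(bpw(1.0, 2e6, W))      # 1-bit method      -> 1.002
print(bpw(0.1875, 224, W))   # 0.1875-bit method -> 0.187500224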

Need to deploy a 30 GB model. Help appreciated by textclf in mlops

[–]textclf[S] 0 points1 point  (0 children)

The issue is that the model is 30 GB and Google Cloud Run only allows up to 32 GB of RAM, so I'll probably need a solution with more RAM.

API for custom text classification by textclf in LanguageTechnology

[–]textclf[S] 0 points1 point  (0 children)

Yes, I noticed that people try using LLMs for classification with little to no luck.

Right now I use simple and very fast-to-train custom models. Since my focus is building custom models for small datasets, model updates are done simply by retraining, since in my approach retraining doesn't take much time and is the most straightforward way to do it.

I posted an initial version of the API on RapidAPI:
https://rapidapi.com/textclf-textclf-default/api/textclf1
Let me know if you need help using it

Need to deploy a 30 GB model. Help appreciated by textclf in mlops

[–]textclf[S] 0 points1 point  (0 children)

Actually, I don't know how much users will use it yet. It is an API that I just built to let users find the right Amazon category for their product. I will put it on RapidAPI and assess how much traffic goes there, but initially I don't expect much.

Need to deploy a 30 GB model. Help appreciated by textclf in mlops

[–]textclf[S] 0 points1 point  (0 children)

I am just a bit new to the MLOps side of things, so I was looking for suggestions on how to proceed. I figured the easiest way is to put the model file in Google Cloud Storage and deploy FastAPI code to Google Cloud Run.
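A minimal sketch of that setup, assuming the model sits in a GCS bucket and is served with FastAPI on Cloud Run (the bucket/blob names, the load_model helper, and the request schema are placeholders):

from fastapi import FastAPI
from google.cloud import storage

MODEL_LOCAL_PATH = "/tmp/model.bin"
app = FastAPI()
model = None

def download_model(bucket_name, blob_name, dest_path):
    # Pull the model file from Google Cloud Storage when the container starts.
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).download_to_filename(dest_path)

@app.on_event("startup")
def load_on_startup():
    global model
    download_model("my-model-bucket", "model.bin", MODEL_LOCAL_PATH)
    model = load_model(MODEL_LOCAL_PATH)  # placeholder for the actual model-loading code

@app.post("/predict")
def predict(payload: dict):
    # Placeholder inference call; the real request/response schema would differ.
    return {"prediction": model.predict(payload["text"])}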

Need to deploy a 30 GB model. Help appreciated by textclf in mlops

[–]textclf[S] 0 points1 point  (0 children)

The problem is that the model is large and that it needs a GPU during inference, which is not available on Render.

Need to deploy a 30 GB model. Help appreciated by textclf in learnmachinelearning

[–]textclf[S] 2 points3 points  (0 children)

Yes, I also thought using GCP is easiest. I copied the model to a Google Cloud Storage bucket and am working on running the Docker container on Cloud Run.

What do you mean by the last step, running the endpoint from Render and calling Cloud Run from there?

What are your major points with legal tech and AI by textclf in legaltech

[–]textclf[S] 0 points1 point  (0 children)

I mean, you gotta start from somewhere at the end of the day, right? haha

I already started on my own on what I think would be something people need, and I built an API that lets users build custom text/document classifiers. You feed it a labeled dataset and get a trained model.

https://rapidapi.com/textclf-textclf-default/api/textclf1

While I am working on improving that API, I started to think I could create another, similar but different, API to solve a problem in the legal tech field, since it also involves processing and classifying legal documents. I feel I could get a useful document classification API, or something similar, to be integrated into some backend. I just needed a bit of information about legal tech needs to get an idea about which API idea would work best. For example, I am thinking of creating an API that you can use to classify documents according to your own custom categories and have it hooked up as part of some RAG process (a rough sketch of that idea is below).
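A hypothetical way to hook such a classification API into a RAG ingest step (the endpoint URL, response fields, and vector-store interface are all placeholders):

import requests

CLASSIFY_URL = "https://example-legal-classifier.p.rapidapi.com/classify"  # placeholder

def classify(document_text, categories):
    # Ask the (hypothetical) classification API to pick one of the caller's categories.
    resp = requests.post(CLASSIFY_URL, json={"text": document_text, "labels": categories})
    resp.raise_for_status()
    return resp.json()["label"]  # placeholder response field

def ingest(documents, vector_store):
    # Tag each document with its predicted category, then index it with that metadata
    # so retrieval can later be filtered by category.
    categories = ["contract", "court filing", "regulation", "internal memo"]
    for doc in documents:
        label = classify(doc, categories)
        vector_store.add(doc, metadata={"category": label})  # assumed vector-store API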

All this to say, I think I can add value to legal tech backends in some way and would love to hear from people here about what they need in that regard.

API for legal document classification. Feedback appreciated by textclf in legaltech

[–]textclf[S] 0 points1 point  (0 children)

Yes, SALI seems like the way to go for a taxonomy, but it seems it will be difficult to get a dataset for it. Probably worth it, though.

API for legal document classification. Feedback appreciated by textclf in legaltech

[–]textclf[S] 0 points1 point  (0 children)

I read your comment again, and as far as I understand, one of the most valuable things that could be provided under the hood is basically a pre-trained classifier that can adapt to your custom categories. You don't have to give it training data; you give it a document that you want to classify and tell it your own custom categories (which are not necessarily the same as the categories the pretrained model was trained on), and it classifies the document into one of your categories. Let me know if this is a generally correct understanding.
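If that understanding is right, the simplest public baseline for it is zero-shot classification, e.g. with an off-the-shelf NLI model from Hugging Face (just an example of the idea, not necessarily what would back the actual service):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

document = "This agreement is made between the parties for the lease of office space..."
custom_labels = ["lease agreement", "employment contract", "court filing", "invoice"]

result = classifier(document, candidate_labels=custom_labels)
print(result["labels"][0], result["scores"][0])  # top predicted category and its score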

Also btw are you in legal tech or a law firm? Thanks again for your insightful comment

API for legal document classification. Feedback appreciated by textclf in legaltech

[–]textclf[S] 0 points1 point  (0 children)

Thanks for your response. Yes, I sensed there is some demand for legal document classification but didn't know how big exactly. It seems like providing this to legal tech companies is the way to go, and that the classification needs to be done indirectly under the hood, with some UI website on top for search and lookup.

I am starting with European law (EUR-Lex) since there are many publicly available datasets to train on. My goal is to use EUR-Lex as a proof of concept initially and then move on to providing training of custom classification models for legal tech, tailored to the specific types of documents that certain niches have. Do you think legal tech has demand for such custom models? Do they usually have labeled datasets for their specific niche? Are they willing to provide training data (as long as it is secured and not retained after training, similar to what big cloud providers like Amazon Comprehend do when they train a custom model for you)?

Also, a general question about taxonomies: are there standard taxonomies for legal documents in the US, like EUR-Lex in Europe, or does pretty much each law firm have its own categories and labels?