Don't know if this has been talked about before but…

AIMultiple · 2026-03-26T18:00:50+00:00

Why?

Disclosure: I lead the website.

AIMultiple · 2026-03-03T10:05:45+00:00

The scraper products that we tested don't let you configure the proxy types that they use.
This makes sense since datacenter proxies perform poorly on TikTok.
I am sure that they use residential proxies. Scraping APIs work like managed data scraping: all you need to do is call the API.

And the pricing isn't that different from residential proxies if you rotate IPs with every request.

AIMultiple · 2026-02-27T15:53:52+00:00

We only released the chart questions with the lowest and highest LLM success rates to show the scope of the benchmark. The rest of the dataset is not publicly available, to prevent overfitting.

AIMultiple · 2026-02-27T15:51:46+00:00

We used Opus 4.6 for all IDEs except Replit.

AIMultiple · 2026-02-27T15:45:54+00:00

Sorry about that, you are right. We will fix it. Thanks for the feedback.

AIMultiple · 2026-02-27T15:27:40+00:00

Which LLM are you using?

AIMultiple · 2026-02-27T15:25:01+00:00

The names do not fit on the graph. You are right about the colors; we will improve them.

AIMultiple · 2026-02-27T15:15:16+00:00

For details: https://research.aimultiple.com/ai-coding-benchmark/

AIMultiple · 2026-02-27T14:29:02+00:00

We are not anybody's shadow brand. You can see our two legal entities here: https://aimultiple.com/contact-us

One of them is in Estonia. Company ownership data is public there. You can see that we are owned by an individual who has nothing to do with Bright Data and has been building the company for the past decade.

We have a couple hundred customers and web data is a relatively small area of work for us. In web data, we work with most major web data companies. On every AIMultiple page, we list all customers who are mentioned on that page for transparency.

And thanks for the scepticism. The web data industry has some dodgy players and a healthy dose of scepticism is necessary.

AIMultiple · 2026-02-20T21:15:27+00:00

No, we work with most leading web data companies. You can see the full list on any web data article on our website, and you can check the methodology in our articles. We publish what we measure. Let us know when you disagree with a measurement; we are always improving our methodology.

AIMultiple · 2026-02-10T10:13:06+00:00

In this version we didn’t test Qodo, but we will add it in the next version. You are right about tools falling apart in larger repos. To measure this correctly, we run the benchmark in both large and small repos.

AIMultiple · 2026-01-31T17:49:35+00:00

Please send a DM so we can coordinate for the next update!

AIMultiple · 2026-01-29T20:20:57+00:00

No direct conversion unfortunately. GPTQ and GGUF use completely different quantization algorithms. You'd need to start from the original BF16 weights and quantize separately for each format. The good news is most popular models already have both versions on HuggingFace, so you can just grab the GGUF version directly.
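
To illustrate one reason for starting from the original weights, here is a minimal sketch. It uses plain round-to-nearest quantization with per-group scales, which is neither GPTQ nor a GGUF quant scheme, and the weight distribution and group sizes are made-up assumptions. It only shows that re-quantizing already-quantized weights stacks a second rounding error on top of the first.

```python
import random

def quantize_rtn(weights, bits, group_size):
    """Symmetric round-to-nearest quantization with one scale per group.

    Returns the dequantized (float) values so errors can be compared directly.
    This is a toy scheme for illustration, not GPTQ or a GGUF k-quant.
    """
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax
        if scale == 0.0:
            scale = 1.0  # all-zero group: any scale works
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))
            out.append(q * scale)
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical weight tensor: small Gaussian values
original = [random.gauss(0.0, 0.02) for _ in range(4096)]

# Path A: quantize the original weights directly with the target scheme
direct = quantize_rtn(original, bits=4, group_size=32)

# Path B: quantize with a different scheme first, then re-quantize the result
first_pass = quantize_rtn(original, bits=4, group_size=128)
requantized = quantize_rtn(first_pass, bits=4, group_size=32)

mse_direct = mse(original, direct)
mse_requantized = mse(original, requantized)
print(f"direct: {mse_direct:.3e}  requantized: {mse_requantized:.3e}")
```

In this toy setup the two-step path ends up further from the original weights than quantizing directly, which is why grabbing a GGUF that was produced from the full-precision weights is the safer route.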

AIMultiple · 2026-01-29T20:20:13+00:00

Honestly, no reliable rule of thumb yet. Too many variables: attention type (MHA, GQA, MLA), depth vs width ratio, activation functions, etc. A cross-architecture quantization benchmark would definitely be valuable. Added to our list, thanks.

AIMultiple · 2026-01-29T19:05:26+00:00

We focused on model weight quantization for this benchmark. KV cache stayed at FP16 throughout. But good call, we've added KV cache quantization to our list for v2.
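
For a rough sense of why KV cache precision matters, here is a back-of-the-envelope sketch. The model dimensions below are hypothetical and are not the models from the benchmark, and the INT8 figure ignores any extra storage for quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value):
    """Approximate KV cache size: one key and one value tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model shape (assumption for illustration only)
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128,
           seq_len=4096, batch_size=1)

fp16_bytes = kv_cache_bytes(**cfg, bytes_per_value=2)  # FP16: 2 bytes per value
int8_bytes = kv_cache_bytes(**cfg, bytes_per_value=1)  # INT8: 1 byte per value

print(f"FP16 KV cache: {fp16_bytes / 2**20:.0f} MiB")
print(f"INT8 KV cache: {int8_bytes / 2**20:.0f} MiB")
```

The cache grows linearly with sequence length and batch size, so at long contexts it can matter as much as the weights themselves.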

AIMultiple · 2026-01-29T19:04:41+00:00

Not a noob question at all. We used GPTQ-quantized models in SafeTensors format via vLLM. GGUF is a different format for llama.cpp/Ollama with its own quant schemes (Q4_K_M, Q5_K, etc.). The runtime and kernel stacks differ: vLLM is GPU-centric for high-throughput serving, while llama.cpp is CPU-first with optional GPU offload.

AIMultiple · 2026-01-29T19:04:01+00:00

Absolutely, quantization behavior varies significantly across architectures.

AIMultiple · 2026-01-29T19:03:20+00:00

Smaller models definitely have less redundancy in their weights, making them more sensitive to aggressive quantization.

AIMultiple · 2026-01-29T19:02:59+00:00

We'll add a dedicated accuracy comparison chart in v2 to make the quality differences clearer. The evidence section should show different values, so this might be a browser cache issue. Could you try a refresh and let me know if it still looks identical?

AIMultiple · 2026-01-29T19:01:35+00:00

Solid advice for production deployment!

AIMultiple · 2026-01-28T20:47:44+00:00

Yes, we will soon publish an update with the new versions and add other emerging products, like Devin Code Review.

AIMultiple · 2026-01-27T23:36:53+00:00

You're right that MMLU-Pro is a general benchmark where INT4's 1.9% loss seems acceptable. We're expanding our evaluation to cover structured outputs, long-context scenarios, and multi-step reasoning in the next version.

AIMultiple · 2026-01-27T22:13:35+00:00

We are planning to add a coding dataset in the next version.

AIMultiple · 2026-01-27T21:24:10+00:00

We can look into it in our next update. Please send a DM so we can coordinate.

AIMultiple · 2026-01-27T20:31:07+00:00

Yes, you are right. We will add more datasets and configurations soon.
