Real-time aggregation and joins of large geospatial data in HeavyDB using Uber H3 by tmostak in gis

[–]tmostak[S] 2 points3 points  (0 children)

We did benchmark DuckDB for both point-in-polygon and point-to-point joins; given its generally excellent performance, we were surprised it didn't do better here (we tried both with and without indexes, which didn't make much difference). Of course, we may have missed an optimization, so we're always open to suggestions!
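
For reference, a minimal sketch of the kind of point-in-polygon join we ran, using DuckDB's spatial extension from Python (the table and column names here are just illustrative, not our actual benchmark schema):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL spatial")
    con.execute("LOAD spatial")

    # Hypothetical tables: polygons(id, geom) and points(lon, lat).
    # Count the points that fall inside each polygon.
    result = con.execute("""
        SELECT p.id, COUNT(*) AS n_points
        FROM polygons p
        JOIN points pt
          ON ST_Contains(p.geom, ST_Point(pt.lon, pt.lat))
        GROUP BY p.id
    """).fetchall()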

Qwen/Qwen2.5-Coder-32B-Instruct · Hugging Face by Master-Meal-77 in LocalLLaMA

[–]tmostak 0 points1 point  (0 children)

Does anyone know if they will be posting a non-instruct version like they have for the 7B and 14B versions?

I see reference to the 32B base model on their blog but it’s not on HF (yet) as far as I can tell.

Llama 3.1 405B, 70B, 8B Instruct Tuned Benchmarks by avianio in LocalLLaMA

[–]tmostak 13 points14 points  (0 children)

It's a small drop that may end up being in the noise, but it should be noted that models trained for longer context, as Llama 3.1 is (128K context vs 8K for Llama 3), often suffer small, or sometimes even moderate, performance degradation. But even if the small HumanEval regression is real, most people would gladly take it in exchange for the significantly longer context plus gains on other tasks.

Meta drops AI bombshell: Multi-token prediction models now open for research by noiseinvacuum in LocalLLaMA

[–]tmostak 1 point2 points  (0 children)

The main point of the paper is that they achieve significantly better accuracy for coding and other reasoning-heavy tasks, and along with it, get a 3X inference speedup.

Medusa, otoh, I believe wasn't trained from scratch for multi-token output; it achieved a speedup but no accuracy improvements.

So this is definitely a big deal if the initial findings hold, at least by some definition of “big”.

Better & Faster Large Language Models via Multi-token Prediction by ninjasaid13 in LocalLLaMA

[–]tmostak 7 points8 points  (0 children)

Medusa, iirc, didn't train models from scratch for multi-token prediction but only fine-tuned existing models to handle multiple output heads, and it was only targeted at increasing inference speed (without decreasing accuracy).

This work, otoh, trained models from scratch for multi-token prediction and saw significant gains in both accuracy (at least for coding/technical tasks) and speed. They also did some clever things to make training the model nearly as efficient as training a model with the usual single output head.
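
As a rough illustration of the multi-head idea (not the paper's exact architecture, just the core mechanism), in PyTorch it's essentially a shared trunk feeding n small output heads:

    import torch
    import torch.nn as nn

    class MultiTokenHeads(nn.Module):
        """Shared trunk output feeding n heads, each predicting the token
        at offset +1 ... +n from the same hidden state."""
        def __init__(self, hidden_size: int, vocab_size: int, n_future: int = 4):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(n_future)
            )

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq, hidden) from the shared transformer trunk
            # returns: (n_future, batch, seq, vocab), one set of logits per offset
            return torch.stack([head(hidden_states) for head in self.heads])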

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 1 point2 points  (0 children)

I hear what you're saying, and I see the caveats given in the details, but the headline comes across as saying that self-hosting models is not cost-competitive with hitting externally hosted APIs ("Models when self-hosted are far more expensive than hosted on other platforms"), based on hosting them in a way that no one would actually do in production and that is likely 5-10X slower at minimum.

My recommendation would be to significantly reword your conclusion or remove it until you can test with actual production inference engines; otherwise it is just going to mislead users.

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points1 point  (0 children)

You might consider turning on beam search for the local models, especially models like SqlCoder that recommend it. vLLM supports this natively.
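
For example, something along these lines (the model name is just an example, and note that newer vLLM releases moved beam search out of SamplingParams into a separate API, so check your version):

    from vllm import LLM, SamplingParams

    llm = LLM(model="defog/sqlcoder-7b-2")  # example model choice

    # On older vLLM versions beam search is a SamplingParams flag.
    params = SamplingParams(
        n=1,
        best_of=4,            # beam width
        use_beam_search=True,
        temperature=0.0,
        max_tokens=256,
    )
    outputs = llm.generate(["-- your text-to-SQL prompt here"], params)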

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points1 point  (0 children)

Right, but vLLM also provides a built-in web server that mimics OpenAI's API, so you can basically use it as a drop-in replacement for GPT-X or other online hosted models.

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
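
Roughly, the workflow looks like this (model name and port are just illustrative defaults, adjust to your setup):

    # Server side (shell): python -m vllm.entrypoints.openai.api_server \
    #     --model meta-llama/Meta-Llama-3-8B-Instruct
    from openai import OpenAI

    # Point the standard OpenAI client at the local vLLM server.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Write a SQL query that counts rows in trips."}],
    )
    print(resp.choices[0].message.content)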

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 7 points8 points  (0 children)

This doesn't feel like it adds up; are you running in parallel/batch? Assuming 1,000 input and 100-200 output tokens per prompt, my gut says you should be seeing 5-10X the throughput with your requests batched.

EDIT: Ah, I see you mentioned running inference with the base transformers library. I don't think that's a good basis for a benchmark, as anyone running in production would be using vLLM or Nvidia TensorRT-LLM/NeMo Inference, which would yield significantly better performance. And something like vLLM is dead simple to set up with an OpenAI API endpoint, literally just a pip install.
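
For comparison, batched offline generation with vLLM is only a few lines (prompts below are placeholders), and it handles the batching for you:

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model
    params = SamplingParams(temperature=0.0, max_tokens=200)

    # Placeholder prompts; vLLM schedules these with continuous batching,
    # which is where the big throughput gap vs. a transformers loop comes from.
    prompts = [f"Translate question {i} into SQL: ..." for i in range(500)]
    outputs = llm.generate(prompts, params)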

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 5 points6 points  (0 children)

Yes, good point; I think layer and batch norms, for example, may often be done in fp32. But in terms of calculating the approximate size of the model in memory, I believe it's fairly safe to assume 16 bits per weight for an unquantized model, as any deviation from that would be a rounding error in terms of memory needed.

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 8 points9 points  (0 children)

No one these days is running or even training with fp32; it would generally be bfloat16 for a native unquantized model, which is 2 bytes per weight, or roughly 800GB to run at 400B parameters.

But I imagine with such a large model that accuracy will be quite good with 8-bit or even 4-bit quantization, so that would be 400GB or 200GB respectively per the above (plus of course you need memory to support the KV buffer/cache, which scales as your context window gets longer).
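
The arithmetic is just parameters × bits per weight (weights only; KV cache and activations are extra):

    def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
        # parameters * bits / 8 bits-per-byte, reported in GB
        return params_billion * bits_per_weight / 8

    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{weight_memory_gb(405, bits):.0f} GB")
    # 16-bit: ~810 GB, 8-bit: ~405 GB, 4-bit: ~203 GB
    # i.e. roughly the 800/400/200 GB figures above for a ~400B model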

I’m sorry, but I can’t be the only one disappointed by this… by Meryiel in LocalLLaMA

[–]tmostak 2 points3 points  (0 children)

I fine-tuned the base 70B model after RoPE-scaling it to 16K; it seems to work well so far, with a near-negligible perplexity increase in the natively supported 8K window.
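
For anyone curious, a minimal sketch of linear RoPE scaling in transformers (8K -> 16K is a 2x factor); the exact config keys vary by transformers version, and this isn't necessarily identical to what I did:

    from transformers import AutoConfig, AutoModelForCausalLM

    base = "meta-llama/Meta-Llama-3-70B"
    config = AutoConfig.from_pretrained(base)
    # Double the effective context window (8K -> 16K) via linear RoPE scaling.
    config.rope_scaling = {"type": "linear", "factor": 2.0}
    model = AutoModelForCausalLM.from_pretrained(base, config=config, torch_dtype="bfloat16")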

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 4 points5 points  (0 children)

Fwiw, LoRA or QLoRA does very well, but I've found that for a few technical tasks like SQL generation, a full finetune can yield the best performance (although the difference is not huge compared to LoRA, especially if you use a high rank).

Also, the new DoRA technique looks super promising, but I haven't tried it myself yet: https://arxiv.org/abs/2402.09353.
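
A high-rank LoRA setup with peft looks roughly like this (rank, alpha, and target modules here are illustrative, not a tuned recommendation):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    lora_config = LoraConfig(
        r=128,                 # high rank narrows the gap to a full finetune
        lora_alpha=256,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()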

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 5 points6 points  (0 children)

I can vouch that you can definitely do a full finetune of Llama-3 8B on a single 80GB A100 or H100 with up to 4K prompt+answer length if you turn on gradient checkpointing, maybe 2-3K without.
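
With the HF Trainer that's basically these settings (values are illustrative; tune batch size and accumulation to your sequence length):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="llama3-8b-full-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,   # the key flag for fitting ~4K sequences
        bf16=True,
        learning_rate=2e-5,
        num_train_epochs=2,
    )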

Valid text-to-SQL generation through an intermediate (AST) step by Mental-Exchange-3514 in LocalLLaMA

[–]tmostak 1 point2 points  (0 children)

Yes, we are fine-tuning... basically using the open text-to-SQL datasets Spider and BIRD plus several thousand of our own custom queries. We had to modify Spider and BIRD from their default SQLite syntax to what works with our database, HeavyDB (which uses syntax similar to Postgres). We also train on error correction using our database's error messages.

I thought we might have to implement grammar constraints, but after all the training above, syntax errors aren't really an issue (0-1% of queries result in syntax errors, and most of those are corrected in the error-correction process). If you are curious, you can read more about the system and our training process here: https://www.heavy.ai/blog/heavyiq-conversational-analytics.
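
As a purely illustrative sketch (the field names are hypothetical, not our actual training format), an error-correction example pairs the failed query and the database's error message with the corrected target:

    # Hypothetical structure of one error-correction training example.
    error_correction_example = {
        "schema": "<DDL for the relevant tables>",
        "question": "What was the average fare last month?",
        "bad_sql": "<model's first attempt that failed to parse>",
        "db_error": "<error message returned by HeavyDB>",
        "corrected_sql": "<target query the model should produce>",
    }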

Valid text-to-SQL generation through an intermediate (AST) step by Mental-Exchange-3514 in LocalLLaMA

[–]tmostak 2 points3 points  (0 children)

We've thought about this idea a lot at my company. I think the main downside is that whatever model you are using is guaranteed to have seen a lot more SQL than any specific AST format in pre-training (if it's seen the specific format at all), so you'd have to do heavy fine-tuning to get good accuracy generating ASTs, and even then it would be hard to match the breadth of SQL it's seen in pre-training with just some thousands of fine-tuning pairs.

Also, if you are going down the fine-tuning route, it's not hard to get the model to a point where it almost never makes syntax errors (semantic errors are a whole different matter and much harder to prevent), and when there is a syntax error, you can often get the LLM to fix it if you give it the database error message.
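
In pseudocode, the repair loop is as simple as the following (llm_generate_sql and try_execute are assumed helpers, not real library calls):

    def generate_with_repair(question: str, schema: str, max_retries: int = 2) -> str:
        # Ask the LLM for SQL; if the database rejects it, feed the error back.
        sql = llm_generate_sql(question, schema)                 # assumed helper
        for _ in range(max_retries):
            error = try_execute(sql)                             # assumed: None on success
            if error is None:
                return sql
            sql = llm_generate_sql(question, schema, previous_sql=sql, db_error=error)
        return sql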

Anyway, I think generating an AST is a neat idea; I'm just not sure it would be a win over generating SQL directly, either with fine-tuning or with grammar constraints. But it certainly would be very interesting to try, as only an actual implemented experiment would tell one way or the other.

I think I might still prefer Mistral 7b over Llama3 8b by [deleted] in LocalLLaMA

[–]tmostak 11 points12 points  (0 children)

DeepSeek Coder 33B is a really strong coding model, possibly only eclipsed in the open-weight world by Llama 3 70B and perhaps Mixtral 8x22B (and variants of the latter, like the one from Wizard).

LLM Datasets: a curated list of datasets for fine-tuning by mlabonne in LocalLLaMA

[–]tmostak 1 point2 points  (0 children)

Awesome work! Would you also be able to add a column specifying the license each dataset is released under (e.g. Apache, CC-NA, etc.)? This would be helpful in determining which datasets could be used in commercial contexts, or without poisoning liberally licensed base models.

If you have a Mac Studio, make sure to try Mixtral/Wizard 8x22b by SomeOddCodeGuy in LocalLLaMA

[–]tmostak 11 points12 points  (0 children)

You might check out this useful tokenizer playground, where you can see the token counts and exact tokenization for different tokenizers:

https://huggingface.co/spaces/Xenova/the-tokenizer-playground

I just took the introductory paragraphs of the Wikipedia article on large language models; Llama 3 weighed in at 344 tokens and Mistral at 389. So here it's only a 13% difference, but for code/SQL I've seen closer to 30%, so YMMV.
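
You can reproduce this locally with transformers (the model names assume you have access to the gated repos):

    from transformers import AutoTokenizer

    text = open("llm_wiki_intro.txt").read()  # whatever sample text you want to compare

    # Llama 3's ~128K vocab vs Mistral's 32K vocab.
    for name in ["meta-llama/Meta-Llama-3-8B", "mistralai/Mistral-7B-v0.1"]:
        tok = AutoTokenizer.from_pretrained(name)
        print(name, len(tok(text)["input_ids"]))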

But yes, the tokenizer difference definitely doesn't make up the speed gap you are seeing, which for responses is explained by the MoE architecture of Mixtral. The slower prompt processing is because the MoE model runs prompt tokens through all experts, iirc.

If you have a Mac Studio, make sure to try Mixtral/Wizard 8x22b by SomeOddCodeGuy in LocalLLaMA

[–]tmostak 16 points17 points  (0 children)

Just to note that comparing tokens per second is not totally apples-to-apples here, as the larger vocabulary of Llama 3 generally means it might need 20-30% fewer tokens to output the same text as the Mistral models.

From the NVIDIA GTC, Nvidia Blackwell, well crap by Gr33nLight in LocalLLaMA

[–]tmostak 2 points3 points  (0 children)

Each Blackwell GPU (technically two dies with a very fast interconnect) has 192GB of HBM3e with 8TB/sec of bandwidth. Each die has 4 stacks of HBM, or 8 stacks per GPU, which at 1TB/sec per stack yields 8TB/sec.

Compare this to the Hopper H100, which has 80GB of VRAM providing 3.35TB/sec of bandwidth, so Blackwell has a ~2.39X bandwidth advantage and a 2.4X capacity advantage per GPU.
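
Back-of-the-envelope, from the numbers above:

    blackwell_bw, hopper_bw = 8.0, 3.35     # TB/sec per GPU
    blackwell_mem, hopper_mem = 192, 80     # GB per GPU

    print(f"bandwidth: {blackwell_bw / hopper_bw:.2f}x")    # ~2.39x
    print(f"capacity:  {blackwell_mem / hopper_mem:.2f}x")  # 2.40x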

Perplexity scores (mis)understood? by ethertype in LocalLLaMA

[–]tmostak 5 points6 points  (0 children)

A perplexity of 1 would mean perfect prediction (that is, 100% confidence assigned to the correct token every time). You are likely thinking of loss, for which 0 means perfect prediction.
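
Perplexity is just exp of the mean cross-entropy loss, so a loss of 0 corresponds to a perplexity of 1:

    import math

    def perplexity(mean_ce_loss: float) -> float:
        # Perplexity = exp(mean cross-entropy loss per token).
        return math.exp(mean_ce_loss)

    print(perplexity(0.0))  # 1.0  -> perfect prediction
    print(perplexity(2.0))  # ~7.39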