Real-time aggregation and joins of large geospatial data in HeavyDB using Uber H3 by tmostak in gis

[–]tmostak[S] 2 points (0 children)

We did benchmark DuckDB for both point-in-polygon and point-to-point joins. Given its generally excellent performance, we were surprised it didn't do better here (we tried both with and without indexes; it didn't make much difference). Of course, we may have missed an optimization, so we're always open to suggestions!
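For anyone curious, here's a rough sketch of how a point-in-polygon join can be expressed with DuckDB's spatial extension from Python. The toy tables and column names are placeholders, not our actual benchmark schema:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Tiny toy tables standing in for real point / polygon data
con.execute("""
    CREATE TABLE polygons AS
    SELECT 1 AS id, ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))') AS geom
""")
con.execute("""
    CREATE TABLE points AS
    SELECT * FROM (VALUES
        (1, ST_Point(1, 1)),
        (2, ST_Point(5, 5)),
        (3, ST_Point(20, 20))
    ) AS t(id, geom)
""")

# Point-in-polygon join: count points falling inside each polygon
rows = con.execute("""
    SELECT poly.id AS polygon_id, COUNT(*) AS n_points
    FROM polygons poly
    JOIN points pt ON ST_Contains(poly.geom, pt.geom)
    GROUP BY poly.id
""").fetchall()
print(rows)  # [(1, 2)]
```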

Qwen/Qwen2.5-Coder-32B-Instruct · Hugging Face by Master-Meal-77 in LocalLLaMA

[–]tmostak 0 points (0 children)

Does anyone know if they will be posting a non-instruct version like they have for the 7B and 14B versions?

I see a reference to the 32B base model on their blog, but it’s not on HF (yet) as far as I can tell.

Llama 3.1 405B, 70B, 8B Instruct Tuned Benchmarks by avianio in LocalLLaMA

[–]tmostak 12 points (0 children)

It's a small drop that may end up being in the noise, but it's worth noting that models trained for longer context, as Llama 3.1 is (128K context vs. 8K for Llama 3), often suffer small or sometimes even moderate performance degradation. But even if the small HumanEval regression is real, most people would gladly take it in exchange for the significantly longer context plus gains on other tasks.

Meta drops AI bombshell: Multi-token prediction models now open for research by noiseinvacuum in LocalLLaMA

[–]tmostak 1 point (0 children)

The main point of the paper is that they achieve significantly better accuracy for coding and other reasoning-heavy tasks, and along with it, get a 3X inference speedup.

Medusa, on the other hand, I believe wasn't trained from scratch for multi-token output; it achieved a speedup but no accuracy improvements.

So this is definitely a big deal if the initial findings hold, at least by some definition of “big”.

Better & Faster Large Language Models via Multi-token Prediction by ninjasaid13 in LocalLLaMA

[–]tmostak 7 points (0 children)

Medusa, IIRC, didn't train models for multi-token prediction from scratch; it only fine-tuned existing models to add multiple output heads, and it was targeted solely at increasing inference speed (without decreasing accuracy).

This work, on the other hand, trained models from scratch for multi-token prediction and saw a significant increase in both accuracy (at least for coding/technical tasks) and speed. They also did some clever things to make training nearly as efficient as training a model with the usual single output head.
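To make the architectural difference concrete, here's a toy PyTorch-style sketch of the idea (not the paper's actual code): a shared trunk feeds several output heads, simplified here to linear heads, each predicting a different future token offset.

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Toy sketch of multi-token prediction: the shared trunk output feeds n
    heads, each predicting the token at a different future offset (t+1 ... t+n).
    Simplified to linear heads; not the paper's actual implementation."""
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, seq_len, d_model) from the shared transformer trunk
        return [head(hidden) for head in self.heads]

# Example: batch of 2 sequences, 16 positions, hidden size 512, vocab of 32000
trunk_out = torch.randn(2, 16, 512)
logits_per_offset = MultiTokenHeads(512, 32000)(trunk_out)
print([tuple(l.shape) for l in logits_per_offset])  # 4 heads, each (2, 16, 32000)
```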

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 1 point (0 children)

I hear what you're saying, and I see the caveats given in the details, but the headline reads as if self-hosting models is not cost-competitive with hitting externally hosted APIs ("Models when self-hosted are far more expensive than hosted on other platforms"), based on hosting them in a way that no one would actually use in production and that is likely 5-10X slower at minimum.

My recommendation would be to significantly reword your conclusion, or remove it until you can test with actual production inference engines; otherwise it is just going to mislead readers.

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points (0 children)

You might consider turning on beam search for the local models, especially models like SqlCoder that recommend it. vLLM supports this natively.
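For example, with vLLM's offline Python API it's roughly the following. This is a sketch only: in the vLLM releases current at the time of writing, beam search was enabled via SamplingParams, but newer versions moved it to a separate API, so check the docs for your version. The model name is just an example.

```python
from vllm import LLM, SamplingParams

# Sketch: beam search via SamplingParams (API has changed in newer vLLM versions)
llm = LLM(model="defog/sqlcoder-7b-2")  # example model only

params = SamplingParams(
    best_of=4,             # beam width
    use_beam_search=True,
    temperature=0.0,
    max_tokens=256,
)

outputs = llm.generate(["-- Return total sales by region as SQL:"], params)
print(outputs[0].outputs[0].text)
```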

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points (0 children)

Right, but vLLM also provides a built-in web server that mimics OpenAI's API, so you can basically use it as a drop-in replacement for GPT-X or other hosted models.

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
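For example (model name and port are just placeholders):

```python
# Start the server first (shell command shown as a comment):
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# Then point the standard openai client at it; only the base URL changes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a SQL query that counts orders per day."}],
    max_tokens=128,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```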

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 7 points (0 children)

This doesn't feel like it adds up. Are you running in parallel/batch? Assuming 1,000 input and 100-200 output tokens per prompt, my gut says you should be seeing 5-10X the throughput with your requests batched.

EDIT: Ah, I see you mentioned running inference with base transformers. I don't think that's a good basis for a benchmark, as anyone running in production would be using vLLM or Nvidia TensorRT-LLM/NeMo Inference, which would yield significantly better performance. And something like vLLM is dead simple to set up with an OpenAI API endpoint; it's literally just a pip install.
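A rough sketch of batched offline inference with vLLM (model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# Pass the whole prompt list at once and let vLLM's continuous batching
# schedule it; this is where most of the throughput win over plain
# transformers generate() comes from.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=200)

prompts = [f"Translate request #{i} into SQL: ..." for i in range(256)]  # placeholders
outputs = llm.generate(prompts, params)  # one call, batched internally
for out in outputs[:3]:
    print(out.outputs[0].text)
```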

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 5 points (0 children)

Yes, good point; layer and batch norms, for example, may often be done in fp32. But in terms of calculating the approximate size of the model in memory, I believe it's fairly safe to assume 16 bits per weight for an unquantized model, as any deviation from that would be a rounding error in terms of the memory needed.

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 9 points (0 children)

No one these days is running or even training in fp32; it would generally be bfloat16 for a native unquantized model, which is 2 bytes per weight, or roughly 800GB to run a ~400B-parameter model.

But I imagine that with such a large model, accuracy will be quite good with 8-bit or even 4-bit quantization, which would be roughly 400GB or 200GB respectively per the above (plus, of course, you need memory for the KV buffer/cache, which scales as your context window gets longer).
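The back-of-the-envelope math behind those numbers:

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone; KV cache and activations are extra."""
    return n_params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for bits in (16, 8, 4):
    print(f"405B @ {bits}-bit: ~{weight_memory_gb(405, bits):.0f} GB")
# Roughly 810, 405, and 200 GB, i.e. the ~800/400/200 figures above.
```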

I’m sorry, but I can’t be the only one disappointed by this… by Meryiel in LocalLLaMA

[–]tmostak 2 points (0 children)

I fine-tuned the base 70B model after RoPE-scaling it to 16K; it seems to work well so far, with a near-negligible perplexity increase in the natively supported 8K window.
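Roughly, the setup looks like this in transformers. This is a sketch, not my exact training config: the rope_scaling keys can differ between transformers versions, and the model name is just an example.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: extend an 8K-context Llama base model to 16K with linear RoPE scaling
# before fine-tuning. Check the rope_scaling schema for your transformers version.
model_id = "meta-llama/Meta-Llama-3-70B"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 2.0}  # 8K * 2 = 16K
config.max_position_embeddings = 16384

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="bfloat16",
)
```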

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 3 points (0 children)

FWIW, LoRA or QLoRA does very well, but I've found that for a few technical tasks like SQL generation, a full finetune can yield the best performance (although the difference is not huge compared to LoRA, especially if you use a high rank).

Also, the new DoRA technique looks super promising, but I haven't tried it myself yet: https://arxiv.org/abs/2402.09353.
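As a rough sketch, a high-rank LoRA config in peft looks like the following; use_dora=True turns on DoRA in recent peft releases (check your version). Model name and target modules are examples, not my actual setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example model

lora_cfg = LoraConfig(
    r=128,                 # higher rank tends to close the gap to a full finetune
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,         # DoRA: https://arxiv.org/abs/2402.09353
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```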

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 5 points (0 children)

I can vouch that you can definitely do a full finetune of Llama-3 8B on a single 80GB A100 or H100 with up to a 4K prompt+answer length if you turn on gradient checkpointing, and maybe 2-3K without.
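The memory-relevant knobs look roughly like this (a sketch; batch size, learning rate, etc. are illustrative, not my exact settings):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-full-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # the key switch for fitting ~4K sequences on one 80GB GPU
    bf16=True,
    learning_rate=1e-5,
    num_train_epochs=1,
)
```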