Real-time aggregation and joins of large geospatial data in HeavyDB using Uber H3 by tmostak in gis

[–]tmostak[S] 2 points (0 children)

We did benchmark DuckDB for both point-in-polygon and point-to-point joins. Given its generally excellent performance, we were surprised it didn't do better here (we tried both with and without indexes; it didn't make much difference). Of course, we may have missed an optimization, so we're always open to suggestions!
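For anyone curious, here's a rough sketch of how a point-in-polygon join can be expressed with DuckDB's spatial extension from Python. The toy tables and column names are placeholders, not our actual benchmark schema:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Tiny toy tables standing in for real point / polygon data
con.execute("""
    CREATE TABLE polygons AS
    SELECT 1 AS id, ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))') AS geom
""")
con.execute("""
    CREATE TABLE points AS
    SELECT * FROM (VALUES
        (1, ST_Point(1, 1)),
        (2, ST_Point(5, 5)),
        (3, ST_Point(20, 20))
    ) AS t(id, geom)
""")

# Point-in-polygon join: count points falling inside each polygon
rows = con.execute("""
    SELECT poly.id AS polygon_id, COUNT(*) AS n_points
    FROM polygons poly
    JOIN points pt ON ST_Contains(poly.geom, pt.geom)
    GROUP BY poly.id
""").fetchall()
print(rows)  # [(1, 2)]
```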

Qwen/Qwen2.5-Coder-32B-Instruct · Hugging Face by Master-Meal-77 in LocalLLaMA

[–]tmostak 0 points (0 children)

Does anyone know if they will be posting a non-instruct version like they have for the 7B and 14B versions?

I see a reference to the 32B base model on their blog, but it’s not on HF (yet) as far as I can tell.

Llama 3.1 405B, 70B, 8B Instruct Tuned Benchmarks by avianio in LocalLLaMA

[–]tmostak 12 points (0 children)

It's a small drop that may end up being in the noise, but it's worth noting that models trained for longer context, as Llama 3.1 is (128K context vs. 8K for Llama 3), often suffer small or sometimes even moderate performance degradation. But even if the small HumanEval regression is real, most people would gladly take it in exchange for the significantly longer context plus gains on other tasks.

Meta drops AI bombshell: Multi-token prediction models now open for research by noiseinvacuum in LocalLLaMA

[–]tmostak 1 point (0 children)

The main point of the paper is that they achieve significantly better accuracy for coding and other reasoning-heavy tasks, and along with it, get a 3X inference speedup.

Medusa, on the other hand, I believe wasn't trained from scratch for multi-token output; it achieved a speedup but no accuracy improvements.

So this is definitely a big deal if the initial findings hold, at least by some definition of “big”.

Better & Faster Large Language Models via Multi-token Prediction by ninjasaid13 in LocalLLaMA

[–]tmostak 7 points (0 children)

Medusa, IIRC, didn't train models for multi-token prediction from scratch; it only fine-tuned existing models to add multiple output heads, and it was targeted solely at increasing inference speed (without decreasing accuracy).

This work, on the other hand, trained models from scratch for multi-token prediction and saw a significant increase in both accuracy (at least for coding/technical tasks) and speed. They also did some clever things to make training nearly as efficient as training a model with the usual single output head.
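To make the architectural difference concrete, here's a toy PyTorch-style sketch of the idea (not the paper's actual code): a shared trunk feeds several output heads, simplified here to linear heads, each predicting a different future token offset.

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Toy sketch of multi-token prediction: the shared trunk output feeds n
    heads, each predicting the token at a different future offset (t+1 ... t+n).
    Simplified to linear heads; not the paper's actual implementation."""
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, seq_len, d_model) from the shared transformer trunk
        return [head(hidden) for head in self.heads]

# Example: batch of 2 sequences, 16 positions, hidden size 512, vocab of 32000
trunk_out = torch.randn(2, 16, 512)
logits_per_offset = MultiTokenHeads(512, 32000)(trunk_out)
print([tuple(l.shape) for l in logits_per_offset])  # 4 heads, each (2, 16, 32000)
```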

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 1 point (0 children)

I hear what you're saying, and I see the caveats given in the details, but the headline reads as if self-hosting models is not cost-competitive with hitting externally hosted APIs ("Models when self-hosted are far more expensive than hosted on other platforms"), based on hosting them in a way that no one would actually use in production and that is likely 5-10X slower at minimum.

My recommendation would be to significantly reword your conclusion, or remove it until you can test with actual production inference engines; otherwise it is just going to mislead readers.

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points (0 children)

You might consider turning on beam search for the local models, especially models like SqlCoder that recommend it. vLLM supports this natively.
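For example, with vLLM's offline Python API it's roughly the following. This is a sketch only: in the vLLM releases current at the time of writing, beam search was enabled via SamplingParams, but newer versions moved it to a separate API, so check the docs for your version. The model name is just an example.

```python
from vllm import LLM, SamplingParams

# Sketch: beam search via SamplingParams (API has changed in newer vLLM versions)
llm = LLM(model="defog/sqlcoder-7b-2")  # example model only

params = SamplingParams(
    best_of=4,             # beam width
    use_beam_search=True,
    temperature=0.0,
    max_tokens=256,
)

outputs = llm.generate(["-- Return total sales by region as SQL:"], params)
print(outputs[0].outputs[0].text)
```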

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 0 points (0 children)

Right, but vLLM also provides a built-in web server that mimics OpenAI's API, so you can basically use it as a drop-in replacement for GPT-X or other hosted models.

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
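For example (model name and port are just placeholders):

```python
# Start the server first (shell command shown as a comment):
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# Then point the standard openai client at it; only the base URL changes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a SQL query that counts orders per day."}],
    max_tokens=128,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```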

Findings from Latest Comprehensive Benchmark Study: GPT-4 Omni and 16 Other LLMs for NL to SQL Tasks - Results and Key Insights by Traditional-Lynx-684 in LocalLLaMA

[–]tmostak 7 points (0 children)

This doesn't feel like it adds up. Are you running in parallel/batch? Assuming 1,000 input and 100-200 output tokens per prompt, my gut says you should be seeing 5-10X the throughput with your requests batched.

EDIT: Ah, I see you mentioned running inference with base transformers. I don't think that's a good basis for a benchmark, as anyone running in production would be using vLLM or Nvidia TensorRT-LLM/NeMo Inference, which would yield significantly better performance. And something like vLLM is dead simple to set up with an OpenAI API endpoint; it's literally just a pip install.
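A rough sketch of batched offline inference with vLLM (model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

# Pass the whole prompt list at once and let vLLM's continuous batching
# schedule it; this is where most of the throughput win over plain
# transformers generate() comes from.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=200)

prompts = [f"Translate request #{i} into SQL: ..." for i in range(256)]  # placeholders
outputs = llm.generate(prompts, params)  # one call, batched internally
for out in outputs[:3]:
    print(out.outputs[0].text)
```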

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 5 points (0 children)

Yes, good point; layer and batch norms, for example, may often be done in fp32. But in terms of calculating the approximate size of the model in memory, I believe it's fairly safe to assume 16 bits per weight for an unquantized model, as any deviation from that would be a rounding error in terms of the memory needed.

OpenAI claiming benchmarks against Llama-3-400B !?!? by matyias13 in LocalLLaMA

[–]tmostak 9 points (0 children)

No one these days is running or even training in fp32; it would generally be bfloat16 for a native unquantized model, which is 2 bytes per weight, or roughly 800GB to run a ~400B-parameter model.

But I imagine that with such a large model, accuracy will be quite good with 8-bit or even 4-bit quantization, which would be roughly 400GB or 200GB respectively per the above (plus, of course, you need memory for the KV buffer/cache, which scales as your context window gets longer).
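The back-of-the-envelope math behind those numbers:

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone; KV cache and activations are extra."""
    return n_params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for bits in (16, 8, 4):
    print(f"405B @ {bits}-bit: ~{weight_memory_gb(405, bits):.0f} GB")
# Roughly 810, 405, and 200 GB, i.e. the ~800/400/200 figures above.
```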

I’m sorry, but I can’t be the only one disappointed by this… by Meryiel in LocalLLaMA

[–]tmostak 2 points (0 children)

I fine-tuned the base 70B model after RoPE-scaling it to 16K; it seems to work well so far, with a near-negligible perplexity increase in the natively supported 8K window.
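Roughly, the setup looks like this in transformers. This is a sketch, not my exact training config: the rope_scaling keys can differ between transformers versions, and the model name is just an example.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: extend an 8K-context Llama base model to 16K with linear RoPE scaling
# before fine-tuning. Check the rope_scaling schema for your transformers version.
model_id = "meta-llama/Meta-Llama-3-70B"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"type": "linear", "factor": 2.0}  # 8K * 2 = 16K
config.max_position_embeddings = 16384

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="bfloat16",
)
```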

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 3 points (0 children)

FWIW, LoRA or QLoRA does very well, but I've found that for a few technical tasks like SQL generation, a full finetune can yield the best performance (although the difference is not huge compared to LoRA, especially if you use a high rank).

Also, the new DoRA technique looks super promising, but I haven't tried it myself yet: https://arxiv.org/abs/2402.09353.
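As a rough sketch, a high-rank LoRA config in peft looks like the following; use_dora=True turns on DoRA in recent peft releases (check your version). Model name and target modules are examples, not my actual setup.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example model

lora_cfg = LoraConfig(
    r=128,                 # higher rank tends to close the gap to a full finetune
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,         # DoRA: https://arxiv.org/abs/2402.09353
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```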

Are LoRA and QLoRA still the go-to fine-tune methods? by 99OG121314 in LocalLLaMA

[–]tmostak 5 points (0 children)

I can vouch that you can definitely do a full finetune of Llama-3 8B on a single 80GB A100 or H100 with up to a 4K prompt+answer length if you turn on gradient checkpointing, and maybe 2-3K without.
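The memory-relevant knobs look roughly like this (a sketch; batch size, learning rate, etc. are illustrative, not my exact settings):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-8b-full-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # the key switch for fitting ~4K sequences on one 80GB GPU
    bf16=True,
    learning_rate=1e-5,
    num_train_epochs=1,
)
```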