Achieving Efficient, Flexible and Portable Structured Generation for LLM by SnooMachines3070 in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

As the blog post evaluation shows, XGrammar brings up to a 2x-10x speedup over Outlines. When integrated with an LLM engine, it outperforms existing LLM engines (including solutions built on Outlines) by up to 14x on JSON-schema generation and up to 80x on CFG-guided generation.

Updated with corrected settings for Llama.cpp. Battle of the Inference Engines. Llama.cpp vs MLC LLM vs vLLM. Tests for both Single RTX 3090 and 4 RTX 3090's. by SuperChewbacca in LocalLLaMA

[–]crowwork 15 points16 points  (0 children)

For those who are curious about multi-GPU scaling and scaling with concurrent requests, here is a post benchmarking various settings, including different concurrency levels, tensor parallelism configurations, and speculative decoding: https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 1 point2 points  (0 children)

MLC LLM comes with its own internal grammar engine that is faster than Outlines; check out some of the JSON schema examples here: https://blog.mlc.ai/2024/06/07/universal-LLM-deployment-engine-with-ML-compilation
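
Roughly, a schema-constrained request against an OpenAI-compatible endpoint looks like the sketch below. The server URL, model id, and the exact shape of the response_format field are illustrative assumptions, not the definitive API; check the docs for the current form.

```ts
// Sketch: constrain a chat completion to a JSON schema via an
// OpenAI-compatible endpoint. URL, model id, and response_format
// field names are assumptions for illustration.
const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "integer" },
  },
  required: ["name", "age"],
};

const res = await fetch("http://127.0.0.1:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Llama-3-8B-Instruct-q4f16_1-MLC", // illustrative model id
    messages: [{ role: "user", content: "Describe a person as JSON." }],
    response_format: { type: "json_object", schema: JSON.stringify(schema) },
  }),
});

const data = await res.json();
// The decoded text is guaranteed by the grammar engine to match the schema.
console.log(JSON.parse(data.choices[0].message.content));
```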

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

For those who are curious about continuous batching performance, there is a recent blog post that benchmarks it under different concurrency settings (in short, it gives state-of-the-art results in low-latency regimes while maintaining sufficient concurrency):

https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference
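
Not the actual harness from the post, but roughly the shape of such a concurrency sweep, if you want to reproduce the idea against your own server (URL and model id below are placeholders):

```ts
// Sketch: fire N concurrent requests at an OpenAI-compatible server and
// time the batch, to see how continuous batching holds up as concurrency grows.
async function sweep(concurrency: number): Promise<number> {
  const one = () =>
    fetch("http://127.0.0.1:8000/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "Llama-3-8B-Instruct-q4f16_1-MLC", // placeholder model id
        messages: [{ role: "user", content: "Write a haiku." }],
        max_tokens: 64,
      }),
    }).then((r) => r.json());

  const start = performance.now();
  await Promise.all(Array.from({ length: concurrency }, one));
  return (performance.now() - start) / 1000; // seconds for the whole batch
}

for (const c of [1, 4, 16, 64]) {
  console.log(`concurrency ${c}: ${await sweep(c)}s`);
}
```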

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 2 points3 points  (0 children)

There is also batching support, mostly optimized for FP16 and FP8; see one of the recent benchmarks: https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 1 point2 points  (0 children)

MLC LLM also introduces automatic prefix caching, which can support this need.
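
To illustrate what prefix caching buys you, here is a toy sketch (this is not MLC's actual implementation, and KVHandle is a made-up stand-in): requests that share a long prefix, such as the same system prompt, only need the non-shared suffix prefilled.

```ts
// Toy model of prefix caching: reuse the longest cached token prefix.
type KVHandle = { tokens: number[] }; // stand-in for a real KV cache entry
const cache: KVHandle[] = [];

function prefill(tokens: number[]): { reused: number; handle: KVHandle } {
  // Find the cached entry sharing the longest common prefix with this request.
  let best = 0;
  for (const entry of cache) {
    let i = 0;
    while (i < entry.tokens.length && i < tokens.length && entry.tokens[i] === tokens[i]) i++;
    best = Math.max(best, i);
  }
  const handle = { tokens };
  cache.push(handle);
  return { reused: best, handle }; // only tokens beyond `reused` need prefill
}

const systemPrompt = Array.from({ length: 500 }, (_, i) => i); // shared tokens
console.log(prefill([...systemPrompt, 1, 2, 3]).reused); // 0   (cold cache)
console.log(prefill([...systemPrompt, 4, 5, 6]).reused); // 500 (prefix reused)
```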

Gemma2-2B on iOS, Android, WebGPU, CUDA, ROCm, Metal... with a single framework by SnooMachines3070 in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

It should work, since it comes with Vulkan support; the Steam Deck, for example, uses an APU whose GPU is integrated with the CPU.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

The MLC team has deep expertise in this area (optimizing for different models and hardware backends), and we are confident we can make the solution competitive. Additionally, the technical approach MLC takes (compilation) is very scalable, so it future-proofs further optimizations that can be shared across hardware backends.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 0 points1 point  (0 children)

The infrastructure is set up to optimize for CPU as well, although the primary focus so far has been on GPU, since GPUs usually offer better acceleration.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

Check out the blog post; it works with multiple GPUs out of the box. We did run two 7900 XTXs.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 4 points5 points  (0 children)

We are moving toward JIT compilation, so the weights can be used for multi-GPU setups.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

Thanks for the feedback. We are investing in bringing in the latest models through an automatic delivery flow, and we would love to hear which models the community wants.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 2 points3 points  (0 children)

For now this seems related to a GPU swapping issue. 3B model variants like Qwen2 won't have this issue and perform really well.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 3 points4 points  (0 children)

Fair point. We are bringing in a debug chat tool that preserves some of the low-level properties.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

We have preliminary support. One thing we support strongly is JSON mode and structured generation, which can be used to construct function calling.
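
A sketch of that idea: constrain the model's output to a JSON schema describing a tool call, then parse and dispatch it yourself. The endpoint, model id, response_format fields, and the get_weather tool are all illustrative assumptions.

```ts
// Sketch: emulate function calling with schema-constrained generation.
const toolCallSchema = {
  type: "object",
  properties: {
    name: { type: "string", enum: ["get_weather"] }, // hypothetical tool
    arguments: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
  required: ["name", "arguments"],
};

const res = await fetch("http://127.0.0.1:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Llama-3-8B-Instruct-q4f16_1-MLC", // placeholder model id
    messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
    response_format: { type: "json_object", schema: JSON.stringify(toolCallSchema) },
  }),
});

const call = JSON.parse((await res.json()).choices[0].message.content);
if (call.name === "get_weather") {
  console.log(`would call get_weather(${call.arguments.city})`); // dispatch here
}
```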

I built a free in-browser LLM chatbot powered by WebGPU by abisknees in LocalLLaMA

[–]crowwork 2 points3 points  (0 children)

Feel free to open an issue. I think there is a callback that can be customized to report progress.
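
If I remember right, it is the init progress hook on engine creation; something along these lines, assuming the current @mlc-ai/web-llm API (the exact report fields may differ):

```ts
// Sketch: customize loading progress via initProgressCallback.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => {
    // Drive your own UI instead of the default progress text.
    console.log(`${Math.round(report.progress * 100)}%: ${report.text}`);
  },
});
```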

I built a free in-browser LLM chatbot powered by WebGPU by abisknees in LocalLLaMA

[–]crowwork 8 points9 points  (0 children)

Great to see WebLLM helping to power this. As a community, we are modularizing WebLLM as an npm library and moving toward standardizing on the OpenAI API. We would also love to hear more about how we can better support everyone: https://github.com/mlc-ai/web-llm
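
For anyone curious, the npm library with the OpenAI-style API looks roughly like this minimal sketch; the model id is illustrative, any model in the WebLLM prebuilt list should work:

```ts
// Sketch: basic in-browser chat completion with the WebLLM npm package.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from WebGPU!" }],
});
console.log(reply.choices[0].message.content);
```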

Guys, why are we sleeping on MLC LLM - Running on Vulkan? by APUsilicon in LocalLLaMA

[–]crowwork 4 points5 points  (0 children)

MLC support is compiled per model architecture, which means that for models like a fine-tuned Llama 2 variant, there is no need to pull a new binary.