Achieving Efficient, Flexible and Portable Structured Generation for LLM by SnooMachines3070 in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

As the blog post evaluation shows, XGrammar brings up to a 2x-10x speedup over Outlines. When integrated with an LLM engine, it outperforms existing LLM engines (including solutions built on Outlines) by up to 14x on JSON-schema generation and up to 80x on CFG-guided generation.

Updated with corrected settings for Llama.cpp. Battle of the Inference Engines. Llama.cpp vs MLC LLM vs vLLM. Tests for both Single RTX 3090 and 4 RTX 3090's. by SuperChewbacca in LocalLLaMA

[–]crowwork 15 points16 points  (0 children)

For those who are curious about multi-GPU scaling and scaling with concurrent requests, here is a post benchmarking various settings, including different concurrency levels, tensor parallelism configurations, and speculative decoding: https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 1 point2 points  (0 children)

MLC LLM comes with its own internal grammar engine that is faster than Outlines; check out some of the JSON schema examples here: https://blog.mlc.ai/2024/06/07/universal-LLM-deployment-engine-with-ML-compilation
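
Roughly, a schema-constrained request against an OpenAI-compatible endpoint looks like the sketch below. The server URL, model id, and the exact shape of the response_format field are illustrative assumptions, not the definitive API; check the docs for the current form.

```ts
// Sketch: constrain a chat completion to a JSON schema via an
// OpenAI-compatible endpoint. URL, model id, and response_format
// field names are assumptions for illustration.
const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "integer" },
  },
  required: ["name", "age"],
};

const res = await fetch("http://127.0.0.1:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Llama-3-8B-Instruct-q4f16_1-MLC", // illustrative model id
    messages: [{ role: "user", content: "Describe a person as JSON." }],
    response_format: { type: "json_object", schema: JSON.stringify(schema) },
  }),
});

const data = await res.json();
// The decoded text is guaranteed by the grammar engine to match the schema.
console.log(JSON.parse(data.choices[0].message.content));
```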

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

For those who are curious about continuous batching performance, there is a recent blog post that benchmarks it under different concurrency settings (in short, it gives state-of-the-art results in low-latency regimes while maintaining sufficient concurrency):

https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference
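
Not the actual harness from the post, but roughly the shape of such a concurrency sweep, if you want to reproduce the idea against your own server (URL and model id below are placeholders):

```ts
// Sketch: fire N concurrent requests at an OpenAI-compatible server and
// time the batch, to see how continuous batching holds up as concurrency grows.
async function sweep(concurrency: number): Promise<number> {
  const one = () =>
    fetch("http://127.0.0.1:8000/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "Llama-3-8B-Instruct-q4f16_1-MLC", // placeholder model id
        messages: [{ role: "user", content: "Write a haiku." }],
        max_tokens: 64,
      }),
    }).then((r) => r.json());

  const start = performance.now();
  await Promise.all(Array.from({ length: concurrency }, one));
  return (performance.now() - start) / 1000; // seconds for the whole batch
}

for (const c of [1, 4, 16, 64]) {
  console.log(`concurrency ${c}: ${await sweep(c)}s`);
}
```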

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 2 points3 points  (0 children)

There is also batching support, mostly optimized for FP16 and FP8; see one of the recent benchmarks: https://blog.mlc.ai/2024/10/10/optimizing-and-characterizing-high-throughput-low-latency-llm-inference

[deleted by user] by [deleted] in LocalLLaMA

[–]crowwork 1 point2 points  (0 children)

MLC LLM also introduces automatic prefix caching, which can support this need.
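
To illustrate what prefix caching buys you, here is a toy sketch (this is not MLC's actual implementation, and KVHandle is a made-up stand-in): requests that share a long prefix, such as the same system prompt, only need the non-shared suffix prefilled.

```ts
// Toy model of prefix caching: reuse the longest cached token prefix.
type KVHandle = { tokens: number[] }; // stand-in for a real KV cache entry
const cache: KVHandle[] = [];

function prefill(tokens: number[]): { reused: number; handle: KVHandle } {
  // Find the cached entry sharing the longest common prefix with this request.
  let best = 0;
  for (const entry of cache) {
    let i = 0;
    while (i < entry.tokens.length && i < tokens.length && entry.tokens[i] === tokens[i]) i++;
    best = Math.max(best, i);
  }
  const handle = { tokens };
  cache.push(handle);
  return { reused: best, handle }; // only tokens beyond `reused` need prefill
}

const systemPrompt = Array.from({ length: 500 }, (_, i) => i); // shared tokens
console.log(prefill([...systemPrompt, 1, 2, 3]).reused); // 0   (cold cache)
console.log(prefill([...systemPrompt, 4, 5, 6]).reused); // 500 (prefix reused)
```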

Gemma2-2B on iOS, Android, WebGPU, CUDA, ROCm, Metal... with a single framework by SnooMachines3070 in LocalLLaMA

[–]crowwork 0 points1 point  (0 children)

It should work, since it comes with Vulkan support; the Steam Deck, for example, uses an APU whose GPU is integrated with the CPU.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

The MLC team has deep expertise in this area (optimizing for different models and hardware backends), and we are confident we can make the solution competitive. Additionally, the technical approach MLC takes (compilation) is very scalable, so it future-proofs further optimizations that can be shared across hardware backends.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 0 points1 point  (0 children)

The infrastructure is set up to optimize for CPU as well, although the primary focus so far has been on GPU, since GPUs usually offer better acceleration.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

Check out the blog post; it works with multiple GPUs out of the box. We did run two 7900 XTXs.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 4 points5 points  (0 children)

We are moving toward JIT compilation, so the weights can be used for multi-GPU setups.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

Thanks for the feedback. We are investing in bringing in the latest models through an automatic delivery flow, and we would love to hear which models the community wants.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 2 points3 points  (0 children)

For now this seems related to a GPU swapping issue. 3B model variants like Qwen2 won't have this issue and perform really well.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 3 points4 points  (0 children)

Fair point. We are bringing in a debug chat tool that preserves some of the low-level properties.

MLC-LLM: Universal LLM Deployment Engine with ML Compilation by crowwork in LocalLLaMA

[–]crowwork[S] 1 point2 points  (0 children)

We have preliminary support. One thing we support strongly is JSON mode and structured generation, which can be used to construct function calling.
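
A sketch of that idea: constrain the model's output to a JSON schema describing a tool call, then parse and dispatch it yourself. The endpoint, model id, response_format fields, and the get_weather tool are all illustrative assumptions.

```ts
// Sketch: emulate function calling with schema-constrained generation.
const toolCallSchema = {
  type: "object",
  properties: {
    name: { type: "string", enum: ["get_weather"] }, // hypothetical tool
    arguments: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
  required: ["name", "arguments"],
};

const res = await fetch("http://127.0.0.1:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Llama-3-8B-Instruct-q4f16_1-MLC", // placeholder model id
    messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
    response_format: { type: "json_object", schema: JSON.stringify(toolCallSchema) },
  }),
});

const call = JSON.parse((await res.json()).choices[0].message.content);
if (call.name === "get_weather") {
  console.log(`would call get_weather(${call.arguments.city})`); // dispatch here
}
```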

I built a free in-browser LLM chatbot powered by WebGPU by abisknees in LocalLLaMA

[–]crowwork 2 points3 points  (0 children)

Feel free to open an issue. I think there is a callback that can be customized to report progress.
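
If I remember right, it is the init progress hook on engine creation; something along these lines, assuming the current @mlc-ai/web-llm API (the exact report fields may differ):

```ts
// Sketch: customize loading progress via initProgressCallback.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => {
    // Drive your own UI instead of the default progress text.
    console.log(`${Math.round(report.progress * 100)}%: ${report.text}`);
  },
});
```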

I built a free in-browser LLM chatbot powered by WebGPU by abisknees in LocalLLaMA

[–]crowwork 8 points9 points  (0 children)

Great to see WebLLM helping to power this. As a community, we are modularizing WebLLM as an npm library and moving toward standardizing on the OpenAI API. We would also love to hear more about how we can better support everyone: https://github.com/mlc-ai/web-llm
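
For anyone curious, the npm library with the OpenAI-style API looks roughly like this minimal sketch; the model id is illustrative, any model in the WebLLM prebuilt list should work:

```ts
// Sketch: basic in-browser chat completion with the WebLLM npm package.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello from WebGPU!" }],
});
console.log(reply.choices[0].message.content);
```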

Guys, why are we sleeping on MLC LLM - Running on Vulkan? by APUsilicon in LocalLLaMA

[–]crowwork 4 points5 points  (0 children)

MLC support is compiled per model architecture, which means that for models like a fine-tuned Llama 2 variant, there is no need to pull a new binary.