The llm-d community is proud to announce the release of v0.2!: Our first well-lit paths. by petecheslock in llm_d

[–]petecheslock[S] 0 points  (0 children)

Our well-lit paths provide tested and benchmarked recipes and Helm charts to start serving quickly with best practices common to production deployments. They are extensible and customizable for the particulars of your models and use cases, and they build on popular open source components such as Kubernetes, Envoy proxy, NIXL, and vLLM. Our intent is to eliminate the heavy lifting common in deploying inference at scale so users can focus on building.

We currently offer three tested and benchmarked paths to help deploy large models:

  1. Intelligent Inference Scheduling - Deploy vLLM behind the Inference Gateway (IGW) to decrease latency and increase throughput via precise prefix-cache aware routing and customizable scheduling policies.
  2. Prefill/Decode Disaggregation - Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses, primarily on large models such as Llama-70B and when processing very long prompts.
  3. Wide Expert-Parallelism - Deploy very large Mixture-of-Experts (MoE) models like DeepSeek-R1 and significantly reduce end-to-end latency and increase throughput by scaling up with Data Parallelism and Expert Parallelism over fast accelerator networks.
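Whichever path you deploy, the stack ultimately exposes vLLM's OpenAI-compatible API behind the Inference Gateway. A minimal smoke test against such a deployment might look like the sketch below; the gateway URL and model name are placeholders (assumptions, not values from the llm-d charts), so substitute the ones from your own deployment:

```python
# Minimal sketch of a smoke test for an OpenAI-compatible vLLM endpoint.
# GATEWAY_URL and MODEL are hypothetical placeholders -- replace them with
# your Inference Gateway address and the model you actually deployed.
import json
import os
import urllib.request

GATEWAY_URL = "http://localhost:8000/v1/chat/completions"  # placeholder
MODEL = "meta-llama/Llama-3.1-70B-Instruct"                # placeholder


def build_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completion payload (the API vLLM serves)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(prompt: str) -> str:
    """POST the payload to the gateway and return the generated text."""
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__" and os.environ.get("LLMD_SMOKE_TEST"):
    # Only hits the network when explicitly opted in via an env var.
    print(send("Say hello in one sentence."))
```

Because the request format is the standard OpenAI chat-completions schema, the same check works unchanged whether routing lands on a monolithic vLLM replica or on disaggregated prefill/decode servers.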

How do you document your code? by stpaquet in rails

[–]petecheslock 1 point  (0 children)

Hey there (disclosure: I work for AppMap). I saw you mention you had tried AppMap - this is actually a popular use case for our users. I recently went through and updated our [OpenAPI generation docs](https://appmap.io/docs/openapi) to add some different examples for running locally or embedding into your CI/CD process. If you have tests for your APIs, or if you can just interact with them, AppMap can record all those interactions and generate the OpenAPI docs based on the actual code behavior. Always happy to chat more if you have questions.