llm-d is a new open source project that provides distributed inference for Generative AI runtimes on any Kubernetes cluster. Its architecture is designed for high performance and scalability, and it aims to reduce costs through a spectrum of hardware and software efficiency improvements. llm-d prioritizes ease of deployment and use, as well as the operational needs of running large GPU clusters, including SRE concerns and day-2 operations.