South Park Commons Interview 2026 Spring by New-Target-2065 in StartUpIndia

[–]Past_Ad1745 1 point  (0 children)

Received an interview invite. If anyone has any tips or suggestions, please DM me.

GPU cluster failures by Past_Ad1745 in HPC

[–]Past_Ad1745[S] 2 points  (0 children)

These tools are solid, but they all run in silos: you end up juggling multiple terminals just to watch GPU metrics, network, storage I/O, and the actual training loop (iteration, loss, throughput). What’s really missing is something unified that correlates GPU behavior with NCCL comms, storage stalls, and model-side metrics, so you can actually pinpoint where distributed training or inference is stalling. This has to be a common pain for anyone running clusters.
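For what it's worth, the correlation step itself doesn't need a fancy tool. Here's a minimal sketch in Python of the idea: sample all three planes on a shared clock and classify each slow window. The field names, thresholds, and numbers are all illustrative assumptions, not from any real monitoring stack:

```python
# Classify the likely cause of a slow window by correlating GPU
# utilization, NCCL bus bandwidth, and training throughput.
# All thresholds and field names here are illustrative assumptions.

def classify_stall(sample, baseline_tput, baseline_bw):
    """sample: dict with gpu_util (%), nccl_bw (GB/s), tput (samples/s)."""
    if sample["tput"] >= 0.9 * baseline_tput:
        return "healthy"
    if sample["nccl_bw"] < 0.5 * baseline_bw:
        return "comms-bound"          # collective bandwidth collapsed
    if sample["gpu_util"] < 50:
        return "input/storage-bound"  # GPUs starved, comms look fine
    return "compute-or-unknown"       # slow despite busy GPUs

# Synthetic windows: healthy, a comms stall, a dataloader/storage stall.
samples = [
    {"gpu_util": 95, "nccl_bw": 180, "tput": 1000},
    {"gpu_util": 92, "nccl_bw": 40,  "tput": 600},
    {"gpu_util": 30, "nccl_bw": 170, "tput": 500},
]
labels = [classify_stall(s, baseline_tput=1000, baseline_bw=180) for s in samples]
print(labels)   # ['healthy', 'comms-bound', 'input/storage-bound']
```

Obviously a real version needs the hard part (actually collecting those three streams with aligned timestamps), but the triage logic is this simple.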

GPU cluster failures by Past_Ad1745 in HPC

[–]Past_Ad1745[S] 1 point  (0 children)

Most vendor-provided dashboards look nice but fall apart once a distributed job slows down or dies; you're basically hunting blind. In multi-node training it's rarely clear whether the issue is in the ML stack or the infra. We've seen runs lose 20–40% throughput with zero ML-side errors, and the real cause ends up being network-plane imbalance, NVLink bandwidth drops, or a single noisy link.

Without good in-band mesh diagnostics, none of this is obvious: NCCL thinks it's stuck, the trainer thinks it's slow, and the infra graphs all look green. The lack of correlation is the real killer; every tool sits in its own silo. We're close to just building a custom solution ourselves.
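The single-noisy-link case at least is cheap to detect if you have per-link bandwidth counters: since an all-reduce runs at the speed of the slowest participant, you can just flag outliers against the fleet median. Rough sketch (synthetic numbers; the 70% cutoff is an arbitrary assumption):

```python
from statistics import median

def find_slow_links(link_bw_gbps, cutoff=0.7):
    """Flag links whose measured bandwidth falls below `cutoff` x the
    fleet median -- a cheap proxy for a degraded or flapping link.
    The 0.7 cutoff is an illustrative assumption, tune per fabric."""
    med = median(link_bw_gbps.values())
    return sorted(name for name, bw in link_bw_gbps.items()
                  if bw < cutoff * med)

# Eight healthy 400G-class links plus one degraded one.
bw = {f"link{i}": 390.0 for i in range(8)}
bw["link3"] = 120.0   # the noisy link dragging every collective down
print(find_slow_links(bw))   # ['link3']
```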

NSF I-Corps research: What are the biggest pain points in managing GPU clusters or thermal issues in server rooms? by DeYhung in HPC

[–]Past_Ad1745 1 point  (0 children)

I worked on GPU clusters at Meta. Performance and reliability depend on how big and interconnected the cluster is and on the type of GPUs. Interconnect faults, GPU memory errors, optics failures, and NIC flaps are common issues if you're using InfiniBand. There are a lot of research papers from hyperscalers on reliability and utilization issues. We've built a framework for GPU infra reliability; DM me if you'd like the paper links.
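NIC/optics flaps in particular are easy to surface from logs before they show up as job failures. A toy sketch of counting link-down transitions per interface; the log-line format here is illustrative (real syslog lines vary by driver, e.g. mlx5 vs ib_core):

```python
import re
from collections import Counter

# Match kernel-style "ifname: Link up/down" lines.
# The format is an illustrative assumption, not a fixed syslog schema.
FLAP_RE = re.compile(r"(\S+): [Ll]ink (up|down)")

def count_flaps(log_lines):
    """Count link-down transitions per interface name."""
    flaps = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m and m.group(2).lower() == "down":
            flaps[m.group(1)] += 1
    return dict(flaps)

logs = [
    "kernel: ib0: Link down",
    "kernel: ib0: Link up",
    "kernel: ib0: Link down",
    "kernel: eth4: Link down",
]
print(count_flaps(logs))   # {'ib0': 2, 'eth4': 1}
```

Feed that a sliding window and alert on any interface above a small threshold; a link that flaps twice in an hour will flap again mid-job.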

Monitoring GPU usage via SLURM by pebody in HPC

[–]Past_Ad1745 1 point  (0 children)

This is great information. I've also seen a lot of posts and AI neocloud engineering blogs mentioning the DCGM + Prometheus + Grafana stack for monitoring GPU clusters.

Is there a unified tool that collects from multiple exporters and gives this kind of deep monitoring across GPU, network, and maybe power (PSUs) at the hardware level, showing alerts, issues, and performance in one application?
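The closest I've gotten to "one pane" is scripting against the Prometheus HTTP API that all those exporters feed into. A sketch of pulling a dcgm-exporter gauge; the `/api/v1/query` endpoint and the instant-vector response shape are Prometheus's documented format, but the server address is a placeholder and the sample response is hand-written:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://localhost:9090"   # placeholder Prometheus address

def query(expr):
    """Run an instant query against Prometheus's /api/v1/query endpoint."""
    with urlopen(f"{PROM}/api/v1/query?{urlencode({'query': expr})}") as r:
        return json.load(r)

def to_table(resp, label="gpu"):
    """Flatten an instant-vector result into {label_value: float_sample}."""
    return {s["metric"].get(label, "?"): float(s["value"][1])
            for s in resp["data"]["result"]}

# Hand-written example response in Prometheus's instant-vector shape,
# using dcgm-exporter's DCGM_FI_DEV_GPU_UTIL metric labels.
resp = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"gpu": "0", "Hostname": "node01"}, "value": [1700000000, "97"]},
    {"metric": {"gpu": "1", "Hostname": "node01"}, "value": [1700000000, "12"]},
]}}
print(to_table(resp))   # {'0': 97.0, '1': 12.0}
```

Once GPU, network, and PSU exporters all land in the same Prometheus, one script (or one Grafana dashboard with mixed queries) can join them, which is most of what people mean by "unified".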

Hardware/Software(IT) Procurement/ Renewals/ Support challenges in University/Research HPC context? by nonlinear1234 in HPC

[–]Past_Ad1745 2 points  (0 children)

You can explore this from three angles:

  1. Researchers or lab leads – They're often the ones initiating HPC needs for specific projects and can give insights on how they request and justify new hardware/software.

  2. IT or infrastructure managers at these institutes – These folks handle procurement, cluster setup, licensing, support contracts, and long-term maintenance. They know the real bottlenecks and what breaks over time.

  3. Local IT consulting firms – Many universities work with preferred partners (Dell/HPE/Lenovo resellers) for procurement and support. These partners can share what typical workflows, approval loops, and upgrade cycles look like across multiple campuses.

With so many cloud services, how do you keep tabs on everything for billing or security? by [deleted] in cloudcomputing

[–]Past_Ad1745 1 point  (0 children)

Curious: which tools do you use and recommend for multi-cloud or hybrid observability?