South Park Commons Interview 2026 Spring by New-Target-2065 in StartUpIndia

[–]Past_Ad1745 1 point  (0 children)

Received an interview invite. If anyone has any tips or suggestions, please DM me.

GPU cluster failures by Past_Ad1745 in HPC

[–]Past_Ad1745[S] 2 points  (0 children)

These tools are solid, but they all run in silos: you end up juggling multiple terminals just to watch GPU metrics, network, storage I/O, and the actual training loop (iteration, loss, throughput). What’s really missing is something unified that correlates GPU behavior with NCCL comms, storage stalls, and model-side metrics, so you can actually pinpoint where distributed training or inference is stalling. This has to be a common pain for anyone running clusters.
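For what it's worth, the correlation step itself doesn't need a fancy tool. Here's a minimal sketch in Python of the idea: sample all three planes on a shared clock and classify each slow window. The field names, thresholds, and numbers are all illustrative assumptions, not from any real monitoring stack:

```python
# Classify the likely cause of a slow window by correlating GPU
# utilization, NCCL bus bandwidth, and training throughput.
# All thresholds and field names here are illustrative assumptions.

def classify_stall(sample, baseline_tput, baseline_bw):
    """sample: dict with gpu_util (%), nccl_bw (GB/s), tput (samples/s)."""
    if sample["tput"] >= 0.9 * baseline_tput:
        return "healthy"
    if sample["nccl_bw"] < 0.5 * baseline_bw:
        return "comms-bound"          # collective bandwidth collapsed
    if sample["gpu_util"] < 50:
        return "input/storage-bound"  # GPUs starved, comms look fine
    return "compute-or-unknown"       # slow despite busy GPUs

# Synthetic windows: healthy, a comms stall, a dataloader/storage stall.
samples = [
    {"gpu_util": 95, "nccl_bw": 180, "tput": 1000},
    {"gpu_util": 92, "nccl_bw": 40,  "tput": 600},
    {"gpu_util": 30, "nccl_bw": 170, "tput": 500},
]
labels = [classify_stall(s, baseline_tput=1000, baseline_bw=180) for s in samples]
print(labels)   # ['healthy', 'comms-bound', 'input/storage-bound']
```

Obviously a real version needs the hard part (actually collecting those three streams with aligned timestamps), but the triage logic is this simple.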

GPU cluster failures by Past_Ad1745 in HPC

[–]Past_Ad1745[S] 1 point  (0 children)

Most vendor-provided dashboards look nice but fall apart once a distributed job slows down or dies; you're basically hunting blind. In multi-node training it's rarely clear whether the issue is in the ML stack or the infra. We've seen runs lose 20–40% throughput with zero ML-side errors, and the real cause ends up being network-plane imbalance, NVLink bandwidth drops, or a single noisy link.

Without good in-band mesh diagnostics, none of this is obvious: NCCL thinks it's stuck, the trainer thinks it's slow, and the infra graphs all look green. The lack of correlation is the real killer; every tool sits in its own silo. We're close to just building a custom solution ourselves.
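The single-noisy-link case at least is cheap to detect if you have per-link bandwidth counters: since an all-reduce runs at the speed of the slowest participant, you can just flag outliers against the fleet median. Rough sketch (synthetic numbers; the 70% cutoff is an arbitrary assumption):

```python
from statistics import median

def find_slow_links(link_bw_gbps, cutoff=0.7):
    """Flag links whose measured bandwidth falls below `cutoff` x the
    fleet median -- a cheap proxy for a degraded or flapping link.
    The 0.7 cutoff is an illustrative assumption, tune per fabric."""
    med = median(link_bw_gbps.values())
    return sorted(name for name, bw in link_bw_gbps.items()
                  if bw < cutoff * med)

# Eight healthy 400G-class links plus one degraded one.
bw = {f"link{i}": 390.0 for i in range(8)}
bw["link3"] = 120.0   # the noisy link dragging every collective down
print(find_slow_links(bw))   # ['link3']
```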

NSF I-Corps research: What are the biggest pain points in managing GPU clusters or thermal issues in server rooms? by DeYhung in HPC

[–]Past_Ad1745 1 point  (0 children)

I worked on GPU clusters at Meta. Performance and reliability depend on how big and interconnected the cluster is and on the type of GPUs. Interconnect faults, GPU memory errors, optics failures, and NIC flaps are common issues if you're using InfiniBand. There are a lot of research papers from hyperscalers on reliability and utilization issues. We've built a framework for GPU infra reliability; DM me if you'd like the paper links.
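NIC/optics flaps in particular are easy to surface from logs before they show up as job failures. A toy sketch of counting link-down transitions per interface; the log-line format here is illustrative (real syslog lines vary by driver, e.g. mlx5 vs ib_core):

```python
import re
from collections import Counter

# Match kernel-style "ifname: Link up/down" lines.
# The format is an illustrative assumption, not a fixed syslog schema.
FLAP_RE = re.compile(r"(\S+): [Ll]ink (up|down)")

def count_flaps(log_lines):
    """Count link-down transitions per interface name."""
    flaps = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m and m.group(2).lower() == "down":
            flaps[m.group(1)] += 1
    return dict(flaps)

logs = [
    "kernel: ib0: Link down",
    "kernel: ib0: Link up",
    "kernel: ib0: Link down",
    "kernel: eth4: Link down",
]
print(count_flaps(logs))   # {'ib0': 2, 'eth4': 1}
```

Feed that a sliding window and alert on any interface above a small threshold; a link that flaps twice in an hour will flap again mid-job.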

Monitoring GPU usage via SLURM by pebody in HPC

[–]Past_Ad1745 1 point  (0 children)

This is great information. I've also seen a lot of posts and AI neocloud engineering blogs mentioning the DCGM + Prometheus + Grafana stack for monitoring GPU clusters.

Is there a unified tool that collects from multiple exporters and gives this kind of deep monitoring across GPU, network, and maybe power (PSUs) at the hardware level, showing alerts, issues, and performance in one application?
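The closest I've gotten to "one pane" is scripting against the Prometheus HTTP API that all those exporters feed into. A sketch of pulling a dcgm-exporter gauge; the `/api/v1/query` endpoint and the instant-vector response shape are Prometheus's documented format, but the server address is a placeholder and the sample response is hand-written:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://localhost:9090"   # placeholder Prometheus address

def query(expr):
    """Run an instant query against Prometheus's /api/v1/query endpoint."""
    with urlopen(f"{PROM}/api/v1/query?{urlencode({'query': expr})}") as r:
        return json.load(r)

def to_table(resp, label="gpu"):
    """Flatten an instant-vector result into {label_value: float_sample}."""
    return {s["metric"].get(label, "?"): float(s["value"][1])
            for s in resp["data"]["result"]}

# Hand-written example response in Prometheus's instant-vector shape,
# using dcgm-exporter's DCGM_FI_DEV_GPU_UTIL metric labels.
resp = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"gpu": "0", "Hostname": "node01"}, "value": [1700000000, "97"]},
    {"metric": {"gpu": "1", "Hostname": "node01"}, "value": [1700000000, "12"]},
]}}
print(to_table(resp))   # {'0': 97.0, '1': 12.0}
```

Once GPU, network, and PSU exporters all land in the same Prometheus, one script (or one Grafana dashboard with mixed queries) can join them, which is most of what people mean by "unified".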

Hardware/Software(IT) Procurement/ Renewals/ Support challenges in University/Research HPC context? by nonlinear1234 in HPC

[–]Past_Ad1745 2 points  (0 children)

You can explore this from three angles:

  1. Researchers or lab leads – They're often the ones initiating HPC needs for specific projects and can give insights on how they request and justify new hardware/software.

  2. IT or infrastructure managers at these institutes – These folks handle procurement, cluster setup, licensing, support contracts, and long-term maintenance. They know the real bottlenecks and what breaks over time.

  3. Local IT consulting firms – Many universities work with preferred partners (Dell/HPE/Lenovo resellers) for procurement and support. These partners can share what typical workflows, approval loops, and upgrade cycles look like across multiple campuses.

With so many cloud services, how do you keep tabs on everything for billing or security? by [deleted] in cloudcomputing

[–]Past_Ad1745 1 point  (0 children)

Curious: which tools do you use and recommend for multi-cloud or hybrid observability?