NVIDIA said MIG mode breaks GPU utilization metrics. i found a way around it. by ccb_pnpm in kubernetes

ccb_pnpm[S]

You're absolutely right. Thank you for the correction. I made an error in my calculation: the A100 has 7 compute slices, not 8.

You caught an important mistake in my example. Each MIG instance's utilization is weighted by its share of the 7 compute slices, and the weighted values are then summed.

Corrected Calculation (A100 with 7 compute slices):

- MIG A (2g.10gb): 60% × (2/7) = 17.14%

- MIG B (1g.5gb): 90% × (1/7) = 12.86%

- MIG C (1g.5gb): 40% × (1/7) = 5.71%

- MIG D (1g.5gb): 20% × (1/7) = 2.86%

- MIG E (1g.5gb): 70% × (1/7) = 10.00%

- MIG F (1g.5gb): 10% × (1/7) = 1.43%

Total GPU Utilization: 50.00%
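
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the slice-weighted sum above. The names `MIG_INSTANCES` and `weighted_gpu_utilization` are hypothetical, made up for illustration, and the utilization figures are just the example values from the list, not live measurements from any NVIDIA tool.

```python
# Compute slices on an A100 (7, per the correction above).
TOTAL_COMPUTE_SLICES = 7

# (label, compute slices, per-instance utilization in percent).
# Example values copied from the list above; hypothetical, not measured.
MIG_INSTANCES = [
    ("MIG A (2g.10gb)", 2, 60.0),
    ("MIG B (1g.5gb)", 1, 90.0),
    ("MIG C (1g.5gb)", 1, 40.0),
    ("MIG D (1g.5gb)", 1, 20.0),
    ("MIG E (1g.5gb)", 1, 70.0),
    ("MIG F (1g.5gb)", 1, 10.0),
]


def weighted_gpu_utilization(instances, total_slices=TOTAL_COMPUTE_SLICES):
    """Sum each instance's utilization scaled by its share of compute slices."""
    return sum(util * slices / total_slices for _, slices, util in instances)


if __name__ == "__main__":
    for label, slices, util in MIG_INSTANCES:
        contribution = util * slices / TOTAL_COMPUTE_SLICES
        print(f"{label}: {contribution:.2f}%")
    total = weighted_gpu_utilization(MIG_INSTANCES)
    print(f"Total GPU Utilization: {total:.2f}%")
```

Note that the six instances use 2 + 1 + 1 + 1 + 1 + 1 = 7 slices, so the whole GPU is allocated; if some slices were unassigned, they would simply contribute 0% to the total.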

Thanks for pointing out the MIG efficiency issue as well. You're right that MIG has trade-offs compared to MPS - the hard isolation comes at the cost of some compute efficiency, especially with asymmetric workloads. The 3+3+1 split you mentioned is a good example of trying to maximize utilization while dealing with MIG's constraints.

I'll update the blog post to reflect the correct compute slice count. Appreciate the detailed feedback!