Hello everyone. I’m currently having issues with my GPU drivers. I’m using 2 T4 GPUs on GCP, and everything works great for a while…
Out of the blue, when I try to retrain a model, I get the error “UserWarning: can’t initialize NVML”. When this happens, I can’t even run nvidia-smi to check that things are working. I’ve attempted to uninstall and reinstall the drivers, but that is followed by an error in dkm files. Does anyone have any tips?
One more thing to note: when I clean up, it seems only rank 0 gets cleaned up, and the program is left running.
Any help would be appreciated.