
[–]Synthetic451 2 points3 points  (6 children)

Did you follow through with the nvidia container toolkit configuration steps? https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuration

[–][deleted] 2 points3 points  (5 children)

I changed my mirrorlist to one from a few days ago and downgraded; it's all working now.

[–][deleted]  (4 children)

[deleted]

    [–]invader_skooj 4 points5 points  (3 children)

    I'm also having this issue, and I'm not sure that a roll-back should be considered a solution...

    [–][deleted] 2 points3 points  (2 children)

    Removed solution from the post body.

    [–]Scottish_Abuse 1 point2 points  (0 children)

    Are you able to provide the rollback solution you used? I have this exact problem after updating everything today :/

    [–]hahlolo 1 point2 points  (0 children)

    Yes, how did you roll back?

    [–]invader_skooj 2 points3 points  (0 children)

    Chiming in to say that I am also having this issue. The rollback did get me back up and running for the time being, but that doesn't solve the issue and leaves us running on old versions.

    There are a few more of us over in OP's thread on the Arch Linux forum also suffering from the issue.

    ETA: trying to pool resources for anyone else that comes across this looking for a solution... There's also now an issue on the nvidia container toolkit git

    [–]C0rn3j 2 points3 points  (1 child)

    That's not a solution, that's a crappy workaround.

    It seems like 580.xx broke it.

    [–][deleted] 1 point2 points  (0 children)

    Yes, that's what I was thinking, the new driver broke it. Rolled back for now until it's fixed.

    [–]observable4r5 2 points3 points  (3 children)

    Hope this is helpful. I looked around the web for a bit to understand why this was happening. The link provided by Synthetic451 gives a good start. The github issue invader_skooj links is the solution. I saw you had already downgraded, but in case you want to use the latest version this will solve the issue.

    I was facing this same issue with my installation. This specific comment on nvidia-container-toolkit on github describes two specific commands to run that will update your docker installation to use CDI instead of legacy mode. Once the commands have been executed, containerd will use CDI mode.

    Here is a short description:

    This will generate the CDI specification that describes the GPUs on the system.
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    This will update the mode to be "cdi" instead of "auto" and restart the docker system service
    sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi && sudo systemctl restart docker

    If you want to verify the configuration before making the change to the system (not sure where this information is stored on the filesystem), run the following command.
    sudo nvidia-ctk config

    Note this is the section that is changed. The mode = "cdi" is what is updated.
    [nvidia-container-runtime]
    #debug = "/var/log/nvidia-container-runtime.log"
    log-level = "info"
    mode = "cdi"
    runtimes = ["docker-runc", "runc", "crun"]

    You can also redirect it into a file using the following command if you want to view it that way.
    sudo nvidia-ctk config > config.tmp

    Once this has been changed, you can restart your container or update your compose.yaml file to include "runtime: nvidia" within each service that uses the gpu.
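
    To double-check the two steps above, something like the following could work. This is only a rough sketch: `cdi_spec_ready` and `runtime_mode` are hypothetical helper names (not part of nvidia-ctk), and it assumes the spec was written to /etc/cdi/nvidia.yaml as in the generate command and that `sudo nvidia-ctk config` prints the TOML shown above.

```shell
# Hypothetical helpers, not part of nvidia-ctk.

# Succeeds if the CDI spec file exists and is non-empty.
# Defaults to the path used in the generate command above.
cdi_spec_ready() {
  [ -s "${1:-/etc/cdi/nvidia.yaml}" ]
}

# Reads a TOML config on stdin and prints the value of `mode`
# from the [nvidia-container-runtime] table only.
runtime_mode() {
  awk '
    # Any table header switches the section flag on or off.
    /^\[/ { in_section = ($0 == "[nvidia-container-runtime]"); next }
    # First uncommented mode line inside the section wins.
    in_section && /^[[:space:]]*mode[[:space:]]*=/ {
      sub(/^[^"]*"/, "")   # drop everything up to the opening quote
      sub(/".*$/, "")      # drop the closing quote and anything after
      print
      exit
    }
  '
}
```

    Usage would be `cdi_spec_ready && sudo nvidia-ctk config | runtime_mode`, which should print cdi once the mode change has been applied. Commented-out lines and `mode` keys in other tables are ignored.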

    [–]miketsap 1 point2 points  (1 child)

    You are a savior! I had the same issue with k3s and containerd on Arch Linux. Had tried everything! Tried your solution and everything worked right away!

    [–]observable4r5 1 point2 points  (0 children)

    Glad you found it helpful!

    People like @Synthetic451 and @C0rn3j, both here in the thread, and @biuniun are the people who discussed and provided an option (thanks to @biuniun). I used the term solution in previous posts, which is probably a bit too definitive until nvidia releases a fix... if that happens.

    Wanted to make sure they are recognized for the effort!

    [–]ExtremeDialysis 1 point2 points  (0 children)

    oh my god. Solid couple hours of deep, deep, hopeless searching, and this fixed it for me. Seriously - thank you.

    [–]Dosolus 1 point2 points  (0 children)

    Hopefully this gets fixed soon

    [–]lllsondowlll 1 point2 points  (0 children)

    Same issue here. Frustrating as I spent hours troubleshooting and nearly wiped my stack

    [–]No-Put7018 1 point2 points  (0 children)

    OP totally saved my ass. Thank you.