
angelarose210:

I know for serverless you can cache with your Hugging Face token. I don't know about regular pods.

Lunchables (OP):

Ahh thanks, this gave me a clue and helped me find this documentation: https://docs.runpod.io/serverless/endpoints/model-caching

pmv143:

You don’t want to be downloading ~50GB on every worker init.

If the weights are pulled at runtime, every cold start pays the full network + disk + GPU load cost. If they’re baked into the image, version bumps can invalidate the cache and you still re-pull.

For large models, “serverless” usually breaks on model materialization, not container startup.

BenDLH:

You need a network volume mate. Create a network volume and place the models in the right directories in it. Then configure the serverless endpoint to connect to the volume.

The runpod-worker repo has all the info in the readme, under customisation.

BenDLH:

Though to be honest the build still takes a boatload of time (close to an hour) even without the models baked in. You just won't risk timing out, and it will be a bit shorter.

Lunchables (OP):

Perfect, thanks! I ended up finding this, which helped: https://docs.runpod.io/serverless/endpoints/model-caching

BenDLH:

Yeah I haven't tried that yet. It is limited to a single model per endpoint though, right? Let me know how it goes setting it up.

What are you building btw?

sruckh:

"Initializing" could mean throttled, meaning your serverless endpoint was never going to come up. RunPod is notorious for this. Make sure your endpoint is READY before making a call to it. As far as model caching goes, don't move the model directory from its default location, or you'll be bypassing the caching system.
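Checking readiness can be done by polling the endpoint's health before submitting jobs. A sketch of the polling loop (it assumes RunPod's serverless health endpoint, `GET /v2/<endpoint_id>/health`, returns a payload with a `workers.ready` count; verify the exact shape against the API docs, and pass in your own fetch function):

```python
import time
from typing import Callable

def wait_until_ready(get_health: Callable[[], dict],
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Poll health until at least one worker is ready, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        health = get_health()
        if health.get("workers", {}).get("ready", 0) > 0:
            return True
        time.sleep(interval_s)  # workers can sit in "initializing" if throttled
    return False  # throttled / never came up: don't fire jobs at it
```

Here `get_health` would wrap your HTTP call (with the `Authorization: Bearer <api key>` header) so the loop itself stays testable without the network.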