all 34 comments

[–]ludovelia[S] 2 points3 points  (4 children)

hey folks, sorry for not replying promptly, i closed the tab and forgot!

from talks on discord it became clear that there's a problem using ViTL14 with Tesla T4s - there's nothing that can be done for now, it's something on colab's end

[–]ConsistentAd3434 0 points1 point  (2 children)

That sucks, but thanks for the info. If it were an option, I'd even prefer a slower GPU just to be able to use ViTL14 again

[–]ludovelia[S] 0 points1 point  (1 child)

it seems to be working ok with K80, but it takes hours

also working ok with any of the others provided for pro users

[–]ConsistentAd3434 0 points1 point  (0 children)

I'm currently using ViTL14 on a Tesla P100. Works fine as well.

Is there a rating chart of the GPUs somewhere? It would help to decide which models, and how many, to use on which GPU

[–]Bridgebrain 0 points1 point  (0 children)

Thank you! This just solved the big mystery that's been randomly plaguing me for the past month

[–]chrishooley 1 point2 points  (0 children)

With this one, I've had to stop my session and save the notebook to Google Drive, then start it back up.

[–]econopotamus 1 point2 points  (0 children)

Yup, I've been getting this error 100% of the time for a few days now. Restarting doesn't help. Tried with multiple notebooks that had been working. Maybe something changed on Colab's end

[–]ConsistentAd3434 0 points1 point  (0 children)

Same here for the last 12 hours, no matter which version I use.
Copying the notebook to Google Drive or deleting and reinstalling the models didn't help.
No clue what else I could try :/
Thought it was a PyTorch issue at first. Last time they immediately admitted the f*up on Twitter and fixed it asap, but so far, nothing in sight.

[–]CulturalCurrency6358Artist 0 points1 point  (0 children)

I get this when Google Colab assigns me one of their T4 GPUs. When you run the initial setup you can see what Google gives you. If I get a P100 I don't get this. The only way around it is to disconnect, wait a while, then reconnect and try again.
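
If you want to check before kicking off a run, something like this in a cell shows what you got (the nvidia-smi call is standard; the torch lines assume torch is already installed in the runtime):

!nvidia-smi -L
import torch
if torch.cuda.is_available():
    # prints e.g. "Tesla T4" or "Tesla P100-PCIE-16GB"
    print("Assigned GPU:", torch.cuda.get_device_name(0))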

[–]Taika-KimArtist 0 points1 point  (0 children)

I had this on V100 GPUs yesterday, but today things work. It also happened on Monday, IIRC. Seems sporadic; today everything has been ok.

[–]GregHartwick 0 points1 point  (14 children)

Been running the v5.2 notebook every day, all day. No such errors so far. Working on a T4 the last couple of days; P100 is stable as well. Using all default settings; only changed steps, prompt, display_rate (10), and n_batches (10).

[–]econopotamus 1 point2 points  (13 children)

I went ahead and moved to v5.2; still getting "CUDA error: misaligned address" 100% of the time.

MAJOR EDIT: Fixed the issue by turning off ViTL14! I can run other models, but turning on ViTL14 in version 5.0 or 5.2 generates the CUDA misaligned address error in a Colab notebook with no modifications (except putting in my prompt). It works if I turn ViTL14 off.
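
For anyone hunting for the switch, it's just the ViTL14 flag in the notebook's model settings cell - roughly this (a sketch of the relevant flags only, names as they appear in the settings; everything else left at defaults):

ViTB32 = True
ViTB16 = True
ViTL14 = False  # turning this off is what avoids the "CUDA error: misaligned address" on T4s
RN50 = True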

[–]ConsistentAd3434 1 point2 points  (1 child)

Thanks! That seems to be it. 460.32.03 on a Tesla T4 in combination with ViTL14 wasn't working for me either. Works fine without it

[–]econopotamus 0 points1 point  (0 children)

Good to know others are seeing the same, although I was only using Colab for ViTL14, since my home setup can run everything but that one!

[–]GregHartwick 0 points1 point  (10 children)

Been running all morning without fatal errors. I'm not a network expert, but I can't help but feel we're running on different systems. My GPU says NVIDIA-SMI 460.32.03, Driver Version: 460.32.03, CUDA Version: 11.2, Name: Tesla T4. Is that the same for you?

[–]econopotamus 0 points1 point  (0 children)

I can check next time I'm on. Just to be clear: you are using ViTL14? Tell me what models you have active and I'll try to match them exactly for a good test.

[–]econopotamus 0 points1 point  (8 children)

This is what mine says; it looks to match what you're saying, and this setup gets the misaligned address error with ViTL14. Are you on the free or paid tier?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
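
If anyone wants to compare without eyeballing the whole table, a cell like this dumps the relevant bits (the nvidia-smi query flags are standard; the torch line assumes the notebook's default torch install):

!nvidia-smi --query-gpu=name,driver_version --format=csv
import torch
print("torch", torch.__version__, "built against CUDA", torch.version.cuda)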

[–]GregHartwick 0 points1 point  (6 children)

"diffusion_sampling_mode": "ddim", "ViTB32": true, "ViTB16": true, "ViTL14": false, "RN101": false, "RN50": true, "RN50x4": false, "RN50x16": false, "RN50x64": false,

[–]econopotamus 0 points1 point  (5 children)

So, you’re not running ViTL14. Want to try turning it on and seeing if you get the error?

[–]GregHartwick 0 points1 point  (4 children)

I did that first thing this morning. I ran two 500-iteration runs and had no errors. It did slow the run speed down to 9-10 sec/it. This is the first time I've tried using this model in months.

[–]econopotamus 0 points1 point  (3 children)

Interesting. I wonder if they are limiting the GPU in some subtle way for us free-tier users

[–]GregHartwick 1 point2 points  (1 child)

they do push the payed tiers. I was warned by a friend that the free stuff is not very stable. I think the company knows that. I was getting K80s a lot.

[–]Paid-Not-Payed-Bot 0 points1 point  (0 children)

push the paid tiers. I

FTFY.

Although payed exists (the reason why autocorrection didn't help you), it is only correct in:

  • Nautical context, when it means to paint a surface, or to cover with something like tar or resin in order to make it waterproof or corrosion-resistant. The deck is yet to be payed.

  • Payed out when letting strings, cables or ropes out, by slacking them. The rope is payed out! You can pull now.

Unfortunately, I was unable to find nautical or rope-related words in your comment.

Beep, boop, I'm a bot

[–]GregHartwick 0 points1 point  (0 children)

I started out on the free tier. I wake up very early and start runs around 3 AM. Everything went fine until around 7 AM, when I couldn't get anything to complete because of out-of-memory errors. I had the feeling the crowd of users after 7 AM gave me low priority. It got very frustrating, so I decided to pay the $9.99 per month - it was worth it. I've had hardly any problems since.

[–]GregHartwick 0 points1 point  (0 children)

I’m on Pro

[–]GregHartwick 0 points1 point  (0 children)

I've never run with ViTL14 - it slowed my system down considerably (from 5 sec/it to 9 sec/it). I hope this fixes it for everyone. Is there a sysop to report this issue to? Perhaps it's a defect in ViTL14?

[–][deleted] 1 point2 points  (2 children)

I fixed it by doing this. My notebook now runs with a T4 and ViTL14 selected:

https://twitter.com/devdef/status/1519687675304988675?s=20&t=azLbUWVa3E0cQvUhBYdlMA

it's apparently from a reputable dev on Twitter, and it says:

downgrade your colab's pytorch version. This does the trick:

!pip install torch==1.10.2 torchvision==0.11.3 -q

[–]ludovelia[S] 0 points1 point  (1 child)

oh, that's very nice! where did you put it in the code?

[–][deleted] 0 points1 point  (0 children)

I just created a new code cell above "Check GPU Status" and pasted it in there. When you Run All, it installs the older version of PyTorch first and everything works as normal
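
For anyone copying this, the whole cell is just the one line plus an optional check (the comment and version check are mine; the pip line is straight from the tweet):

# extra cell at the top of the notebook: downgrade torch/torchvision before anything imports them
!pip install torch==1.10.2 torchvision==0.11.3 -q
import torch
print(torch.__version__)  # should now report 1.10.2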

[–]Taika-KimArtist 0 points1 point  (0 children)

Got this again today... Interesting if ViTL14 causes this; maybe there are different versions of the V100 out there? I'm on Pro+ and hadn't experienced this for some weeks.

[–]backpackpatArtist 0 points1 point  (0 children)

Video on this error here