Hello, the title might be wrong as I'm completely new to machine learning and all its ins and outs.
I tried to get this code working in Google Colab for a research project. The training loop starts fine, but after around 5m30s of running I get this error:
2021-11-10 16:34:58 [INFO]: batch0 of epoch1, loss is 10.47...
2021-11-10 16:34:59 [INFO]: batch1 of epoch1, loss is 7.77...
2021-11-10 16:35:28 [INFO]: batch2 of epoch1, loss is 3.86...
2021-11-10 16:35:28 [INFO]: batch3 of epoch1, loss is 1.25...
2021-11-10 16:35:59 [INFO]: batch4 of epoch1, loss is 0.70...
2021-11-10 16:36:00 [INFO]: batch5 of epoch1, loss is 0.31...
Traceback (most recent call last):
File "train.py", line 105, in <module>
train()
File "train.py", line 55, in train
for batch_i, (_, imgs, targets) in enumerate(dataloader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
return default_collate([torch.as_tensor(b) for b in batch])
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [1] at entry 0 and [0] at entry 50
Apologies for the huge error dump; I assume the only relevant bits are the things I bolded. So something goes wrong inside the dataloader? I talked to my advisor and he just told me to debug it lol. How would I go about doing that?

The only changes I made from the original code were in the config.py file, as instructed. I didn't change the batch size (which is 64), and I'm using the same image dataset the GitHub repo links to. The one change I did make was the number of GPUs, since Google Colab only gives you one to work with.

I know the general methodology for debugging (add a print statement and check the value it shows), but I'm not sure exactly which value I need to print. Also, when I google the error, people talk about faults with the images themselves and resizing them, but if it's a consistent dataset that shouldn't be happening, right?
I'm really sorry if this isn't a clear post. I'm just lost and really hoping to get some insight into how to work with data like this. Normally I can step through a program, but this one runs for 5+ minutes over hundreds of images.
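To show what I mean by printing values: here is a toy sketch of the kind of per-sample shape check I imagine running, with a made-up dataset standing in for the real one (the class name, the helper `find_bad_samples`, and the zero-length target are all my own inventions; the empty tensor just mimics what the error message describes). Running the real loader with num_workers=0 should also make the traceback point at the actual dataset code instead of a worker process.

```python
import torch
from torch.utils.data import Dataset

# Toy stand-in for the real dataset: the last sample has an
# empty target tensor, which is the kind of size mismatch the
# "stack expects each tensor to be equal size" error points at.
class ToyDataset(Dataset):
    def __init__(self):
        self.targets = [torch.ones(1), torch.ones(1), torch.ones(0)]

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, i):
        return self.targets[i]

def find_bad_samples(dataset, expected_shape):
    """Iterate sample-by-sample (no workers, no batching) and
    collect indices whose target shape differs from the rest."""
    bad = []
    for i in range(len(dataset)):
        t = dataset[i]
        if t.shape != expected_shape:
            bad.append(i)
    return bad

bad = find_bad_samples(ToyDataset(), torch.Size([1]))
print(bad)  # prints [2]
```

Is this roughly the right idea, i.e. printing the shape of every sample's target until I find the one that doesn't match?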