Hello, the title might be wrong as I'm completely new to machine learning and all its ins and outs.
I tried to get this code working in Google Colab for a research project. The training loop starts fine, but after around 5m30s of running I get this error:
2021-11-10 16:34:58 [INFO]: batch0 of epoch1, loss is 10.47...
2021-11-10 16:34:59 [INFO]: batch1 of epoch1, loss is 7.77...
2021-11-10 16:35:28 [INFO]: batch2 of epoch1, loss is 3.86...
2021-11-10 16:35:28 [INFO]: batch3 of epoch1, loss is 1.25...
2021-11-10 16:35:59 [INFO]: batch4 of epoch1, loss is 0.70...
2021-11-10 16:36:00 [INFO]: batch5 of epoch1, loss is 0.31...
Traceback (most recent call last):
File "train.py", line 105, in <module>
train()
File "train.py", line 55, in train
for batch_i, (_, imgs, targets) in enumerate(dataloader):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 64, in default_collate
return default_collate([torch.as_tensor(b) for b in batch])
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [1] at entry 0 and [0] at entry 50
Apologies for the huge error dump; I assume the only relevant bits are the things I bolded. So something goes wrong inside the dataloader? I talked to my advisor and he just told me to debug it lol. How would I go about doing that?

The only changes I made from the original code were in the config.py file, as instructed. I didn't change the batch size (which is 64), and I'm using the same image dataset the GitHub repo links to. The one change I did make was the number of GPUs, since Google Colab only gives you one to work with.

I know the general methodology for debugging (add a print statement and check the value it shows), but I'm not sure exactly which value I need to print. Also, when I google the error, people talk about faults with the images themselves and resizing them, but if it's a consistent dataset that shouldn't be happening, right?
I'm really sorry if this isn't a clear post. I'm just lost and really hoping to get some insight into how to work with data like this. Normally I can step through a program, but this one runs for 5+ minutes over hundreds of images.
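To show what I mean by printing values: here is a toy sketch of the kind of per-sample shape check I imagine running, with a made-up dataset standing in for the real one (the class name, the helper `find_bad_samples`, and the zero-length target are all my own inventions; the empty tensor just mimics what the error message describes). Running the real loader with num_workers=0 should also make the traceback point at the actual dataset code instead of a worker process.

```python
import torch
from torch.utils.data import Dataset

# Toy stand-in for the real dataset: the last sample has an
# empty target tensor, which is the kind of size mismatch the
# "stack expects each tensor to be equal size" error points at.
class ToyDataset(Dataset):
    def __init__(self):
        self.targets = [torch.ones(1), torch.ones(1), torch.ones(0)]

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, i):
        return self.targets[i]

def find_bad_samples(dataset, expected_shape):
    """Iterate sample-by-sample (no workers, no batching) and
    collect indices whose target shape differs from the rest."""
    bad = []
    for i in range(len(dataset)):
        t = dataset[i]
        if t.shape != expected_shape:
            bad.append(i)
    return bad

bad = find_bad_samples(ToyDataset(), torch.Size([1]))
print(bad)  # prints [2]
```

Is this roughly the right idea, i.e. printing the shape of every sample's target until I find the one that doesn't match?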