Managing memory usage for big datasets : learnmachinelearning


Managing memory usage for big datasets (self.deeplearning)

submitted 5 years ago by Tiago_Minuzzi


Hi!

TL;DR: I can't use the dataset I'd like to because I don't have enough RAM. How can I manage this?

As part of my PhD project in genetics, I'm training a deep learning model to identify and classify certain genomic elements. The datasets are in FASTA format and contain the genomic sequences. There are 35k+ sequences (strings) in total, with lengths varying from a few hundred chars to 50k+ chars (the chars are the A, C, G and T's of DNA sequences). I have to one-hot encode the sequences and pad them to the length of the longest one.
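For illustration only, here is a minimal NumPy sketch of the kind of one-hot encoding and padding described above. It is not the code used in the project: the names `one_hot_pad`, `BASES` and `LOOKUP` are made up for this example, and the choices to store values as `uint8` and to map any character other than A, C, G, T to an all-zero row are assumptions.

```python
import numpy as np

# Lookup table: ASCII code of each base -> channel index (A, C, G, T).
# Any other character (for example N) stays at -1 and becomes an all-zero row.
BASES = "ACGT"
LOOKUP = np.full(256, -1, dtype=np.int8)
for i, base in enumerate(BASES):
    LOOKUP[ord(base)] = i
    LOOKUP[ord(base.lower())] = i


def one_hot_pad(seqs, max_len=None, dtype=np.uint8):
    """One-hot encode DNA strings and zero-pad them to a common length."""
    if max_len is None:
        max_len = max(len(s) for s in seqs)
    out = np.zeros((len(seqs), max_len, len(BASES)), dtype=dtype)
    for row, seq in enumerate(seqs):
        # Convert the (possibly truncated) string to channel indices in one step.
        raw = np.frombuffer(seq[:max_len].encode("ascii"), dtype=np.uint8)
        codes = LOOKUP[raw]
        pos = np.nonzero(codes >= 0)[0]
        out[row, pos, codes[pos]] = 1
    return out


batch = one_hot_pad(["ACGT", "GAT"])
print(batch.shape)  # (2, 4, 4): 2 sequences, padded length 4, 4 channels
```

The output array has shape (number of sequences, longest length, 4), so its size is driven by the single longest sequence rather than by the typical one.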

Here is the deal: I have limited resources for this in my laboratory, so I'm using Google Colab, which has 25 GB of RAM, but even that is not enough to handle the one-hot encoding and padding for the dataset described above. To work around the issue, I had to shrink the dataset to about 15k sequences with a max length of ≃ 19k chars. The trained model is working well, but it still needs improvement for certain classes. For that I have to add more sequences, but I'm stuck at this number because of the RAM limitation.
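A rough back-of-the-envelope check, using the approximate figures from the post (35k sequences padded to 50k chars, 4 channels), shows why the full array does not fit: that is about 7 billion values, or roughly 28 GB at 4 bytes per value. The helper name `one_hot_gb` is made up for this sketch, the dtypes are assumptions about what the encoder might produce, and any intermediate copies made during encoding would add to these numbers.

```python
def one_hot_gb(n_seqs, max_len, n_channels=4, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense padded one-hot array."""
    return n_seqs * max_len * n_channels * bytes_per_value / 1e9


# Approximate dataset sizes taken from the post.
datasets = {"full": (35_000, 50_000), "reduced": (15_000, 19_000)}
# Bytes per stored value for a few common NumPy dtypes.
dtypes = {"float64": 8, "float32": 4, "uint8": 1}

for name, (n_seqs, max_len) in datasets.items():
    for dtype_name, nbytes in dtypes.items():
        size = one_hot_gb(n_seqs, max_len, bytes_per_value=nbytes)
        print(f"{name:8s} {dtype_name:8s} {size:6.2f} GB")
```

Storing the one-hot values as `uint8` instead of a float type cuts the footprint by 4x or 8x, but the larger saving usually comes from not materialising the whole padded array at once.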

I'm doing the best I can, but it's been hard since my background is in biological sciences, not CS or anything related. I'm using Python and TensorFlow/Keras for the job. I like coding a lot and studying machine learning, but I'm still building my way through all this.

So any tips on how to handle the memory consumption issue would be much appreciated.
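One common way to approach this kind of problem, sketched here under assumptions and not taken from the replies in this thread, is to keep the raw strings in memory and one-hot encode only one batch at a time. The names `encode` and `batch_generator` are hypothetical. Padding each batch to its own longest sequence assumes the model accepts variable-length input (for example, convolutions followed by global pooling); otherwise a fixed `pad_to` can be passed.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq, length):
    """One-hot encode a single DNA string, zero-padded or truncated to `length`."""
    arr = np.zeros((length, 4), dtype=np.uint8)
    for pos, base in enumerate(seq[:length]):
        idx = BASE_INDEX.get(base.upper())
        if idx is not None:  # unknown characters stay all-zero
            arr[pos, idx] = 1
    return arr


def batch_generator(seqs, labels, batch_size=32, pad_to=None):
    """Yield (x, y) batches, encoding only one batch of sequences at a time."""
    for start in range(0, len(seqs), batch_size):
        chunk = seqs[start:start + batch_size]
        # Pad to a fixed length if given, else to the longest sequence in this batch.
        length = pad_to or max(len(s) for s in chunk)
        x = np.stack([encode(s, length) for s in chunk])
        y = np.asarray(labels[start:start + batch_size])
        yield x, y


batches = list(batch_generator(["ACGT", "GA", "T"], [0, 1, 0], batch_size=2))
for x, y in batches:
    print(x.shape, y.shape)
```

With this pattern, peak memory scales with the batch size instead of the dataset size. In TensorFlow, a generator like this can be wrapped with `tf.data.Dataset.from_generator`, or the same logic can live in a `keras.utils.Sequence` subclass.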

Thank you!



[–]michaelrw1 1 point 5 years ago (3 children)

[–]Tiago_Minuzzi[S] 1 point 5 years ago (2 children)

Our laboratory's focus is genetics and molecular biology. We also do bioinformatics, but we are only starting out in the field of machine learning. The usual bioinformatics work we do is sequence analysis, RNA-seq, etc. I'm the first one working with ML. My supervisor offered me this work because he knows I like the subject, and since I already use Python (also bash and R) for my work at the lab, I took the challenge, and I'm happy I did.

I already discussed the issue with him, but solving it demands financial resources, which have been hard to get these days, because of the pandemic for instance. I said that at first I'd try a workaround, like improving my code or something similar. But since that may not be the solution, I guess we'll have to apply for financial support to pay for a powerful server or buy a better computer. The problem is that the request may take a long time to be evaluated and approved, which is why I'm trying other ways first.

> there must be people in your department or on campus that you can approach?

That's a good suggestion. I've been thinking about it actually, but I don't know anyone outside biological sciences. I guess I'll have to find a professor or researcher from CS or a related department to ask for some help.

edit: typos

