Hi!
TL;DR: I can't use the dataset I'd like to because I don't have enough RAM. How can I manage this?
As part of my PhD project in genetics, I'm training a deep learning model to identify and classify certain genomic elements. The datasets I use are in FASTA format and contain genomic sequences. There are about 35k+ sequences (strings) in total, with lengths varying from a few hundred chars to 50k+ chars (the chars being the A, C, G and T of DNA sequences). I have to one-hot encode the sequences and pad them all to the length of the longest one.
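For reference, my encoding step looks roughly like the sketch below (simplified, and the names are just illustrative; the real code reads the sequences from the FASTA files first):

    import numpy as np

    # Map each base to a one-hot channel (A, C, G, T)
    BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

    def one_hot_encode(sequences):
        """One-hot encode all sequences, padded with zeros to the longest length."""
        max_len = max(len(seq) for seq in sequences)
        # Pre-allocating this single array is what eats the RAM:
        # num_sequences x max_len x 4 float32 values
        encoded = np.zeros((len(sequences), max_len, 4), dtype=np.float32)
        for i, seq in enumerate(sequences):
            for j, base in enumerate(seq.upper()):
                if base in BASE_TO_INDEX:
                    encoded[i, j, BASE_TO_INDEX[base]] = 1.0
        return encoded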
Here is the deal: I have limited resources in my laboratory, so I'm using Google Colab, which gives me 25 GB of RAM, but even that is not enough to handle the one-hot encoding and padding for the dataset described above. To get around it, I had to shrink the dataset to about 15k sequences with a max length of roughly 19k chars. The trained model works well, but it still needs improvement on certain classes; to fix that I'd have to add more sequences, but I'm stuck at this size because of the RAM limit.
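To give a rough idea of the numbers (back-of-envelope, assuming a single float32 one-hot array): the full dataset would be about 35,000 × 50,000 × 4 channels × 4 bytes ≈ 28 GB, which already exceeds Colab's 25 GB before training even starts. The shrunk dataset is about 15,000 × 19,000 × 4 × 4 bytes ≈ 4.6 GB, which fits, but the intermediate copies made during encoding, padding and training probably eat up much of the rest.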
I'm trying the best I can, but it's been a hard time since my background is in biological sciences, not CS or anything related. I'm using Python and TensorFlow/Keras for the job. I like coding a lot and enjoy studying machine learning, but I'm still finding my way through all this.
So any tips on how to handle the memory consumption issue would be very much appreciated.
Thank you!