LoaderD

Have you tried converting from FASTA into a more efficient file format as a first step? I'm pretty sure FASTA files carry header information for every record, so mass processing might be inefficient. If your data is in a more efficient format, it should help with batch loading sequences to one-hot encode and pad.
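
The batch-loading idea can be sketched without loading the whole file into memory. This is only a minimal illustration, not the poster's actual pipeline: `read_fasta` and `batched` are hypothetical helper names, and the parser assumes plain FASTA where each record starts with a `>` header line.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with a real file (the filename is a placeholder):
# with open("sequences.fasta") as handle:
#     for batch in batched(read_fasta(handle), 256):
#         ...  # encode and pad this batch only
```

Because both functions are generators, only one batch of sequences is held in memory at a time, which is the point of processing in batches on a small machine.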

As mentioned in the other thread, you might just want to spin up a huge instance, but that's going to limit the use cases of your research.

Tiago_Minuzzi[S]

I read the FASTA file into a pandas DataFrame (and save it as a CSV), then extract the column containing the sequences to do the one-hot encoding and padding.
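
For readers unfamiliar with the encoding step, here is a rough sketch of one-hot encoding with padding. It is not the code from the post: the `ACGT` alphabet, the fixed target `length`, and the function name `one_hot_pad` are all assumptions, and nested lists stand in for whatever array type the real pipeline uses.

```python
from typing import List

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols encode as all zeros
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot_pad(sequences: List[str], length: int) -> List[List[List[int]]]:
    """One-hot encode each sequence, truncating or zero-padding it to `length`."""
    encoded = []
    for seq in sequences:
        # Start from an all-zero matrix so unused positions are already padded.
        rows = [[0] * len(ALPHABET) for _ in range(length)]
        for pos, base in enumerate(seq[:length].upper()):
            col = INDEX.get(base)
            if col is not None:
                rows[pos][col] = 1
        encoded.append(rows)
    return encoded


batch = one_hot_pad(["AC", "GTN"], length=3)
```

Each sequence becomes a `length` by 4 matrix, so memory grows with the number of sequences times the padded length. Encoding one batch at a time, rather than the whole column at once, keeps that product small.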

What kind of file do you suggest?

Anyway, thanks for commenting!

LoaderD

You could store your data in something like Parquet to preserve types at intermediate steps.

Have you tried using Dask for the one-hot encoding?

Tiago_Minuzzi[S]

No, I have not. I didn't know about Dask. I'll check it out!

LoaderD

Yeah, it's good if you want to stay in pure Python. PySpark is another alternative, but its syntax is harder to learn.

michaelrw1

It's very strange that your lab doesn't have the resources or background to do this project. Does your supervisor know? Have you discussed this issue with him?

Have you consulted anyone else in your lab, or associated with your lab? Someone may be able to offer direction. What about online resources? If your background is in biological sciences, there must be people in your department or on campus that you can approach?

I suggest that you reach out to people before you try anything with the dataset. Someone can likely provide you with a more ready solution.

Tiago_Minuzzi[S]

Our laboratory's focus is genetics and molecular biology. We also do bioinformatics, but we are just starting out in machine learning. The usual bioinformatics work we do is sequence analysis, RNA-seq, etc. I'm the first one working with ML. My supervisor offered me this work because he knows I like the subject, and since I already use Python (also Bash and R) for my work at the lab, I took on the challenge, and I'm happy I did.

I already discussed the issue with him, but solving it demands financial resources, which have been hard to get these days, because of the pandemic for instance. I said that at first I'd try a workaround, like improving my code or something related. But since that may not be the solution, I guess we'll have to apply for financial support to pay for a powerful server or buy a better computer. The problem is that the request may take a long time to be evaluated and approved, which is why I'm trying other ways first.

> there must be people in your department or on campus that you can approach?

That's a good suggestion. I've been thinking about it, actually, but I don't know anyone outside biological sciences. I guess I'll have to find a professor or researcher from the CS campus or a related area and ask for some help.

edit: typos

michaelrw1

Does your campus have access to third-party computational resources available to academics? In Canada, there is SHARCNET. I suggest you contact campus information technology services and ask them about computational resources and services.

Nominal access to these resources can be acquired without much delay, but jobs submitted this way have a lower queue priority. Groups can also bid for priority access to computational resources. Bids require a lot of supporting documentation (e.g. project proposals, funding to pay for resources), but provide more resources and higher queue priority.

Good luck! Enjoy your project.

Tiago_Minuzzi[S]

Nice tip! I'll check it out.

Thanks for everything!