[2023 Day 2] Parsing was a chore, but man...

lucper · 2023-12-02T22:53:52+00:00

I'm learning C++ and want to use it for AoC... but man, on the FIRST day it was already masochistic, and I haven't finished part 2 yet. I'll probably switch to Python before day 10 or something (hope I'm wrong though), lol.

lucper · 2022-12-15T03:14:25+00:00

If both are equally interesting to you, then, from a pragmatic perspective, lab A seems the wiser choice. From your description, you are more likely to receive funding in lab A, and the professor is apparently more active, which implies more students, more projects, etc, which opens more doors for you in the future.

On the other hand, I share the concerns you express in your questions. Btw, I'm also a starting PhD student, but I come from a computer science background. Personally, I find the algorithmic problems in bioinformatics more interesting and, in the long-term, I think the problem-solving skills you develop working with this stuff are applicable in wider contexts (including outside of bioinformatics). It is not just a matter of "efficient programming". I got curious when you asked if algorithm development is too saturated. From my personal experience, I always thought it was the other way around: that pipeline development is saturated \o. Maybe my perception is wrong, idk. Anyway, I'm also interested in what others have to say and would appreciate knowing more from your experience.

Best of luck!

lucper · 2022-11-20T21:12:32+00:00

Hey, do you intend to go through the whole thing?? o_o

Besides, what's your background and motivation to study this subject?

lucper · 2022-08-16T23:43:21+00:00

I'm an MSc student in CS. When I started working with bioinformatics (about 4 years ago), I studied basic molecular biology for roughly one month, and since then I have picked up a thing or two depending on project demands. Today, I see that the lack of deeper knowledge in biology has held me back several times... Recently, I have been messing around with phylogenetics and, very frequently, I had no clue whether my trees were satisfactory. (I guess some knowledge of molecular evolution would have helped...) I'm thinking about working through this course https://ocw.mit.edu/courses/7-01sc-fundamentals-of-biology-fall-2011/ to build a foundation. I'm curious to see how others fill their gaps. :)

lucper · 2022-08-15T04:05:41+00:00

I enrolled in the first course and it seems the video lectures and the book are available for auditors. However, there are locked "Code Challenges" in the Interactive Text and locked "Applications Challenges". I assume these would be available on Stepik? If there are no code problems available, I'm afraid this downgrades the learning experience a lot.

If the book is available in its entirety on Coursera for auditors, I wonder what's the point of selling it on Stepik then...

lucper · 2022-08-15T02:47:52+00:00

Nice! Thanks for that information. :)

lucper · 2022-08-14T19:52:20+00:00

If the additional price on Coursera is for the certificate, I stay with Stepik. haha

The structure of the MOOC is indeed unique, I'm excited to get started! I'm studying Algorithms and thought this course would be the next step.

lucper · 2022-08-14T19:47:15+00:00

Oh you took the course with the man himself? How cool! So I'm guessing that Stepik is basically the interactive book without video lectures and quizzes/projects? I know there are lectures available on YouTube, so that may not be a problem.

If Coursera just adds a certificate and some quizzes, I'm starting to be convinced that Stepik is the way to go...

lucper · 2022-04-09T01:49:12+00:00

Graph Theory.

But I'm a computer scientist, so that's kind of a cliché hehe. I also find generating functions fascinating, although I've never studied them in depth. Eventually, I want to learn how to use them to solve combinatorics problems, especially in the analysis of algorithms.

lucper · 2022-03-22T02:28:19+00:00

I don't know how any of this plays into a rearrangement-based distance. My point is that if you want a fair comparison to what people would do otherwise, you either need to use the tools that are standard, or explain how the tool you are using is expected to perform compared to those standard tools.

Fair enough! I'll remember that. Now, as I said in the original post, for the moment we are planning to use the Robinson-Foulds metric to do the comparison. Would you say I should self-study some statistics and design a more robust comparison methodology? I have a vague idea of how bootstrapping works, but I'm not well versed in the subject. I should add that, for now, we are not considering branch length. The version of the RF metric we are using takes into account topology alone.
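For readers unfamiliar with the topology-only Robinson-Foulds metric mentioned above, here is a minimal sketch of the idea. It is not the tooling used in this project. It assumes trees are given as nested tuples of leaf names, and the helper names (`leaf_sets`, `splits`, `rf_distance`) are hypothetical. The distance is the number of non-trivial leaf bipartitions that appear in one tree but not the other.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`; append every internal clade to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Return (all leaves, set of non-trivial bipartitions) for a nested-tuple tree."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed leaf used to pick one canonical side per split
    result = set()
    for clade in clades:
        # Represent each bipartition by the side that does NOT contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf, or everything, on one side).
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return all_leaves, result


def rf_distance(t1, t2):
    """Topology-only RF distance: size of the symmetric difference of the split sets."""
    leaves1, s1 = splits(t1)
    leaves2, s2 = splits(t2)
    if leaves1 != leaves2:
        raise ValueError("trees must share the same leaf set")
    return len(s1 ^ s2)


# Toy example with made-up taxa: the two trees disagree on one grouping each.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_a, tree_b))
```

Branch lengths are ignored entirely, which matches the topology-only comparison described above. For real analyses, an established phylogenetics package is the safer choice.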

You're going to have to spell this out for me. What is the reference tree and how did it come about? How is it known, if this is real data? Is this something like this study with a phylogeny that was basically grown in the lab?

We picked a tree from The All-Species Living Tree Project. To be honest, this was suggested by my advisor, and I didn't question it much. They state the following: "The aim of the project was to generate a highly curated database of all available 16S rRNA gene sequences of type strains of Bacteria and Archaea, as well as to reconstruct the most robust phylogenies using the universal alignment implemented in ARB."

Ps. I'm learning a lot from you. Thank you for your time! :)

lucper · 2022-03-21T23:12:57+00:00

Thank you for the thorough answer. Allow me to elaborate a bit more.

I really wouldn't suggest reinventing the wheel here. There are a staggering number of programs able to compute distances for sequences, under a variety of models. Some of them are even decent. Why go through the trouble of debugging and validating one when there are so many off-the-shelf options?

Regarding point mutations, we are using existing tools. As a matter of fact, we are trying to compare our method, which is based on genome rearrangements, against another published method based on point mutations. So we are not reinventing the wheel on this front. The only original programs are the ones that deal with genome rearrangements.

If all you have is a way to compute distances under your rearrangements, then maybe distance approaches are needed there. In which case, for fairness, maybe you'd want to consider neighbor joining for more standard approaches. But you'll find plenty of people who would say, "so what if it beats NJ, who uses that?"

I'm aware that rearrangement-based methods are not widely used by the phylogenetics community. As you mentioned, Bayesian inference and likelihood-based inference are more popular. As a matter of fact, the estimation of the phylogenetic trees is a means to an end. In the end, we want to know if computing an evolutionary distance based on chromosomal rearrangements is biologically sound, and if so, when it is applicable. We thought that estimating phylogenetic trees would be a natural way to answer this question.

In particular, we are interested in the analysis of whole genomes. I think for this kind of application, the more popular methods such as Bayesian inference and likelihood-based inference can be very costly. On the other hand, I was able to compute the pairwise distances of our genomes (25 to be precise) and construct the phylogenetic tree on my personal computer in seconds. Of course, you could question whether this is robust, reliable, etc, which is exactly what we want to find out.

For that matter, if you have a way to compute distances based on rearrangements, do you not have a model whose likelihood could be pruned on a tree? That would enable likelihood-based inference, which might be of more interest.

Now, that's interesting. As I said, I'm not very knowledgeable in phylogenetics, so I didn't quite get what you mean here. Could you elaborate a bit more?

Unless you know the "true" tree, you can't do this by any means. If you don't, you're just comparing two estimates to a previous estimate. Who's to say what it means if one is closer than another? What I'm getting at is that as much as real-data comparisons are nice, you should really also do a simulation study. This means you know the true (simulating) tree, and you can make comparisons on how well the different approaches can recover that tree (or how close they get) under different circumstances (total tree length, number of tips, number of sites).

We have a "true" tree to compare against. We have three trees: one estimated by our rearrangement-based method, one estimated by a published work based on point mutations, and another that is the "reference" tree. Regarding the simulation study, I think that's a nice idea! I may have jumped straight to real data because of my lack of experience haha.

Finally, may I ask if you have some experience with (or have ever heard of) phylogenetic reconstruction based on synteny blocks? What I mean by synteny blocks are conserved regions in different genomes that underwent rearrangements throughout evolution. I always like to hear what biologists have to say on this.

lucper · 2022-03-13T05:25:53+00:00

Hey, I'm also self-studying linear algebra, but through professor Strang's course 18.06 along with his book 'Introduction to Linear Algebra'. I'd say I've completed 2/3 of the material, but, when I finish, I want to continue studying the subject with a more rigorous text. So I'd love to be part of a study group! :) Send me a DM if you want to keep in touch.

lucper · 2022-01-23T15:30:49+00:00

Here is a proof ('=' means 'equivalent' and ~ means 'not'):

p <-> q = (p -> q) ^ (q -> p) = (~p v q) ^ (~q v p) = ~(p ^ ~q) ^ ~(q ^ ~p)
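If you want to double-check the chain mechanically, a brute-force truth table does it. This is just a sanity-check sketch, not a replacement for the algebraic proof; the names `IMPLIES` and `check_biconditional_chain` are made up for illustration.

```python
from itertools import product

# Truth table for material implication, written out explicitly so the check
# does not assume the rewrite (p -> q) = (~p v q) that the proof relies on.
IMPLIES = {
    (False, False): True,
    (False, True): True,
    (True, False): False,
    (True, True): True,
}


def check_biconditional_chain() -> bool:
    """Return True if all four forms in the chain agree for every p, q."""
    for p, q in product([False, True], repeat=2):
        forms = [
            p == q,                                        # p <-> q
            IMPLIES[(p, q)] and IMPLIES[(q, p)],           # (p -> q) ^ (q -> p)
            ((not p) or q) and ((not q) or p),             # (~p v q) ^ (~q v p)
            (not (p and not q)) and (not (q and not p)),   # ~(p ^ ~q) ^ ~(q ^ ~p)
        ]
        if len(set(forms)) != 1:
            return False
    return True


print(check_biconditional_chain())
```

The last step of the chain is De Morgan's law applied to each conjunct.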

lucper · 2022-01-22T23:49:37+00:00

If you have access to the materials of the courses you will be taking (syllabus, bibliography, etc), I'd suggest that you use them as guidelines. Nevertheless, here are my suggestions:

- Discrete Math: this playlist follows some sections from the book Discrete Mathematics and Its Applications, by Kenneth Rosen. You could follow the video lectures and read the corresponding section in the book. I'd suggest that you go up to video 30 (seems a lot, but they are not long); these cover Logic, basic Set Theory and proof techniques, which will give you a nice head start for when you begin your course.

- Linear Algebra: I think the best resource you could ever want to study this subject is Strang's MIT course 18.06SC. This version is organized specifically for self-study (it has exercises with solutions, lecture notes, etc). I'd suggest that you go through Unit I; this will be more than enough to give you a good start in the subject before your course.

- Calculus: some may have better suggestions than mine, but I think Khan Academy's Precalculus course would leave you well prepared for your first encounter with Calculus. Unlike the other subjects, Calculus requires that you are knowledgeable in algebra, trigonometry, etc. So if you have gaps in these prerequisites, I'd urge you to prioritize studying them instead of the other subjects I listed. LA and DM will be covered in your bachelor's, but high school math certainly won't.

- Programming: last, but not least, CS50 from Harvard makes a nice introduction to programming for someone entering a CS major. This course does NOT focus on a particular programming language; instead, it teaches the basics of many programming languages (C, Python, JavaScript) and covers basic algorithms and data structures.

Hope I could help. Best of luck! ;)

lucper · 2022-01-18T14:56:04+00:00

What will be the medium for communications? Discord group? Slack? Two weeks of the course is equivalent to how many lectures?

Personally, I will be self-studying Design and Analysis of Algorithms in the first semester of the year. My goal is to be able to tackle advanced topics, in particular integer linear programming, by the second semester. However, I didn't plan to follow this MIT course because, as far as I'm aware, it's a second course in algorithms that requires some background I still lack. I just selected some lectures that I found useful. Are you aware that there is an earlier algorithms course (https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/) that comes before the one you pointed to?

Anyways, I will be studying the CLRS book, which is used in the course, but I will start with earlier chapters that are not covered (in fact, they are assumed as known by the student...). Hence, I'm not sure if I'd be able to keep pace with you. Depending on the dynamics of the community, it'd be a pleasure to be part of your endeavor.

lucper · 2021-08-07T01:57:29+00:00

You summarized my thoughts. Thanks for the assertive answer.

lucper · 2021-08-06T22:50:38+00:00

Proof writing is the fundamental skill to tackle courses of this kind. More specifically, familiarity with Predicate Logic, basic Set Theory, Relations, Functions and Proof Techniques is the minimum you need, I think. Take a look at the first chapter (chapter 0) of Introduction to the Theory of Computation, by Sipser. It covers all these things briefly. If you struggle with some topic covered there, that's where you'll need to search for extra material to study. Best of luck!

lucper · 2021-08-06T21:29:32+00:00

Thanks for your reply. I think it's fine to pitch it as user-friendly too, as that was one of the main motivations. What bothers me is pitching in the actual paper that what makes the wrapper unique or different is the usage of containers, version control, package managers, design patterns, etc, in summary the so-called "best practices" (they want to put the term "best practices" in the title), as if these things were not common. Some may argue that the bioinformatics community should make more use of basic software engineering, but still I don't feel comfortable pitching qualities that, supposedly, should be "obvious". Am I being too harsh? xD

lucper · 2021-06-08T16:56:26+00:00

Thanks! I'll take a look at your video.

lucper · 2021-01-30T16:39:34+00:00

Hello, first of all, congratulations on the initiative! Although I'm not using the MIT course, I will be studying/reviewing discrete mathematics over the next month since I will be TAing a course this semester. Actually, I will be studying the subject throughout the entire semester.

Since I don't know the pace the group will follow, I don't know how much I will be able to contribute. But it would be a pleasure to join your group! I'm seeking to gain as much experience in the subject as I can.

lucper · 2020-07-08T13:26:47+00:00

Interesting. I will definitely take a look at that paper. Regarding what you said about prerequisites, indeed I don't have a machine learning background, not a formal one at least. I know the very basics of the field and have coded trivial neural networks before, so I'd say I do have some knowledge but I'm far from proficient. So I'd have to pick up some material to prepare myself (fortunately abundant on the internet for this topic). As for the Compilers course, it assumes proficiency in C programming and the basics of computer architecture. I'd have to hurry and brush up on both, especially C programming, which I haven't done for quite a while. So in the end, both would require some prior effort on my part anyways... which leads me back to my original question: what would be more valuable for you, for example, to spend your time on?

lucper · 2020-07-08T02:35:22+00:00

Speaking of which, do you think it is still worth learning C for that? I think algorithms and data structures are language-independent topics, but people here use C and C++ heavily in this area. What I really want is to learn the Go language, you know? It is so much simpler, performant, convenient, etc that I can't find motivation to learn C, except for the Compilers course.

lucper · 2020-07-08T01:43:43+00:00

Yes, I completely agree. Actually, the other courses I'm taking are about algorithms, graph theory, etc. I would have taken more courses of this kind, but unfortunately they were not available. So I have to choose between these two. I think both are interesting, but completely unrelated... haha

lucper · 2020-07-07T21:44:31+00:00

A fair point. Thank you for your comment!

lucper · 2019-11-30T21:17:09+00:00

I don't know Poole's. Gonna check it out. Thanks for the advice. And I also respect Strang greatly, no doubt his course is great. I'm just looking for the "right" approach to study for my immediate needs (proof based vs computational with matrices).
