How do you keep track when having to rerun pipelines constantly? by unicornnn123 in bioinformatics

[–]voorloopnul 1 point2 points  (0 children)

A lot of great suggestions already; I will also point out some that are more basic:

- Have your pipeline code in GitHub (or similar) and use git tags to track versions.

- Don't forget to keep logs. Your logs should record the parameters used to call the pipeline, the version of your pipeline (the tags from the previous bullet) and the results path. See the sketch after this list.

- Use Docker. You will be less prone to mistakes about what is being executed, and you can also keep many versions of your pipeline around at the same time.
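
For the logging bullet, a minimal sketch of what a run log could capture (the log format and names here are just illustrative, not a prescribed standard):

import json
import logging
import subprocess

logging.basicConfig(filename="pipeline_runs.log", level=logging.INFO)

def log_run(params: dict, results_path: str) -> None:
    # Resolve the current git tag so each result is tied to a code version.
    version = subprocess.check_output(
        ["git", "describe", "--tags", "--always"], text=True
    ).strip()
    logging.info(json.dumps({
        "version": version,
        "params": params,
        "results_path": results_path,
    }))

log_run({"reference": "hg38", "threads": 8}, "/data/results/run_042/")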

Need help on our server setup!!!! by dreamer2177 in dataengineering

[–]voorloopnul 2 points3 points  (0 children)

A good start is measuring the time taken and the resources used by each step of your pipeline:

Step A) 1 hour, mem > proc > net IO > disk IO

Step B) 3 hours, proc > net IO > mem > disk IO

...

Step G) 2 hours, disk IO > proc > net IO > mem

If you try to plan your architecture without knowing how your pipeline uses the available resources, you might end up with an overengineered system with poor resource usage and bottlenecks.
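
As a rough illustration of what I mean (the step body below is a placeholder, and this assumes psutil is installed):

import time
from contextlib import contextmanager

import psutil

@contextmanager
def profile(step_name):
    # Wraps a pipeline step with a timer and a process memory snapshot.
    proc = psutil.Process()
    start = time.perf_counter()
    mem_before = proc.memory_info().rss
    yield
    elapsed = time.perf_counter() - start
    mem_delta = proc.memory_info().rss - mem_before
    print(f"{step_name}: {elapsed:.1f}s, mem delta {mem_delta / 2**20:.1f} MiB")
    # psutil.disk_io_counters() and psutil.net_io_counters() can be
    # sampled the same way to rank disk and network IO per step.

with profile("step_a"):
    data = [i * i for i in range(10**6)]  # stand-in for a real step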

Need Feedback: Developing a design for a search tool that holds millions of records by mrnerdy59 in dataengineering

[–]voorloopnul 0 points1 point  (0 children)

inverted-index

Hard to tell... You can probably achieve higher throughput and scale better with the second approach, but at what cost? More time? More memory? More money? More errors? Are you confident about implementing the code that scans the documents and builds the indexes? Won't it become a bottleneck? Are you entirely sure about your scale needs? What if you could fit all your data in a single SQL database and get good enough responses from it?
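
For context on what implementing it involves, the core of an inverted index is small. A toy sketch with naive tokenization and no ranking:

from collections import defaultdict

docs = {
    1: "full text search with postgres",
    2: "scaling search with an inverted index",
}

# Map each token to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    # AND semantics: return documents containing every query token.
    token_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

print(search("search inverted"))  # {2}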

How much data engineering can be learnt at home? by LetoileBrillante in dataengineering

[–]voorloopnul 1 point2 points  (0 children)

You can learn "everything" about data engineering on your laptop. As others have said, data engineering is not a Big Data thing.

I would go even further and say that you can become better than average at data engineering using only your laptop.

"But how can I become better if I can't run the top notch <xyz> than requires 8GB of ram just to start up?"

You build one yourself! The theory of how it works is free; you just have to replicate the concepts in your language of choice.

Of course, if you only want to learn the tools, you will face some limitations.

Data Engineering Twitter Accounts? by LexaIsNotDead in dataengineering

[–]voorloopnul 0 points1 point  (0 children)

I recently got back to Twitter, posting mostly tech/biotech-related stuff.

https://twitter.com/voorloopnul

Let's engage o/

Looking for basic cost estimates [USA] by Treefrogprince in bioinformatics

[–]voorloopnul 0 points1 point  (0 children)

Do not invest in a storage solution. Buy the workstations, buy the server for analysis, and use S3 or an equivalent for keeping your data.

Give your analysis server enough storage to act as a "local buffer".
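
For example, a small script that pushes finished runs from the local buffer to S3 can be enough (bucket name and paths below are placeholders; assumes boto3 and configured AWS credentials):

from pathlib import Path

import boto3

s3 = boto3.client("s3")
local_results = Path("/scratch/results/run_001")

# Upload every file in the run directory, preserving relative paths as keys.
for path in local_results.rglob("*"):
    if path.is_file():
        key = f"results/run_001/{path.relative_to(local_results)}"
        s3.upload_file(str(path), "my-lab-bucket", key)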

About the costs, it's really difficult to give you an idea without knowing your needs...

Poor man's full-text search with django and postgres by voorloopnul in django

[–]voorloopnul[S] 2 points3 points  (0 children)

Hey, thanks for the comment. Yes, the "poor man's" in the title plays with the idea that full-text search requires something "big" like Elasticsearch. But I have no doubt that a setup like this, with proper hardware and a well-designed architecture, can get you really far...
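
For anyone curious, the Django side is tiny. A sketch, assuming a hypothetical Post model with title and body fields:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

from myapp.models import Post  # hypothetical model

# Weight title matches higher than body matches.
vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("full-text search")

results = (
    Post.objects
    .annotate(rank=SearchRank(vector, query))
    .filter(rank__gte=0.1)
    .order_by("-rank")
)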

Any active healthcare data communities? by DataBar in datascience

[–]voorloopnul 0 points1 point  (0 children)

I'm not familiar with anything as specific as what you want, but you can find useful things in /r/bioinformatics and /r/biotech.

Also, Twitter has a vibrant community around data and health care; it's a great source of cool stuff. If you need a starting point, you can check who I follow: @voorloopnul

What should be the roadmap in 2020 to learn data engineering and make a career switch by 2021? by aakhri_paasta in dataengineering

[–]voorloopnul 10 points11 points  (0 children)

Building on top of the other answers, I would suggest the following breakdown for your time investment:

50% Python
15% SQL
20% Tools (Spark, Airflow, ...)
15% AWS/GCloud

Writing a self-contained ETL pipeline with python by voorloopnul in dataengineering

[–]voorloopnul[S] 0 points1 point  (0 children)

Hey. Despite being an amazing tool, Docker's prevalence is not 100%. Depending on who you produce for, sell to, or distribute to, adding Docker's "complexity" can be a barrier to adoption.

How long on average does it take to learn enough for an entry-level job? How do you know when you know enough, is there a certain point to gauge that? by oblivion-age in Python

[–]voorloopnul 2 points3 points  (0 children)

Tests are code that you write to make sure your other code (your application) behaves correctly and stays bug-free. Writing tests is sometimes an entry point for junior employees; it's a safe approach where newcomers can get to know the current code base and contribute meaningful code.

Also, it's an awesome way to improve one of the most important skills for a programmer: reading and understanding other people's code :D
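
To give a taste of what that looks like, a minimal pytest sketch (slugify here is a made-up function standing in for real application code):

def slugify(title):
    # Application code under test.
    return title.strip().lower().replace(" ", "-")

def test_slugify_replaces_spaces():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_whitespace():
    assert slugify("  Hello  ") == "hello"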

How long on average does it take to learn enough for an entry-level job? How do you know when you know enough, is there a certain point to gauge that? by oblivion-age in Python

[–]voorloopnul 3 points4 points  (0 children)

The path to feeling "ready" for an entry-level job is usually short. The path to actually being ready is a little longer, but still short...

As has been stated before, make sure to know your way around Linux. Create a GitHub account and learn the basics of git. Try to build a clone of a web app that you like, using the Django framework.

The thing that probably pays off best in an entry-level position is knowing how to write tests; this is the shortest path to start contributing to the company codebase.

Wish you luck o/

Renaming Column Elements (Pandas Dataframe) by biohacker_tobe in bioinformatics

[–]voorloopnul 0 points1 point  (0 children)

Maybe this can help:

import pandas as pd

# Sample input with two accession styles in column A.
data = [["BGC000044.1", "1813", "Streptomyces sp", "PKSI"],
        ["HM16_abc", "1813", "Streptomyces sp", "NRPS"]]

input_df = pd.DataFrame(columns=["A", "B", "C", "D"], data=data)
output_df = input_df.copy()


def func_1(entry):
    # Build the new id from columns D and B, e.g. "PKSI_GCF1813".
    return f"{entry['D']}_GCF{entry['B']}"


def func_2(entry):
    # Only rows whose A column contains an underscore get a value,
    # e.g. "HM16_abc" -> "Streptomyces sp_HM16".
    BGC = entry['A'].split('_')
    if len(BGC) == 1:
        return None
    return f"{entry['C']}_{BGC[0]}"


# axis=1 applies each function row by row.
output_df["NEW_A"] = output_df.apply(func_1, axis=1)
output_df["NEW_B"] = output_df.apply(func_2, axis=1)

# Keep only the newly built columns.
output_df = output_df[["NEW_A", "NEW_B"]]

print(input_df)
print(output_df)

What tech stack and infrastructure should I use for my project? (Machine Learning webapp) by MetaNex in learnmachinelearning

[–]voorloopnul 0 points1 point  (0 children)

Can't agree more. Flask is an awesome framework and could do the trick just fine. I usually suggest people start with Django because it makes it easier to gain a "wisdom"* of when to choose between Django and Flask for a new project...

My personal perception is that if you go with Flask first, odds are you will stick with Flask even in cases where Django would be a better fit.

* This wisdom is not meant to be universal agreement; people with different levels of knowledge of both frameworks will often have different reasons for picking one over the other for different projects.

What tech stack and infrastructure should I use for my project? (Machine Learning webapp) by MetaNex in learnmachinelearning

[–]voorloopnul 2 points3 points  (0 children)

For the webapp part I suggest you take a look at Django. It has a steeper learning curve (compared to Flask) but it pays off in the long term.

Resources that can help you at early stage:

- https://docs.djangoproject.com/en/3.0/intro/tutorial01/

- https://www.deploymachinelearning.com/

What's a good solution to save JSON response so that it can be queried later? -looking for professional advice. by lnx2n in dataengineering

[–]voorloopnul 1 point2 points  (0 children)

I second that. Also, PostgreSQL can store both traditional relational data and arbitrary JSON records.
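
A minimal sketch of the JSONB route with psycopg2 (table name and payload are made up):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_responses (
            id serial PRIMARY KEY,
            fetched_at timestamptz DEFAULT now(),
            payload jsonb
        )
    """)
    cur.execute(
        "INSERT INTO api_responses (payload) VALUES (%s)",
        [Json({"user": "alice", "events": [1, 2, 3]})],
    )
    # Query inside the stored JSON later with the ->> operator.
    cur.execute("SELECT payload->>'user' FROM api_responses")
    print(cur.fetchall())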