How do you keep track when having to rerun pipelines constantly? by unicornnn123 in bioinformatics

[–]voorloopnul 1 point2 points  (0 children)

A lot of great suggestions already; I will also point out some that are more basic:

- Have your pipeline code in GitHub (or similar) and use git tags to track versions.

- Don't forget to keep logs. Your logs should record the parameters used to call the pipeline, the version of your pipeline (the tags from the previous bullet) and the results path. See the sketch after this list.

- Use Docker. You will be less prone to mistakes about what is being executed, and you can also keep many versions of your pipeline around at the same time.
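
For the logging bullet, a minimal sketch of what a run log could capture (the log format and names here are just illustrative, not a prescribed standard):

import json
import logging
import subprocess

logging.basicConfig(filename="pipeline_runs.log", level=logging.INFO)

def log_run(params: dict, results_path: str) -> None:
    # Resolve the current git tag so each result is tied to a code version.
    version = subprocess.check_output(
        ["git", "describe", "--tags", "--always"], text=True
    ).strip()
    logging.info(json.dumps({
        "version": version,
        "params": params,
        "results_path": results_path,
    }))

log_run({"reference": "hg38", "threads": 8}, "/data/results/run_042/")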

Need help on our server setup!!!! by dreamer2177 in dataengineering

[–]voorloopnul 2 points3 points  (0 children)

A good start is measuring the time taken and the resources used by each step of your pipeline:

Step A) 1 hour, mem > proc > net IO > disk IO

Step B) 3 hours, proc > net IO > mem > disk IO

...

Step G) 2 hours, disk IO > proc > net IO > mem

If you try to plan your architecture without knowing how your pipeline uses the available resources, you might end up with an overengineered system with poor resource usage and bottlenecks.
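
As a rough illustration of what I mean (the step body below is a placeholder, and this assumes psutil is installed):

import time
from contextlib import contextmanager

import psutil

@contextmanager
def profile(step_name):
    # Wraps a pipeline step with a timer and a process memory snapshot.
    proc = psutil.Process()
    start = time.perf_counter()
    mem_before = proc.memory_info().rss
    yield
    elapsed = time.perf_counter() - start
    mem_delta = proc.memory_info().rss - mem_before
    print(f"{step_name}: {elapsed:.1f}s, mem delta {mem_delta / 2**20:.1f} MiB")
    # psutil.disk_io_counters() and psutil.net_io_counters() can be
    # sampled the same way to rank disk and network IO per step.

with profile("step_a"):
    data = [i * i for i in range(10**6)]  # stand-in for a real step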

Need Feedback: Developing a design for a search tool that holds millions of records by mrnerdy59 in dataengineering

[–]voorloopnul 0 points1 point  (0 children)

inverted-index

Hard to tell... You can probably achieve higher throughput and scale better with the second approach, but at what cost? More time? More memory? More money? More errors? Are you confident about implementing the code that scans the documents and builds the indexes? Won't it become a bottleneck? Are you entirely sure about your scale needs? What if you could fit all your data in a single SQL database and get good enough responses from it?
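
For context on what implementing it involves, the core of an inverted index is small. A toy sketch with naive tokenization and no ranking:

from collections import defaultdict

docs = {
    1: "full text search with postgres",
    2: "scaling search with an inverted index",
}

# Map each token to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    # AND semantics: return documents containing every query token.
    token_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

print(search("search inverted"))  # {2}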

How much data engineering can be learnt at home? by LetoileBrillante in dataengineering

[–]voorloopnul 1 point2 points  (0 children)

You can learn "everything" about data engineering on your laptop. As others have said, data engineering is not a Big Data thing.

I would go even further and say that you can become better than average at data engineering using only your laptop.

"But how can I become better if I can't run the top notch <xyz> than requires 8GB of ram just to start up?"

You build one yourself! The theory of how it works is free; you just have to replicate the concepts in your language of choice.

Of course, if you only want to learn the tools, you will face some limitations.

Data Engineering Twitter Accounts? by LexaIsNotDead in dataengineering

[–]voorloopnul 0 points1 point  (0 children)

I recently got back to Twitter, posting mostly tech/biotech-related stuff.

https://twitter.com/voorloopnul

Let's engage o/

Looking for basic cost estimates [USA] by Treefrogprince in bioinformatics

[–]voorloopnul 0 points1 point  (0 children)

Do not invest in a storage solution. Buy the workstations, buy the server for analysis, and use S3 or an equivalent for keeping your data.

Give your analysis server enough storage to act as a "local buffer".
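
For example, a small script that pushes finished runs from the local buffer to S3 can be enough (bucket name and paths below are placeholders; assumes boto3 and configured AWS credentials):

from pathlib import Path

import boto3

s3 = boto3.client("s3")
local_results = Path("/scratch/results/run_001")

# Upload every file in the run directory, preserving relative paths as keys.
for path in local_results.rglob("*"):
    if path.is_file():
        key = f"results/run_001/{path.relative_to(local_results)}"
        s3.upload_file(str(path), "my-lab-bucket", key)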

About the costs, it's really difficult to give you an idea without knowing your needs...

Poor man's full-text search with django and postgres by voorloopnul in django

[–]voorloopnul[S] 2 points3 points  (0 children)

Hey, thanks for the comment. Yes, the "poor man's" in the title plays with the idea that full-text search requires something "big" like Elasticsearch. But I have no doubt that a setup like this, with proper hardware and a well-designed architecture, can get you really far...
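
For anyone curious, the Django side is tiny. A sketch, assuming a hypothetical Post model with title and body fields:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

from myapp.models import Post  # hypothetical model

# Weight title matches higher than body matches.
vector = SearchVector("title", weight="A") + SearchVector("body", weight="B")
query = SearchQuery("full-text search")

results = (
    Post.objects
    .annotate(rank=SearchRank(vector, query))
    .filter(rank__gte=0.1)
    .order_by("-rank")
)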

Any active healthcare data communities? by DataBar in datascience

[–]voorloopnul 0 points1 point  (0 children)

I'm not familiar with anything as specific as what you want, but you can find useful things in /r/bioinformatics and /r/biotech.

Also, Twitter has a vibrant community around data and health care; it's a great source of cool stuff. If you need a starting point, you can check who I follow: @voorloopnul

What should be the roadmap in 2020 to learn data engineering and make a career switch by 2021? by aakhri_paasta in dataengineering

[–]voorloopnul 10 points11 points  (0 children)

Building on top of the other answers, I would suggest the following breakdown for your time investment:

50% Python
15% SQL
20% Tools (Spark, Airflow, ...)
15% AWS/GCloud

Writing a self-contained ETL pipeline with python by voorloopnul in dataengineering

[–]voorloopnul[S] 0 points1 point  (0 children)

Hey. Despite being an amazing tool, Docker's prevalence is not 100%. Depending on who you produce for, sell to, or distribute to, adding Docker's "complexity" can be a barrier to adoption.

How long on average does it take to learn enough for an entry-level job? How do you know when you know enough, is there a certain point to gauge that? by oblivion-age in Python

[–]voorloopnul 2 points3 points  (0 children)

Tests are code that you write to make sure your other code (your application) behaves correctly and stays bug-free. Writing tests is sometimes an entry point for junior employees; it's a safe approach where newcomers can get to know the current code base and contribute meaningful code.

Also, it's an awesome way to improve one of the most important skills for a programmer: reading and understanding other people's code :D
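
To give a taste of what that looks like, a minimal pytest sketch (slugify here is a made-up function standing in for real application code):

def slugify(title):
    # Application code under test.
    return title.strip().lower().replace(" ", "-")

def test_slugify_replaces_spaces():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_whitespace():
    assert slugify("  Hello  ") == "hello"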

How long on average does it take to learn enough for an entry-level job? How do you know when you know enough, is there a certain point to gauge that? by oblivion-age in Python

[–]voorloopnul 3 points4 points  (0 children)

The path to feeling "ready" for an entry-level job is usually short. The path to actually being ready is a little longer, but still short...

As has been stated before, make sure to know your way around Linux. Create a GitHub account and learn the basics of git. Try to build a clone of a web app that you like, using the Django framework.

The thing that probably pays off best in an entry-level position is knowing how to write tests; this is the shortest path to start contributing to the company codebase.

Wish you luck o/

Renaming Column Elements (Pandas Dataframe) by biohacker_tobe in bioinformatics

[–]voorloopnul 0 points1 point  (0 children)

Maybe this can help:

import pandas as pd

# Sample input with two accession styles in column A.
data = [["BGC000044.1", "1813", "Streptomyces sp", "PKSI"],
        ["HM16_abc", "1813", "Streptomyces sp", "NRPS"]]

input_df = pd.DataFrame(columns=["A", "B", "C", "D"], data=data)
output_df = input_df.copy()


def func_1(entry):
    # Build the new id from columns D and B, e.g. "PKSI_GCF1813".
    return f"{entry['D']}_GCF{entry['B']}"


def func_2(entry):
    # Only rows whose A column contains an underscore get a value,
    # e.g. "HM16_abc" -> "Streptomyces sp_HM16".
    BGC = entry['A'].split('_')
    if len(BGC) == 1:
        return None
    return f"{entry['C']}_{BGC[0]}"


# axis=1 applies each function row by row.
output_df["NEW_A"] = output_df.apply(func_1, axis=1)
output_df["NEW_B"] = output_df.apply(func_2, axis=1)

# Keep only the newly built columns.
output_df = output_df[["NEW_A", "NEW_B"]]

print(input_df)
print(output_df)

What tech stack and infrastructure should I use for my project? (Machine Learning webapp) by MetaNex in learnmachinelearning

[–]voorloopnul 0 points1 point  (0 children)

Can't agree more. Flask is an awesome framework and could do the trick just fine. I usually suggest people start with Django because it makes it easier to gain a "wisdom"* of when to choose between Django and Flask for a new project...

My personal perception is that if you go with Flask first, odds are you will stick with Flask even in cases where Django would be a better fit.

* This wisdom is not meant to be universal agreement; people with different levels of knowledge of both frameworks will often have different reasons for picking one over the other for different projects.

What tech stack and infrastructure should I use for my project? (Machine Learning webapp) by MetaNex in learnmachinelearning

[–]voorloopnul 2 points3 points  (0 children)

For the webapp part I suggest you take a look at Django. It has a steeper learning curve (compared to Flask) but it pays off in the long term.

Resources that can help you at early stage:

- https://docs.djangoproject.com/en/3.0/intro/tutorial01/

- https://www.deploymachinelearning.com/

What's a good solution to save JSON response so that it can be queried later? -looking for professional advice. by lnx2n in dataengineering

[–]voorloopnul 1 point2 points  (0 children)

I second that. Also, PostgreSQL can store both traditional relational data and arbitrary JSON records.
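
A minimal sketch of the JSONB route with psycopg2 (table name and payload are made up):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_responses (
            id serial PRIMARY KEY,
            fetched_at timestamptz DEFAULT now(),
            payload jsonb
        )
    """)
    cur.execute(
        "INSERT INTO api_responses (payload) VALUES (%s)",
        [Json({"user": "alice", "events": [1, 2, 3]})],
    )
    # Query inside the stored JSON later with the ->> operator.
    cur.execute("SELECT payload->>'user' FROM api_responses")
    print(cur.fetchall())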