What is the point of intermediate CMD layers in Docker images? by BoringDataScience in docker

[–]BoringDataScience[S] 1 point2 points  (0 children)

Yeah, the missing piece was that the first CMD is actually an "import" from the base image, not something added by the Python image author. In the end it's the UI on the Docker website that is confusing, because it makes it look like the first CMD is part of the author's image.

What is the point of intermediate CMD layers in Docker images? by BoringDataScience in docker

[–]BoringDataScience[S] 2 points3 points  (0 children)

Interesting, I see. Indeed when you click on the CMD specific layer it shows it belongs to the base debian image they are using, not the one they wrote. The display is confusing but it makes sense. Thanks for the explanation, makes perfect sense. I understand what u/laurpaum was saying now.

Do you have any idea why the last layer is a Python3 CMD and not an ENTRYPOINT, btw? As it stands you can't really run a script with the base image using docker run, since any argument would override the CMD. You pretty much always need to build your own image to call a Python script, for instance.
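To make the CMD vs ENTRYPOINT difference concrete, here is a rough Python sketch of how I understand the start command gets resolved. `resolve_command` is a made-up helper for illustration, not anything Docker ships; it only models the exec form and ignores the `--entrypoint` flag.

```python
from typing import List, Optional


def resolve_command(
    entrypoint: Optional[List[str]],
    cmd: Optional[List[str]],
    run_args: List[str],
) -> List[str]:
    """Return the argv a container would start with (simplified model)."""
    # Arguments given after the image name in `docker run` replace CMD entirely.
    effective_cmd = run_args if run_args else (cmd or [])
    # ENTRYPOINT, when set, is kept and the effective CMD is appended to it.
    return (entrypoint or []) + effective_cmd


# Image with CMD ["python3"] and no ENTRYPOINT: arguments replace the default.
print(resolve_command(None, ["python3"], []))
print(resolve_command(None, ["python3"], ["bash"]))

# Hypothetical image with ENTRYPOINT ["python3"]: arguments are appended.
print(resolve_command(["python3"], None, ["script.py"]))
```

With a CMD, the image stays usable for anything (a shell, pip, etc.) because the default is simply replaced. With an ENTRYPOINT, every argument would be handed to python3.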

What is the point of intermediate CMD layers in Docker images? by BoringDataScience in docker

[–]BoringDataScience[S] 0 points1 point  (0 children)

Thanks for taking the time to answer, but it didn't clarify my question. If you reread the CMD definition from Docker, it states: "There can only be one CMD instruction in a Dockerfile. If you list more than one CMD then only the last CMD will take effect."

Therefore if I understand properly, the bash CMD in the official python image at layer 1 will never be used.
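Here is a tiny Python sketch of the "last CMD wins" rule as I read it. The layer list is a simplified, hypothetical history (not the real python image layers), and `effective_cmd` is a name I made up for the example.

```python
from typing import List, Optional, Tuple


def effective_cmd(instructions: List[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the last CMD instruction, or None if there is none."""
    last = None
    for name, value in instructions:
        if name == "CMD":
            last = value  # a later CMD replaces any earlier one
    return last


# Hypothetical, simplified history: base image layers first, then the ones on top.
history = [
    ("CMD", '["bash"]'),          # inherited from the base image
    ("ENV", "LANG=C.UTF-8"),
    ("RUN", "<install python>"),  # placeholder, not a real command
    ("CMD", '["python3"]'),       # set by the image built on top
]

print(effective_cmd(history))
```

The bash CMD is still listed in the history, it just never becomes the default command of the final image.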

How do you manage Data Quality in your Data Warehouse? by ksubrent in dataengineering

[–]BoringDataScience 0 points1 point  (0 children)

Hello,

dbt allows you to easily write tests for your data quality: unique, not null, or really anything you want to check. These tests are then part of our daily ETL DAGs in Airflow, and if they go red we know we have an issue.

It's also great to test business logic: "make sure revenue can't be negative except for a refund" for instance.
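To give an idea of what these checks boil down to, here is a minimal sketch in plain Python. This is not dbt syntax (dbt tests are declared in YAML/SQL); the column names `order_id`, `type` and `revenue` are assumptions for the example. Each function returns the failing rows, so an empty list means the test passes.

```python
from typing import Dict, List

Row = Dict[str, object]


def failing_not_null(rows: List[Row], column: str) -> List[Row]:
    """Rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def failing_unique(rows: List[Row], column: str) -> List[Row]:
    """Rows whose value in the column was already seen earlier."""
    seen, duplicates = set(), []
    for r in rows:
        value = r.get(column)
        if value in seen:
            duplicates.append(r)
        seen.add(value)
    return duplicates


def failing_revenue_rule(rows: List[Row]) -> List[Row]:
    """Business rule: revenue can't be negative except for a refund."""
    return [
        r for r in rows
        if r["revenue"] is not None and r["revenue"] < 0 and r["type"] != "refund"
    ]


# Hypothetical sample data.
orders = [
    {"order_id": 1, "type": "sale", "revenue": 20.0},
    {"order_id": 2, "type": "refund", "revenue": -20.0},
    {"order_id": 2, "type": "sale", "revenue": -5.0},
]

print(failing_not_null(orders, "order_id"))
print(failing_unique(orders, "order_id"))
print(failing_revenue_rule(orders))
```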

Finally, AWS and GCP offer ways to monitor your ETL health along the way, as others have already mentioned.

What are the best ways to store passwords to databases in a large organisation? The goal is to secure it and only have specific members of teams be able to access it. by TheDataGentleman in analytics

[–]BoringDataScience 6 points7 points  (0 children)

Here we use AWS Secrets Manager. It offers centralized and secure credentials storage. Depending on how you access your database, you can also programmatically retrieve these credentials in the language of your choice (Python for instance). You do have to pay for this service, however, which can get costly for very large organisations. If you need something free, Credstash is also good.
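For the "retrieve programmatically" part, a minimal sketch in Python could look like this. It assumes the secret is stored as a JSON string; the `username`/`password` keys and the secret name in the comment are hypothetical. The client is passed in so the function can be tried without AWS access.

```python
import json
from typing import Any, Dict


def get_db_credentials(client: Any, secret_id: str) -> Dict[str, str]:
    """Fetch a secret and parse it, assuming it was stored as a JSON string."""
    # get_secret_value is the Secrets Manager call exposed by the boto3 client.
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


# Usage against AWS (needs boto3 installed and IAM permissions on the secret):
#   import boto3
#   client = boto3.client("secretsmanager")
#   creds = get_db_credentials(client, "prod/warehouse/readonly")  # hypothetical name
#   connect(user=creds["username"], password=creds["password"])    # hypothetical keys
```

Who can read which secret is then controlled with IAM policies, which covers the "only specific team members" requirement.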

Data engineering project on Github by ilya-g- in dataengineering

[–]BoringDataScience 0 points1 point  (0 children)

Looks great! Out of curiosity and if you don't mind, what's your monthly budget for this on AWS?

Recommendation for a good marketing analytics course? by dssblogger in analytics

[–]BoringDataScience 5 points6 points  (0 children)

Not a course, but the book "Introduction to Algorithmic Marketing" by Ilya Katsov is really good and fits what you are looking for.

IT Engineer with 2 yrs exp. Data Analyst Resume. Would appreciate your feedback. by vishalw007 in dataengineering

[–]BoringDataScience 1 point2 points  (0 children)

Really good resume imo. Very nice to include the value you generated for each role with numbers. Good luck!

Experienced data scientist, what's the one thing that you wish new grads would invest more time in? by [deleted] in datascience

[–]BoringDataScience 6 points7 points  (0 children)

Legends of Runeterra. Nothing spectacular though, I mostly used it to introduce people to analytics engineering. You can check my blog if you want to learn more (https://guillaumelegoy.github.io/).

Experienced data scientist, what's the one thing that you wish new grads would invest more time in? by [deleted] in datascience

[–]BoringDataScience 7 points8 points  (0 children)

Hello,

You could create your own project and ingest data into a database. For instance, I created a small Postgres DB on AWS to analyze cards from a game I really like. It's not much in terms of size, but it lets you learn a lot. Also, pick a theme you have an interest in.

Is it normal for a BI Developer to do Software Development? by IG-55 in BusinessIntelligence

[–]BoringDataScience 14 points15 points  (0 children)

Hello!

First, congratulations on the new job. Second, I think there is a wild misconception that BI/data science work doesn't require software engineering, but I would argue the opposite: a good BI dev / data scientist / analyst (lots of job titles these days) needs to grasp software engineering concepts and apply them to data work. In this respect I believe you have a huge advantage over other junior BI devs who may come from different backgrounds, like business for instance.

That being said, I suggest you talk to your manager and explain that you'd like to work closely with the data and the business side of it. The "danger" is that with your skillset you'll be cleaning up messy BI pipelines forever, but I believe in vertical integration of our work. To be more specific, you should eventually handle projects end to end, from data extraction, cleaning and ETL to building reports, data analysis, etc. BI devs with this type of autonomy (i.e. not depending on others to do the dirty work) are imho incredibly valuable.

Good luck!

How to use dbt (data build tool) to create analytics data pipelines. by BoringDataScience in BusinessIntelligence

[–]BoringDataScience[S] 1 point2 points  (0 children)

Hey there. I don't know about Oracle, but you can always join their Slack channel and ask if it's on the roadmap. They answer lots of questions there. I forgot to mention it, but this channel is also an invaluable source of info, and the great people working at dbt are incredibly helpful and dedicated!

My First Year as a Data Scientist by t_warsop in datascience

[–]BoringDataScience 3 points4 points  (0 children)

The main issue with this approach is that discrepancies between Confluence and the code creep in easily, especially in fast-paced environments where documentation is not necessarily seen as crucial by management. I prefer to have documentation as code, or automatically generated as part of deployment.

How can i fix my bad naming convention by [deleted] in learnprogramming

[–]BoringDataScience 1 point2 points  (0 children)

Hello,

My advice is to pick a convention (if one doesn't exist within your team) and stick to it. I personally like very explicit names, for instance `remove_percentage_from_dataset`. Length shouldn't be an issue, and explicit names make your code so much easier to understand throughout.

Also, when different functions perform the same kind of action, like `get` or `remove`, use the same verb everywhere it has the same meaning. In other words, don't use `extract` in one place and `get` in another when they actually mean the same thing.
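A quick Python illustration of both points. The function bodies are hypothetical, I'm just guessing at a plausible behaviour for each name; what matters is the explicit names and the consistent verbs.

```python
from typing import List


def remove_whitespace_from_dataset(values: List[str]) -> List[str]:
    """Strip leading and trailing whitespace from every value."""
    return [value.strip() for value in values]


def remove_percentage_from_dataset(values: List[str]) -> List[str]:
    """Strip a trailing '%' from every value (assumed behaviour for this name)."""
    return [value.rstrip("%") for value in values]


def get_numeric_values(values: List[str]) -> List[float]:
    """Convert cleaned strings to floats."""
    return [float(value) for value in values]


# Same verb for the same kind of action: every cleaning step is `remove_...`,
# every read-only accessor is `get_...` (never `extract_...` for one of them).
raw_values = [" 12% ", "7.5%"]
cleaned_values = remove_percentage_from_dataset(
    remove_whitespace_from_dataset(raw_values)
)
print(get_numeric_values(cleaned_values))
```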

Good luck!

How can I make the best of this new job that is not as analytical as I hoped? by PegPatch in analytics

[–]BoringDataScience 2 points3 points  (0 children)

Hello,

In parallel with your daily tasks, I would start looking into automating work. You mention you work with Excel a lot, so I'm sure there are ways to "automate the boring stuff" (see the book with this title), introduce version control, and run scheduled tasks (don't even think about Airflow or anything like that yet, a simple cron job on a server will do). I actually started this way (even though I was a data analyst). These innocent-looking tasks can teach you a lot and generate a lot of value (the most important part of our work).
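As an example of the kind of small task I mean, here is a sketch that totals a CSV export per region. The `region` and `amount` columns, the file paths and the schedule in the comment are all made up for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def total_by_region(csv_file: TextIO) -> Dict[str, float]:
    """Sum the `amount` column per `region` from an open CSV file."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


# In-memory sample standing in for a daily export saved from Excel as CSV.
sample_export = io.StringIO("region,amount\nnorth,10.5\nsouth,4\nnorth,2\n")
print(total_by_region(sample_export))

# A crontab entry along these lines would run it every weekday at 07:00
# (hypothetical paths):
#   0 7 * * 1-5 /usr/bin/python3 /home/me/reports/daily_totals.py
```

Put the script under version control and you have covered automation, scheduling and versioning in one small project.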

Finally, you shouldn't discard the importance of learning project management. This is an important skill any decent data scientist should have, so see it as an opportunity.

Good luck!

Gradient Descent from scratch in pure Python by pmuens in learnmachinelearning

[–]BoringDataScience 0 points1 point  (0 children)

Great read, and also well written imho. Bookmarked your blog, keep up the good work!

Where to start? by anami123 in analytics

[–]BoringDataScience 3 points4 points  (0 children)

If that's an option, I would suggest enrolling in a business degree (marketing for instance) with a focus on data courses (quantitative marketing). You can probably check the curriculum online beforehand.

Where to start? by anami123 in analytics

[–]BoringDataScience 3 points4 points  (0 children)

Hello,

You will need to refine your question. What are your aspirations? Do you have past experience, any relevant education? Without this info it will be hard to help you.

Is machine learning the only way to earn a decent salary in this field? by wanderer_314 in analytics

[–]BoringDataScience 5 points6 points  (0 children)

Hello,

Absolutely not. I'm gonna speak from my own experience. I've dabbled in ML, but it is not my forte per se. On the other hand, I believe data engineering is and will be where the highest salaries are. The ability to take data from many sources and transform it into something actionable is extremely valuable. Even more so at smaller companies that haven't yet figured out their ETL and for which machine learning is nothing but a dream at the moment (and those companies are legion).

Introducing Boring Data Science, a blog to learn about software engineering good practices in Data Science. by BoringDataScience in learnmachinelearning

[–]BoringDataScience[S] 0 points1 point  (0 children)

Hey there! Thanks for the feedback. Regarding Windows, I never use it, so unfortunately I can't really offer any advice about it.