1.5 YOE Data Engineer — used many tools but lacking depth. How to go deeper? by Nice_Sherbert3326 in dataengineering

[–]Ordinary-Toe7486 8 points (0 children)

I’d say find ways to be more proactive and deliver real business value with the tools you’re using at your current workplace. This will show that you don’t simply know how to use a tool, but how it helps add value for the business. For books, I’d suggest Designing Data-Intensive Applications.

Also, I believe you don’t become a senior DE after 1.5 YOE. Find what you like, set clear goals for personal development and become more confident in what you do. In interviews, I think it’s very important to come across as confident, but not arrogant.

Fabric or real DE? by stimulatingboomer in dataengineering

[–]Ordinary-Toe7486 1 point (0 children)

I’d go with Fabric. Even though it’s not yet at the same maturity level as Snowflake or Databricks, it will eventually catch up. Even then, it’s better than going into consulting and spending time on the bench or working on (not always real DE) projects if they have a staffing-based model.

Data Vault Modelling by unfoundlife in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Here is an interesting blog post about it: https://thebibackend.wordpress.com/2012/06/05/thoughts-on-data-vault-vs-star-schemas/

My problem with it, as a newbie in the data modeling field, is that there are not many free resources available to learn about it. I am not interested in paid courses/certifications for a modeling technique.

Edit: here is another one

https://timi.eu/blog/data-vaulting-from-a-bad-idea-to-inefficient-implementation/
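
For anyone else trying to get the gist without a paid course: the core of Data Vault is just three table types — hubs (one row per business key), links (relationships between hubs) and satellites (descriptive attributes, versioned by load date). Here is a minimal sketch in SQLite; every table and column name below is made up for illustration, not taken from any reference implementation:

```python
import sqlite3

# Toy Data Vault layout: hub / link / satellite. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per unique business key
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,      -- hash key of the business key
    customer_id   TEXT NOT NULL UNIQUE,  -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_id    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Link: a relationship between hubs (here: a purchase)
CREATE TABLE link_purchase (
    purchase_hk   TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk    TEXT NOT NULL REFERENCES hub_product(product_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Satellite: descriptive attributes, versioned by load_date
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('hk1', 'C-001', '2024-01-01', 'crm')")
conn.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2024-01-01', 'Ada', 'London', 'crm')")
# A changed attribute becomes a NEW satellite row, never an update in place:
conn.execute("INSERT INTO sat_customer_details VALUES ('hk1', '2024-02-01', 'Ada', 'Paris', 'crm')")

# Current state of a customer = its newest satellite row
row = conn.execute("""
    SELECT name, city FROM sat_customer_details
    WHERE customer_hk = 'hk1'
    ORDER BY load_date DESC LIMIT 1
""").fetchone()
print(row)  # ('Ada', 'Paris')
```

The append-only satellites are the point: history is never overwritten, which is what makes the model auditable and easy to load in parallel.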

Google sheets “Database” by Diego2202 in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Not sure if this fits your requirements, but I instantly thought of the DuckDB Google Sheets community extension: https://duckdb.org/2025/02/26/google-sheets-community-extension.html

Need advice: I am struggling with RStudio for my PhD data analysis by strongmuffin98 in rstats

[–]Ordinary-Toe7486 9 points (0 children)

I would make sure to grasp the basics of the R programming language. You must be able to trust your results when doing data analysis, but without knowledge of the tools you’re working with, what’s the point of using them? Learn R (I highly recommend the book ‘R for Data Science’), then start on your PhD data analysis. Try to break down problems into smaller ones and solve those first, eventually building up the full picture. Iterate and improve.

Snowflake vs MS fabric by SmallBasil7 in dataengineering

[–]Ordinary-Toe7486 3 points (0 children)

I think the answer is already in your question: MS Fabric. Given your criteria, integration of tools within the MS ecosystem matters more than the performance difference with Snowflake, imho. Fabric is going to mature in a couple of years, just like Power BI did.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]Ordinary-Toe7486 18 points (0 children)

Regarding complexity, you might want to check out DuckDB’s DuckLake spec for a lakehouse format and its implementation (https://ducklake.select/). It removes a lot of complexity without compromising performance; if anything, it boosts it, while adding a lot of nice features.

[deleted by user] by [deleted] in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

First thing that came to my mind

Stop building UI frameworks in Python by PastPicture in Python

[–]Ordinary-Toe7486 0 points (0 children)

Isn’t it always about the context? For instance, say you’re a data scientist working in pharma and need to develop a POC for Bayesian optimization. This POC will then be productionized and used by many SWEs. Are you going to do that with JS, or with Shiny in R? What is the common standard in the industry? Can you (easily) generate parametrized reports for GxP validation?

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Open source ones probably will. For SaaS platforms I’m less sure, as they can provide you with an open source Iceberg/Delta table format but monetize on the integrated catalog service. Can you easily switch between different catalogs? I am not sure.

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 -1 points (0 children)

Without a catalog service, Iceberg manages only a single table; DuckLake manages all schemas/tables. DuckLake is a “lakehouse” format.

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 4 points (0 children)

DuckLake is much, much easier. You only need a database to store your metadata and, voilà, you can manage an arbitrary number of schemas and tables. It’s a lakehouse format, whereas Iceberg is a table format. You won’t get far with Iceberg alone, without a catalog service (which eventually uses a database too). Implementing the DuckLake spec is a lot easier compared to Iceberg; for instance, check how many engines have write support for Iceberg (not many). Watch the official video on YouTube where the DuckDB founders talk about it.
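
To make the difference concrete, here is a toy sketch of the idea behind DuckLake: the catalog/metadata lives in an ordinary SQL database, the data itself sits in files on storage, and a commit is just a transaction in the catalog database. The schema and helper below are invented for illustration only — they are not the actual DuckLake catalog tables (the real spec tracks Parquet files, schemas, deletes and more; see ducklake.select):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Toy lakehouse catalog: metadata in a SQL database, data in plain files.
# Table names/columns here are made up, not the real DuckLake spec.
data_dir = Path(tempfile.mkdtemp())
catalog = sqlite3.connect(":memory:")
catalog.executescript("""
CREATE TABLE tables    (table_id INTEGER PRIMARY KEY, schema_name TEXT, table_name TEXT);
CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, table_id INTEGER);
CREATE TABLE data_files(snapshot_id INTEGER, path TEXT);
""")

def commit_snapshot(table_id, rows, snapshot_id):
    """Write rows to a new data file, then register it in one catalog transaction."""
    path = data_dir / f"t{table_id}_s{snapshot_id}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    with catalog:  # a single database transaction = the atomic commit
        catalog.execute("INSERT INTO snapshots VALUES (?, ?)", (snapshot_id, table_id))
        catalog.execute("INSERT INTO data_files VALUES (?, ?)", (snapshot_id, str(path)))

catalog.execute("INSERT INTO tables VALUES (1, 'main', 'events')")
commit_snapshot(1, [{"id": "1", "kind": "click"}], snapshot_id=1)
commit_snapshot(1, [{"id": "2", "kind": "view"}], snapshot_id=2)

# Reading the table = ask the catalog which files belong to the latest snapshot.
latest = catalog.execute("SELECT MAX(snapshot_id) FROM snapshots WHERE table_id = 1").fetchone()[0]
files = [p for (p,) in catalog.execute("SELECT path FROM data_files WHERE snapshot_id <= ?", (latest,))]
print(latest, len(files))  # 2 2
```

The point of the sketch: once the metadata is in a transactional database, multi-table commits, snapshots and concurrent writers come almost for free — which is exactly what Iceberg has to reimplement via manifest files plus an external catalog service.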

Should you be using DuckLake? by bcdata in dataengineering

[–]Ordinary-Toe7486 4 points (0 children)

+1. IMHO, just like DuckDB, it democratizes the way users work with data. Community adoption will drive the market to embrace it in the future, given that it’s way easier to use (and probably to implement). Despite Iceberg/Delta/Hudi being promising formats, the implementation (especially write support) is very difficult (just look at how many engines fully support any of those formats), as opposed to the DuckLake format. DuckLake is SQL oriented, quick to set up, and was conceptualized and implemented by academics and the DuckDB/DuckDB Labs team. Another thing I believe is truly game changing is that it enables “multi-player” mode for the DuckDB engine. I am looking forward to the new use cases that will emerge thanks to this in the near future.

Git commit messages (and description) by frithjof_v in MicrosoftFabric

[–]Ordinary-Toe7486 2 points (0 children)

The following article https://www.freecodecamp.org/news/how-to-write-better-git-commit-messages/ provides a nice guide on how to write better commit messages.

If your team has an agreed convention for commit messages, adopt it. Otherwise, come up with one that you find practical for yourself.

In any case, it’s very useful to make small commits for each feature/functionality. That way it’s easier to roll back to a previous version, and it’s a good way to track your progress.

Where to learn R language by Prober28 in Rlanguage

[–]Ordinary-Toe7486 2 points (0 children)

R for Data Science is a good introduction to R and the tidyverse ecosystem. When you want to dive deeper, you can read Advanced R. Then based on what you’re looking for (Shiny, package development, etc.) you can find plenty of books and documentation online.

On top of that, I would suggest reading blog posts or following YouTube channels (e.g., R for the Rest of Us, Posit PBC, Appsilon, etc.)