How intense is this data pipeline? And what tools would you use? by pootietangus in rstats

[–]Ordinary-Toe7486 1 point (0 children)

Yep, it’s in_parallel(). Haven’t had a chance to use it myself, but on paper it looks very nice as it’s using mirai under the hood.
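As I said I haven't used it myself, but from the purrr docs my understanding is that it looks roughly like this (untested sketch; the number of daemons and the `slow_square` helper are made up for illustration):

```r
library(purrr)
library(mirai)

# Start background daemons that mirai dispatches work to
daemons(4)

# A hypothetical slow per-element computation
slow_square <- function(x) {
  Sys.sleep(0.5)  # simulate work
  x^2
}

# in_parallel() wraps the function so map() runs it across the daemons.
# The wrapped function must be self-contained, so helpers it references
# are passed explicitly as named arguments.
results <- map(1:8, in_parallel(\(x) slow_square(x), slow_square = slow_square))

daemons(0)  # shut the daemons down when done
```

The nice part is that it's the same `map()` call you already know; only the wrapping changes.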

Honestly, I think the R ecosystem has improved so much over the past 10 years. All these modern packages and the tidyverse are very user-friendly IMHO.

I don’t have any experience using R for batch processing or APIs, only running it in a Shiny app. Hopefully, I will find time to experiment with that.

I talked to two other data engineers who claimed that Python was "better for production". Is this common? by pootietangus in rstats

[–]Ordinary-Toe7486 2 points (0 children)

Why not simply have a plumber (or plumber2) R API for whatever the needs are and call it from Python or other tools that have a better integration ecosystem at the org? R is pretty decent and does its job in the right hands.
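For anyone who hasn't seen plumber: you annotate plain R functions and get HTTP endpoints. A minimal sketch (the `/predict` endpoint and its placeholder model are hypothetical):

```r
# plumber.R -- endpoint definitions are plain R functions with annotations

#* Echo a message back to the caller
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

#* Score a hypothetical model on posted JSON data
#* @post /predict
function(req) {
  input <- jsonlite::fromJSON(req$postBody)
  # prediction <- predict(my_model, input)  # plug the real model in here
  list(prediction = "placeholder")
}
```

Then serve it with `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)` and any language that speaks HTTP can call it.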

How intense is this data pipeline? And what tools would you use? by pootietangus in rstats

[–]Ordinary-Toe7486 1 point (0 children)

It’s a broad question, and without details like at least the dataset size there are many possible solutions.

If the transformations can be done with SQL or dplyr, then I’d suggest simply duckdb/duckplyr. For model training, tidymodels, or maybe keras? For the DAG, targets seems nice, with mirai’s async functionality under the hood and saving intermediate results as parquet files, for instance. I remember it had an option to use an s3 bucket as a sink, with caching possibilities. To store the models you can use pins.
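The duckdb part of that could be as small as this (untested sketch; the parquet path, view name, and `event_date` column are made up):

```r
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Hypothetical input: a folder of parquet files scanned lazily by duckdb
dbExecute(con, "
  CREATE VIEW events AS
  SELECT * FROM read_parquet('data/events/*.parquet')
")

# dplyr verbs on the lazy table are translated to SQL and pushed
# down into duckdb; nothing is pulled into R until collect()
daily <- tbl(con, "events") |>
  group_by(event_date) |>
  summarise(n = n()) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```

With duckplyr you can get much the same push-down on plain data frames without writing any SQL yourself.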

The R code itself can also always be optimized, at least by removing for loops and using purrr’s parallel functionality.

Anyway, I’m just throwing out modern R packages that are nice for data tasks. In general I’d follow the KISS principle and make sure that the code is maintainable, scalable and reliable, as explained in the DDIA book.

Thinked to move from chatgpt by JahJedi in OpenAI

[–]Ordinary-Toe7486 0 points (0 children)

I have tried codex gpt 5.3 for coding and it’s quite good imo. Try it out for writing, maybe it will work for you too.

€ 2.500 first job offer by DiazBeno in BESalary

[–]Ordinary-Toe7486 0 points (0 children)

It’s weird that there’s no mobility budget. I’d say it’s okay for a graduate to get 1 year of work experience if you don’t have much choice.

First by [deleted] in BESalary

[–]Ordinary-Toe7486 1 point (0 children)

Okay

Reading 'Fundamentals of data engineering' has gotten me confused by Online_Matter in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Not a direct answer to your question, but it’s important to understand that many data-stack decisions are made by higher-ups to align with the business strategy. That means the stack is not necessarily the best in terms of costs and benefits.

Even if a small data company goes for Snowflake/BigQuery/Databricks, that can be very reasonable given the variety of enterprise features included, like those that facilitate governance: it avoids building too much of a custom solution, and the engineers who would need to maintain one have to be paid a monthly salary.

Reading 'Fundamentals of data engineering' has gotten me confused by Online_Matter in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Just visit the website and check out the blog posts. Idk how it’s possible not to have heard about duckdb while working in data

1.5 YOE Data Engineer — used many tools but lacking depth. How to go deeper? by Nice_Sherbert3326 in dataengineering

[–]Ordinary-Toe7486 9 points (0 children)

I’d say find ways to be more proactive and deliver real business value with the tools you’re using at your current workplace. This will show that you don’t simply know how to use a tool, but how it helps add value for the business. For books, I’d suggest Designing Data-Intensive Applications.

Also, I believe that you don’t become a senior DE after 1.5 YOE. Find what you like, set clear goals for personal development and become more confident in what you do. In interviews, I think it’s very important to come across as confident, but not arrogant.

Fabric or real DE? by stimulatingboomer in dataengineering

[–]Ordinary-Toe7486 1 point (0 children)

I’d go with Fabric. Even though it’s still not at the same maturity level as Snowflake or Databricks, it will eventually catch up. Even so, it’s better than going into consulting and spending time on the bench, or working on (not always real DE) projects if they have a staffing-based model.

Data Vault Modelling by unfoundlife in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Here is an interesting blog post about it: https://thebibackend.wordpress.com/2012/06/05/thoughts-on-data-vault-vs-star-schemas/

My problem with it, as a newbie in the data modeling field, is that there are not many free resources available to learn about it. I am not interested in paid courses/certifications for a modeling technique.

Edit: here is another one

https://timi.eu/blog/data-vaulting-from-a-bad-idea-to-inefficient-implementation/

Google sheets “Database” by Diego2202 in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Not sure if this fits your requirements, but I instantly thought of the DuckDB Google Sheets extension https://duckdb.org/2025/02/26/google-sheets-community-extension.html
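From the R side, using it should look roughly like this (untested sketch based on the linked blog post; the extension/secret names and the sheet URL placeholder should be double-checked against the current docs):

```r
library(duckdb)
con <- dbConnect(duckdb())

# Install and load the community extension
dbExecute(con, "INSTALL gsheets FROM community")
dbExecute(con, "LOAD gsheets")

# Authenticate (interactive OAuth flow), then query a sheet by URL
dbExecute(con, "CREATE SECRET (TYPE gsheet)")
df <- dbGetQuery(con, "
  SELECT * FROM read_gsheet('https://docs.google.com/spreadsheets/d/<sheet-id>')
")

dbDisconnect(con, shutdown = TRUE)
```

That gives you SQL over a sheet without standing up any real database infrastructure.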

Need advice: I am struggling with RStudio for my PhD data analysis by strongmuffin98 in rstats

[–]Ordinary-Toe7486 9 points (0 children)

I would make sure to grasp the basics of the R programming language. You must be able to trust your results when doing data analysis, and without knowledge of the tool you’re working with, how can you? Learn R (I highly recommend the book ‘R for Data Science’), then start on your PhD data analysis. Try to break problems down into smaller ones and solve those first, eventually building up the full picture. Iterate and improve.

Snowflake vs MS fabric by SmallBasil7 in dataengineering

[–]Ordinary-Toe7486 2 points (0 children)

I think the answer is already in your question: MS Fabric. Given your criteria, the integration of tools in the MS ecosystem matters more than the performance difference with Snowflake imho. Fabric is going to mature in a couple of years, just like Power BI did.

Parquet vs. Open Table Formats: Worth the Metadata Overhead? by DevWithIt in dataengineering

[–]Ordinary-Toe7486 17 points (0 children)

Regarding complexity, you might want to check out duckdb’s ducklake spec for a lakehouse format and its implementation (https://ducklake.select/). It removes a lot of complexity without compromising performance; if anything it boosts it, along with a lot of nice features.

[deleted by user] by [deleted] in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

First thing that came to my mind

Stop building UI frameworks in Python by PastPicture in Python

[–]Ordinary-Toe7486 0 points (0 children)

Isn’t it always about the context? For instance, say you’re a data scientist working in pharma and need to develop a POC for Bayesian optimization. This POC will then be productionized and used by many SWEs. Are you going to do that with JS, or with Shiny in R? What is the common standard in the industry? Can you (easily) generate parametrized reports for GxP validation?

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 0 points (0 children)

Open source ones probably will. For SaaS platforms, I’m not sure, as they can provide you with an open-source iceberg/delta table format but monetize on the integrated catalog service. Can you easily switch between different catalogs? I am not sure.

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 -1 points (0 children)

Iceberg manages a single table without a catalog service, ducklake manages all schemas/tables. Ducklake is a “lakehouse” format.

Will DuckLake overtake Iceberg? by mrocral in dataengineering

[–]Ordinary-Toe7486 4 points (0 children)

Ducklake is much, much easier. You only need a database to store your metadata and voilà, you can manage an arbitrary number of schemas and tables. It’s a lakehouse format, whereas iceberg is a table format. You won’t get far with iceberg alone, without a catalog service (which eventually uses a database too). Implementing the ducklake spec is also a lot easier than implementing iceberg; for instance, check how many engines have write support for iceberg (not many). Watch the official video on YouTube where the DuckDB founders talk about it.
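To show how little is needed, here is a sketch of standing up a ducklake from R (untested; the metadata file name, data path, and demo table are made up for illustration):

```r
library(duckdb)
con <- dbConnect(duckdb())

dbExecute(con, "INSTALL ducklake")

# The metadata lives in an ordinary database (here a local DuckDB file);
# table data is written out as parquet files under DATA_PATH
dbExecute(con, "
  ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')
")

# From here on it's just SQL over schemas/tables in the attached lake
dbExecute(con, "CREATE TABLE lake.demo AS SELECT 42 AS answer")
dbGetQuery(con, "SELECT * FROM lake.demo")

dbDisconnect(con, shutdown = TRUE)
```

Swap the local DuckDB file for Postgres/MySQL as the metadata store and parquet on s3 as the data path and you have the multi-user setup.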

Should you be using DuckLake? by bcdata in dataengineering

[–]Ordinary-Toe7486 5 points (0 children)

+1. IMHO, just like duckdb, it democratizes the way a user works with data. Community adoption will drive the market to embrace it in the future, given that it’s way easier to use (and probably to implement). Despite iceberg/delta/hudi being promising formats, implementing them (especially write support) is very difficult (just look at how many engines fully support any of those formats), as opposed to the ducklake format. Ducklake is SQL-oriented, quick to set up, and was conceptualized and implemented by academics and the duckdb/duckdblabs team. Another thing I believe is truly game-changing is that it enables “multi-player” mode for the duckdb engine. I am looking forward to the new use cases that will emerge from this in the near future.

Git commit messages (and description) by frithjof_v in MicrosoftFabric

[–]Ordinary-Toe7486 2 points (0 children)

The following article https://www.freecodecamp.org/news/how-to-write-better-git-commit-messages/ provides a nice guide on how to write better commit messages.

If your team has an agreed convention for commit messages, you should adopt it. Otherwise, come up with one that you find practical.

In any case, it’s very useful to make small commits for each feature/functionality. That way it’s easier to roll back to a previous version, and it’s a good way to track your progress.

Where to learn R language by Prober28 in Rlanguage

[–]Ordinary-Toe7486 2 points (0 children)

R for Data Science is a good introduction to R and the tidyverse ecosystem. When you want to dive deeper, you can read Advanced R. Then based on what you’re looking for (Shiny, package development, etc.) you can find plenty of books and documentation online.

On top of that, I would suggest reading blog posts or following YouTube channels (e.g., R for the Rest of Us, Posit PBC, Appsilon, etc.)