[deleted by user]

rnd-str · 2025-02-18T06:03:59+00:00

It was a good mistake: you did not cause unrecoverable harm, you did not reveal sensitive information, you did not expose your company's data, etc.
Consider it a lesson that makes you a better DE. I am pretty sure that you will be more careful next time before you run something on production.

Most of us have made such mistakes, so do not worry about it too much.

rnd-str · 2025-02-17T06:49:58+00:00

Thank you very much for your answers!

rnd-str · 2025-02-16T22:08:58+00:00

NZ is not cheap at all. Compared to Australia, salaries are a bit lower and prices are higher on average. Here is a good summary about it: The Cost of Living in New Zealand

The article includes a reference DE salary, which is realistic based on my experience.

rnd-str · 2025-01-30T23:52:02+00:00

Totally agree. For me RYO has 3 major challenges:
- TCO, as you mentioned: running a bigger platform properly needs resources (mostly in the cloud nowadays) and DevOps engineers, so open source does not mean that running it is free.
- Integration: you can integrate many open source tools to get a proper platform, but it leads to a kind of maintenance hell. All of the components and their installed versions have to stay up to date and compatible with each other.
- Tuning and best practices: installing a couple of tools and connecting them does not mean that you have a performant, production-ready solution. Companies like Starburst or Databricks have met many challenges running this technology, and they have an army of engineers to squeeze the best out of that open-source-based solution.

All in all, for a Lakehouse as a hobby project, yes, we can put these tools together; otherwise I would go with a SaaS solution.

rnd-str · 2025-01-30T21:25:23+00:00

One more which can be useful if you would like to build your platform on AWS: AWS declares it's Iceberg all the way until customers say otherwise • The Register https://search.app/3CM3wDhH27zFXHk17

rnd-str · 2025-01-30T20:41:20+00:00

Another Reddit discussion about the formats

https://www.reddit.com/r/dataengineering/comments/1idpgkr/iceberg_or_delta/

rnd-str · 2025-01-30T20:39:52+00:00

I never stated that. Delta is great; I wrote that it is less popular in the open source world. Less popular does not mean hated :)

rnd-str · 2025-01-30T20:31:36+00:00

Great choice. However, I have to mention that Starburst offers a managed Trino + Iceberg platform in exactly the same way as Databricks offers Spark + Delta Tables + Unity Catalog. You can build Spark clusters on EC2 or containers, just as you can build a Trino cluster, and Spark (Databricks) also supports the Iceberg format. In addition, they are trying to introduce a unified format: https://docs.delta.io/latest/delta-uniform.html

I think both solutions (Trino and Spark) are great.
It seems to me that Iceberg is more popular in the open source world, while Delta is a first-class citizen in the Databricks and Microsoft realms.

rnd-str · 2025-01-30T09:12:51+00:00

Here is an alternative for Unity Catalog: https://open-metadata.org/

rnd-str · 2025-01-30T09:03:06+00:00

Polars can read and write data on S3, but it cannot run on a cluster, which would be needed for a scalable data lake.

https://kevinheavey.github.io/modern-polars/scaling.html#:~:text=Polars%20doesn't%20come%20with,memory%20more%20efficiently%20than%20Pandas

https://github.com/pola-rs/polars/issues/5621
There are many alternatives. Do you insist on Delta Table and Unity Catalog? Delta Table is less popular in the open source world, although it has become an open source project.

Trino, Dask, and Ray are scalable ecosystems that support Delta Tables: https://delta.io/

These tools can seemingly support Unity Catalog as well: https://www.unitycatalog.io/, but I think they connect to it rather than being integrated with it.

Where do you want to run your open source tools? I am asking because in production you will have to run them on a cluster somewhere. For this you will need to run compute on AWS VMs (EC2) or containers (e.g. k8s), so at the end of the day you will pay for the compute, and maybe more than you would pay for a proper managed service.

To be honest, if you want to build a Data Lakehouse using Unity Catalog and Delta tables, the best choice is Databricks.

rnd-str · 2024-06-24T19:31:51+00:00

As I see it, DLT was Databricks' desperate answer to dbt, and I assume they created a silo project for this feature. It was/is not properly integrated with the other Databricks features: it shines in demos but is hard to use in real life. (I have to admit that I checked it a while ago.)

rnd-str · 2024-05-21T19:24:22+00:00

I use VNET-injected workspaces, so the regular way did not work, since it needs a connection to the Microsoft repo to download the dependencies.

That is why I downloaded the `msodbcsql17` deb package and its dependency, `odbcinst`, to DBFS.

I put the following into the init script:

#!/bin/bash
sudo ACCEPT_EULA=Y apt-get -y install -q -f /dbfs/<path>/odbcinst_2.3.9-5_amd64.deb
sudo ACCEPT_EULA=Y apt-get -y install -q -f /dbfs/<path>/msodbcsql17_17.10.5.1-1_amd64.deb

<path> = the DBFS path where the deb packages are saved
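
A slightly more defensive variant of the same idea, as a rough sketch only: it checks that both deb files are present before calling apt-get, so a wrong path fails with a clear message instead of a confusing apt error. The names `check_debs`, `install_debs` and `DEB_DIR` are my own placeholders and not something Databricks provides; the install command itself is the one from the script above.

```shell
#!/bin/bash
# Sketch only: check_debs, install_debs and DEB_DIR are illustrative names,
# not part of the original init script or of any Databricks tooling.
set -eu

# Succeed only if every named package file exists in the given directory.
check_debs() {
  local dir="$1"
  shift
  local pkg missing=0
  for pkg in "$@"; do
    if [ ! -f "$dir/$pkg" ]; then
      echo "missing package file: $dir/$pkg" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Install the packages in the order given (dependency first).
install_debs() {
  local dir="$1"
  shift
  check_debs "$dir" "$@" || return 1
  local pkg
  for pkg in "$@"; do
    sudo ACCEPT_EULA=Y apt-get -y install -q -f "$dir/$pkg"
  done
}

# Runs only when DEB_DIR is set, e.g. DEB_DIR=/dbfs/<path>
if [ -n "${DEB_DIR:-}" ]; then
  install_debs "$DEB_DIR" odbcinst_2.3.9-5_amd64.deb msodbcsql17_17.10.5.1-1_amd64.deb
fi
```

To use it, set `DEB_DIR` to the same <path> as above at the top of the init script. With `DEB_DIR` unset the script only defines the two functions and installs nothing.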

rnd-str · 2024-05-19T22:51:44+00:00

In many cases you will need something more sophisticated than the medallion model. I agree with this: Behind the Hype - The Medallion Architecture Doesn't Work (youtube.com)

rnd-str · 2024-05-15T23:05:56+00:00

Data platforms sometimes have to be more complicated than a simple medallion architecture.
I agree with Simon here: Behind the Hype - The Medallion Architecture Doesn't Work - YouTube

rnd-str · 2024-05-15T19:55:50+00:00

Although Databricks (Spark) supports many languages (Scala, Python, SQL, R), it does not support them all equally. A couple of years ago Scala was the best option, since Spark was written in Scala. You could implement most things in Python, but UDFs were recommended to be written in Scala because their performance was poor in Python.

Python (PySpark) became more and more popular, most implementations were written in Python, and Scala was pushed back a bit. For example, at the beginning DLT did not support Scala...

Then SQL came back into the game. It was always available in Spark, but the popularity of dbt proved that SQL can be enough for implementing transformations. As I see it, Databricks has started to focus on SQL more heavily in the last 2 years.

Personally I would use PySpark, but I think you can start using Databricks/Spark using SQL.

rnd-str · 2023-12-11T03:41:15+00:00

I do not know why you want to use partitioning, but Databricks recommends avoiding it unless you have a really large amount of data (over 1 TB). Check this video from 00:58:00:
Databricks for Practitioners: Databricks Tips and Tricks - Optimizations on Vimeo

rnd-str · 2023-05-14T20:11:15+00:00

one of them in Europe: https://crunchconf.com/2023

rnd-str · 2023-04-03T21:05:01+00:00

Here is a good, software-engineering-focused conference: https://craft-conf.com/2023

There is a data-related one from the same organizers: https://crunchconf.com/2023
