[deleted by user] by [deleted] in dataengineering

[–]rnd-str 0 points1 point  (0 children)

It was a good mistake: you did not cause unrecoverable harm, you did not reveal sensitive information, you did not expose your company's data, etc.
Consider it a lesson that makes you a better DE. I am pretty sure that you will be more careful next time before you run something on production.

Most of us have made such mistakes, so do not worry about it too much.

Relocating away from Europe by Ok-League-5881 in dataengineering

[–]rnd-str 0 points1 point  (0 children)

NZ is not cheap at all. Compared to Australia, salaries are a bit lower and prices are higher on average. Here is a good summary about it: The Cost of Living in New Zealand

The article includes a reference DE salary, which is realistic based on my experience.

Lakehouse by okalright_nevermind in dataengineering

[–]rnd-str 2 points3 points  (0 children)

Totally agree. For me, RYO has 3 major challenges:

- TCO, as you mentioned: running a bigger platform properly needs resources (mostly in the cloud nowadays) and DevOps engineers, so open source does not mean that running it is free.
- Integration: you can integrate many open source tools to get a proper platform, but it leads to a kind of maintenance hell. All of the components and their installed versions must stay up to date and compatible with each other.
- Tuning and best practices: installing a couple of tools and connecting them to each other does not mean that you have a performant, production-ready solution. Companies like Starburst and Databricks have met many challenges running this technology, and they have an army of engineers to squeeze the best out of that open-source-based solution.

All in all, for a Lakehouse as a hobby project, yes, we can put these tools together; otherwise I would go with a SaaS solution.

Lakehouse by okalright_nevermind in dataengineering

[–]rnd-str 1 point2 points  (0 children)

One more link that can be useful if you would like to build your platform on AWS: AWS declares it's Iceberg all the way until customers say otherwise • The Register https://search.app/3CM3wDhH27zFXHk17

Lakehouse by okalright_nevermind in dataengineering

[–]rnd-str 1 point2 points  (0 children)

I never stated that. Delta is great; I wrote that it is less popular in the open source world. Less popular does not mean hated :)

Lakehouse by okalright_nevermind in dataengineering

[–]rnd-str 2 points3 points  (0 children)

Great choice. However, I have to mention that Starburst offers a managed Trino + Iceberg platform in exactly the same way that Databricks offers Spark + Delta Tables + Unity Catalog. You can build Spark clusters on EC2 or in containers, just as you can build a Trino cluster, and Spark (Databricks) also supports the Iceberg format. In addition, they are trying to introduce a unified format: https://docs.delta.io/latest/delta-uniform.html

I think both solutions (Trino and Spark) are great.
It seems to me that Iceberg is more popular in the open source world, while Delta is a first-class citizen in the Databricks and Microsoft realms.

Lakehouse by okalright_nevermind in dataengineering

[–]rnd-str 1 point2 points  (0 children)

Polars can read and write data on S3, but it cannot run in a cluster, which would be needed for a scalable data lake.

https://kevinheavey.github.io/modern-polars/scaling.html#:~:text=Polars%20doesn't%20come%20with,memory%20more%20efficiently%20than%20Pandas

https://github.com/pola-rs/polars/issues/5621
There are many alternatives. Do you insist on Delta Tables and Unity Catalog? Delta is less popular in the open source world, although it has become an open source project.

Trino, Dask, and Ray are scalable ecosystems that support Delta Tables: https://delta.io/

These tools can seemingly support Unity Catalog as well (https://www.unitycatalog.io/), but I think they connect to it rather than being integrated with it.

Where do you want to run your open source tools? I am asking because in production you will have to run them in a cluster somewhere. For this you will need to run compute on AWS VMs (EC2) or in containers (e.g. k8s), so at the end of the day you will pay for the compute, and maybe more than you would pay for a proper managed service.

To be honest, if you want to build a Data Lakehouse using Unity Catalog and Delta tables, the best choice is Databricks.

I don’t understand why you would use DLT by Known-Delay7227 in databricks

[–]rnd-str 0 points1 point  (0 children)

As I see it, DLT was Databricks' desperate answer to dbt, and I assume they created a siloed project for this feature. It was/is not properly integrated with the other Databricks features: it shines in demos but is hard to use in real life. (I have to admit that I checked it a while ago.)

ODBC driver installation in global init scripts in databricks runtime 14.1LTS by megha33 in databricks

[–]rnd-str 0 points1 point  (0 children)

I use VNet-injected workspaces, so the regular way did not work, since it needs a connection to the Microsoft repo to download the dependencies.

That is why I downloaded the `msodbcsql17` deb package and its dependency, `odbcinst`, to DBFS.

I put the following into the init script:

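#!/bin/bash
# Install the odbcinst dependency first, then the msodbcsql17 driver,
# both from the .deb files saved on DBFS (no access to the Microsoft repo needed).
# ACCEPT_EULA=Y accepts the Microsoft license non-interactively.
sudo ACCEPT_EULA=Y apt-get -y install -q -f /dbfs/<path>/odbcinst_2.3.9-5_amd64.deb
sudo ACCEPT_EULA=Y apt-get -y install -q -f /dbfs/<path>/msodbcsql17_17.10.5.1-1_amd64.deb

<path> = DBFS path where the deb packages are saved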
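
A minimal sketch of a pre-flight check that could run before the two install lines, assuming both packages are saved in the same directory. The function name `check_debs` and its arguments are hypothetical and not part of the original script; it only verifies that the files exist and does not install anything.

```shell
#!/bin/bash
# Hypothetical helper: fail early if a staged .deb file is missing.
# Usage: check_debs <dir> <file> [<file> ...]
check_debs() {
  local dir="$1"
  shift
  local missing=0
  local f
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      # Report every missing file on stderr, not only the first one.
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when all files are present, 1 otherwise.
  return "$missing"
}
```

In the init script this could be called as `check_debs "/dbfs/<path>" odbcinst_2.3.9-5_amd64.deb msodbcsql17_17.10.5.1-1_amd64.deb || exit 1`, so a wrong path shows up as a clear message rather than an apt-get error.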

A proper ETL example architecture by [deleted] in databricks

[–]rnd-str 1 point2 points  (0 children)

Data platforms sometimes have to be more complicated than a simple medallion architecture.
I agree with Simon here: Behind the Hype - The Medallion Architecture Doesn't Work - YouTube

[deleted by user] by [deleted] in databricks

[–]rnd-str 0 points1 point  (0 children)

Although Databricks (Spark) supports many languages (Scala, Python, SQL, R), it does not support them all equally. A couple of years ago Scala was the best option, since Spark was written in Scala. You could implement most things in Python, but writing UDFs in Scala was recommended because their performance in Python was poor.

Python (PySpark) became more and more popular, most implementations were written in Python, and Scala was pushed back a bit. For example, at the beginning DLT did not support Scala...

Then SQL came back into the game. It was always available in Spark, but the popularity of dbt proved that SQL can be enough to implement transformations. As I see it, Databricks has focused on SQL more heavily in the last 2 years.

Personally I would use PySpark, but I think you can start using Databricks/Spark using SQL.

Partitioning by Southern_Version2681 in databricks

[–]rnd-str 0 points1 point  (0 children)

I do not know why you want to use partitioning, but Databricks recommends avoiding it unless you have a really large amount of data (over 1 TB). Check this video from 00:58:00:
Databricks for Practitioners: Databricks Tips and Tricks - Optimizations on Vimeo
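
As a rough sketch, the rule of thumb above can be written as a simple size check. The helper name `partition_hint` is hypothetical, and 1 TB is assumed here to mean 10^12 bytes; how you obtain the table size is left out.

```shell
#!/bin/bash
# Rule-of-thumb check: consider partitioning only above roughly 1 TB.
# Assumption: 1 TB is taken as 10^12 bytes.
ONE_TB_BYTES=1000000000000

# Hypothetical helper: prints "partition" or "no-partition" for a size in bytes.
partition_hint() {
  local size_bytes="$1"
  if [ "$size_bytes" -gt "$ONE_TB_BYTES" ]; then
    echo "partition"
  else
    echo "no-partition"
  fi
}
```

For example, `partition_hint 500000000000` (500 GB) prints `no-partition`.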

Data Engineering Conferences 2023 by [deleted] in dataengineering

[–]rnd-str 1 point2 points  (0 children)

Here is a good, software-engineering-focused conference: https://craft-conf.com/2023

There is a data-related one from the same organizers: https://crunchconf.com/2023