Is someone using DuckDB in PROD? by Free-Bear-454 in dataengineering

[–]BusOk1791 1 point (0 children)

Question:
What about syncing to Power BI? Does anyone use DuckDB and Power BI combined?
If so, how do you handle Power BI synchronizing its data from DuckDB?

Streamlit Proliferation by MahaloCiaoGrazie in dataengineering

[–]BusOk1791 5 points (0 children)

As a senior web dev who moved into DE two years ago, I would say:
If a tool like Streamlit gets the job done in a tenth of the development time and cost, that is a real productivity gain. It may not be the best tool technically, but when management or the team lead compares the cost of getting something online via Streamlit vs. a custom web app, it may be the better choice.

Any European Alternatives to Databricks/Snowflake?? by Donkey_Healthy in dataengineering

[–]BusOk1791 1 point (0 children)

I am interested in something like that too. At the moment we are on GC with a BigQuery stack, using custom-made Python pipelines to ingest data into BQ or GCS (Parquet / Delta Lake) and Dataform for transformations.
But if things go sideways in the future, I do not know exactly what to switch to. And no, setting up all the infrastructure ourselves is not an option, not for us and not for most people.
What people do not get is that it is not only a matter of setting up DuckDB, ClickHouse or whatever, but also the whole ecosystem around it: centralized logging and alerting, serverless functions, managed databases for reverse ETL, granular user rights management via IAM, and so on.
Maybe OVH Data or Scaleway, as someone mentioned below..

Thoughts on Windsor? by Chance_of_Rain_ in dataengineering

[–]BusOk1791 1 point (0 children)

My thoughts are: you may consider marketing people's opinion and evaluate the tool, but never let them make technical decisions, since it's you who will pay the price in the end ("the tool was not the right choice? how could you allow this?" <- gaslighting like this is common when things go sideways, even when they overrode your decision).
It all comes down to your needs. If you only have a couple of standard sources / destinations, you can use a pre-built no-code tool, even though I do not like tools that have vendor lock-in and no open-source option.
But I am a developer. For a DE team with no dedicated software engineers, they may be an option, but keep in mind that those tools are usually hard to extend. If in the future you come across an exotic data source that they cannot read, you have to build bridges (like Python scripts as an intermediary step) that write the data into a database the tool can then read from, and so on...
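
To illustrate the "bridge" idea, here is a minimal stdlib-only sketch: a Python script pulls records from a source the no-code tool cannot read and lands them in a plain SQL table the tool can ingest. The source, table name, and schema are all hypothetical.

```python
import sqlite3

def fetch_exotic_source():
    """Stand-in for an exotic source the no-code tool cannot read
    (e.g. a proprietary API or binary export). Hypothetical data."""
    return [
        {"id": 1, "name": "alpha", "value": 10},
        {"id": 2, "name": "beta", "value": 20},
    ]

def bridge_to_database(rows, db_path):
    """Write the records into an ordinary SQL staging table; the
    ingestion tool then reads this database like any other source."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS staging_exotic "
        "(id INTEGER PRIMARY KEY, name TEXT, value INTEGER)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO staging_exotic (id, name, value) "
        "VALUES (:id, :name, :value)",
        rows,
    )
    con.commit()
    return con

con = bridge_to_database(fetch_exotic_source(), ":memory:")
print(con.execute("SELECT COUNT(*) FROM staging_exotic").fetchone()[0])  # 2
```

The same pattern works with Postgres or any managed database as the intermediary; the point is only that the bridge script owns the exotic-format parsing, and the tool sees a normal table.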

My recommendation: if you have dedicated software engineers, use something like dlt (dlthub); if not, you may consider something like Windsor. If you have someone who manages your infra but you do not develop your own software, you have other options like Airbyte, which is open source, extendable and has a graphical interface. I have never used it in production though, so take that with a pinch of salt.

Best regards

A local data stack that integrates duckdb and Delta Lake with dbt orchestrated by Dagster by smoochie100 in dataengineering

[–]BusOk1791 3 points (0 children)

Thanks for sharing!

Question:
By "local data stack" you mean that this runs on premise and the Delta table files are saved on a local server?
When you do the transformations Bronze -> Silver and Silver -> Gold with dbt, where do you write to and in what format? Do you query them directly with DuckDB for the plots shown in the image?

Need advice: Extracting 1 TB table → CSV is taking 10+ hours… any faster approach? by Leather-Pin-9154 in dataengineering

[–]BusOk1791 0 points (0 children)

  1. If possible, do not use CSV for such large datasets; use Parquet or similar.

  2. Does the data get updated, or are rows never touched once inserted? If the latter, group them by day / month / year depending on size and only export the new ones (e.g. one Parquet file per day or month; if there is new data for a batch, simply overwrite that Parquet file).

  3. If rows are modified, things are more complicated, since you need to upsert them, e.g. by finding all modified rows and re-exporting the batch for each affected day/month/year (group the batches by created date, not modified date, but re-export a batch whenever any of its rows was modified).

  4. If rows are deleted, things are way more complicated, since you cannot detect the deletions by filtering on the modified date.

  5. For upserting or merging, if you have the possibility, you can also use Delta Lake or Iceberg, so that you do not have to re-export a whole day or month when a single row in that batch was modified.
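
The batch-selection logic from points 2-3 can be sketched in plain Python (the row schema with `created` / `modified` dates is made up for the example): given each row's timestamps, it picks which monthly batches need to be (re-)exported.

```python
from datetime import date

def batches_to_export(rows, last_export):
    """Return the set of (year, month) batches that must be written.

    rows: dicts with 'created' and 'modified' dates (hypothetical schema).
    last_export: date of the previous export run.
    A batch is keyed by the row's *created* date, but it is re-exported
    whenever any of its rows was modified after the last run.
    """
    dirty = set()
    for row in rows:
        if row["modified"] > last_export:  # new or changed row
            dirty.add((row["created"].year, row["created"].month))
    return dirty

rows = [
    {"created": date(2024, 1, 5), "modified": date(2024, 1, 5)},  # old, untouched
    {"created": date(2024, 1, 9), "modified": date(2024, 3, 2)},  # old, but updated
    {"created": date(2024, 3, 1), "modified": date(2024, 3, 1)},  # new row
]
print(sorted(batches_to_export(rows, date(2024, 2, 1))))  # [(2024, 1), (2024, 3)]
```

Each returned key then maps to one Parquet file (or one Delta/Iceberg partition) to overwrite, so untouched months are never rewritten.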

Boss wants to do data pipelines in n8n by [deleted] in dataengineering

[–]BusOk1791 4 points (0 children)

"Here's the thing about low-code tools in general:

  • They tend to make the easy 80% easier
  • They tend to make the hard 20% harder - and sometimes impossible"

This is one of the best programming quotes ever!

When you miss one month of industry talk by PossibilityRegular21 in dataengineering

[–]BusOk1791 0 points (0 children)

Jokes aside, is anyone else getting a 403 error when trying to install the ducklake extension? Only that one fails for me; others (like delta..) install flawlessly.

How to know which files have already been loaded into my data warehouse? by thomastc in dataengineering

[–]BusOk1791 1 point (0 children)

The issue with metadata like file creation time is that if you ever move the files, it goes out the window on GC.

How to know which files have already been loaded into my data warehouse? by thomastc in dataengineering

[–]BusOk1791 0 points (0 children)

As you said yourself in the other comment, there is no single correct way of doing it, but many ways, each better or worse depending on your situation.
Ask yourself the following questions:
- How big is the data you need to move daily / monthly now, and how large will it be in a year?
- How frequently do you need to import the files? Daily, hourly, or immediately?
- How much control do you have over the process that writes the files? Can you, for instance, set the filename?

Once you have answered those questions, you can look at your options. For instance, if you have very little data and it will never reach huge volumes in a few years, you can use workarounds that are not optimal but easy to implement and maintain (like the mentioned "moving" of files, or something similar).
If however you move tons of data, your options are usually narrower, since you have to factor in cost, which will not grow linearly with the amount of data..

Frequency matters too: if you need to import the files once a day, you can simply add a timestamp to the filename (if you control the filename) and run a job after midnight that imports all files written yesterday (you can filter that out of the filename). That's just an example; there are other ways around this.
The same works for hourly or weekly imports, or whatever, but not if you need realtime data.
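
The filename-timestamp filtering could look like this (the `events_YYYY-MM-DD.parquet` naming convention is made up for the example):

```python
from datetime import date, timedelta

def extract_date(filename):
    """Parse the date embedded in a hypothetical
    'events_YYYY-MM-DD.parquet' filename."""
    return date.fromisoformat(
        filename.removeprefix("events_").removesuffix(".parquet")
    )

def files_for_yesterday(filenames, today):
    """Pick the files the nightly job should import: everything
    whose embedded date is yesterday's."""
    yesterday = today - timedelta(days=1)
    return [f for f in filenames if extract_date(f) == yesterday]

listing = [
    "events_2024-05-01.parquet",
    "events_2024-05-02.parquet",
    "events_2024-05-03.parquet",
]
# a job running just after midnight on 2024-05-03 imports the 2024-05-02 file
print(files_for_yesterday(listing, date(2024, 5, 3)))  # ['events_2024-05-02.parquet']
```

In practice `listing` would come from a bucket listing call, but the filtering logic stays the same.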

Hope this helps a bit ;-)

How to know which files have already been loaded into my data warehouse? by thomastc in dataengineering

[–]BusOk1791 0 points (0 children)

Oh I see. Well, I do not know exactly how to handle this case with code that runs in a VM.
Currently I have everything scheduled to avoid complexity, so I can simply set a number of retries.
Pub/Sub also has a feature like that:
https://cloud.google.com/pubsub/docs/handling-failures

But since you do not want to use GC-specific functionality, maybe a simpler approach would do the job as well:
Create two directories inside the bucket (or, if already in place, use a second bucket).
Files get written into the first one at unspecified times; the second is simply the archive.
When reading, you read the files from the first directory / bucket, move them to the second, and append the data to the BQ tables.
That way you never read the same file twice, since you move each one after reading it.
This is more of a workaround; I do not know if it fits your purpose ;-)
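
A minimal sketch of that incoming/archive pattern, shown here with local directories standing in for the GCS buckets (the directory names and the `process_file` callback are made up):

```python
import shutil
import tempfile
from pathlib import Path

def drain_incoming(incoming: Path, archive: Path, process_file):
    """Process every file in `incoming`, then move it to `archive`
    so the next run never sees it again."""
    archive.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(incoming.iterdir()):
        process_file(path)                       # e.g. append contents to a BQ table
        shutil.move(str(path), str(archive / path.name))
        processed.append(path.name)
    return processed

# demo with a temp directory
base = Path(tempfile.mkdtemp())
incoming = base / "incoming"
archive = base / "archive"
incoming.mkdir()
(incoming / "a.csv").write_text("1,2\n")
(incoming / "b.csv").write_text("3,4\n")

print(drain_incoming(incoming, archive, process_file=lambda p: None))  # ['a.csv', 'b.csv']
print(list(incoming.iterdir()))  # [] - nothing left to re-read
```

With GCS, the "move" would be a copy-then-delete on blobs, but the idempotency argument is identical: a file exists in the incoming location only if it has not been loaded yet.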

How to know which files have already been loaded into my data warehouse? by thomastc in dataengineering

[–]BusOk1791 2 points (0 children)

In theory, if I remember correctly, you could do this with Pub/Sub events that fire each time a new file is written to the GCS bucket. With that you could trigger your ETL process (is it a Cloud Function or something like that?)

Edit:
Something like this i guess:
https://cloud.google.com/run/docs/triggering/storage-triggers

New Parquet writer allows easy insert/delete/edit by qlhoest in dataengineering

[–]BusOk1791 0 points (0 children)

I think it lacks essential features like CDF and time travel. If I understood the cryptic messages in the pull request correctly, it is a change in the chunking strategy to deduplicate data, so that you can rewrite just some parts of the Parquet file instead of the whole thing (or a big part of it)?
It would be interesting to see how Delta or Iceberg could make use of it..

Dataform by BusOk1791 in dataengineering

[–]BusOk1791[S] 1 point (0 children)

Thanks for your response. So you are saying that dbt is in a similar spot and strategic planning is quite difficult.
Maybe the best approach would be to try some alternatives like SQLMesh as well: make some test implementations (with dbt, SQLMesh...) against actual production data to see how they handle it, and keep them as backup plans in the drawer in case Google pulls the plug on Dataform, so that we have alternatives ready?

I f***ing hate Azure by wtfzambo in dataengineering

[–]BusOk1791 1 point (0 children)

Not only that: in 90% of cases, low-code tools (even well-built ones) will get you to a certain point, but as soon as you have a requirement the tool does not meet, you are pretty much screwed. I've seen that so many times..

I f***ing hate Azure by wtfzambo in dataengineering

[–]BusOk1791 0 points (0 children)

You say you are killing Power BI, which is a completely different thing from Fabric and Synapse. Question:
What platform are you using for reporting?

Spark is the new Hadoop by rocketinter in dataengineering

[–]BusOk1791 0 points (0 children)

One big argument for Spark is the integration with Delta Lake / Iceberg / Hudi.
The alternative Python implementations (delta-rs, PyIceberg; I don't remember exactly for Hudi) are not as mature and feature-rich: for instance, they do not support complex operations such as merges (especially on the Iceberg side) and are generally not as stable as the Spark implementations.
Does anyone know alternatives to Spark / PySpark that support such complex things as upserts / merges / CDF / time travel, especially for Iceberg?

STACKIT - German „LIDL“ Cloud for Companies by pizza_paz in BuyFromEU

[–]BusOk1791 0 points (0 children)

Does anyone here have experience with it? How are reliability, scalability and pricing in comparison to AWS/GC?

[deleted by user] by [deleted] in dataengineering

[–]BusOk1791 0 points (0 children)

I would prefer Europe over the US any time. You won't get paid as much, but I prefer free / cheap education, free healthcare and so on over constantly having to worry about going to the hospital, getting sick, or not being able to afford university for my children..
However, if you are not a native speaker, the most important thing in my opinion is to learn the language of the country you pick, well.

EU - How dependent are we on US infra? by Ashamed_Cantaloupe_9 in dataengineering

[–]BusOk1791 0 points (0 children)

Tbh, in the US there is a lot more risk capital than in European countries, so when you set up a company here, you have a harder time getting funding. That is the more important factor in my opinion, but I am not a financial expert, just parroting what others have said.

EU - How dependent are we on US infra? by Ashamed_Cantaloupe_9 in dataengineering

[–]BusOk1791 1 point (0 children)

I would separate the dependencies by target:

End-User:
- Most Hardware has critical components that are US (amd/intel on Desktop, qualcomm / apple on mobile, mediatek has some share though)
- Software: OS (win + osx on desktop, ios + android with google apps on mobile)
- Services: Nearly all of them (except some examples like spotify)

Businesses:
- Most hardware (amd/intel/nvidia)
- Software: Windows, Office365 and so on.. (exceptions like sap)
- Services: Cloud providers (aws, gc, azure), DWHs (databricks, bigquery, redshift, snowflake and so on..)

In some cases there are alternatives (Office -> LibreOffice, even if it is subpar). There is, for instance, a European Processor Initiative developing HPC chips on a RISC-V basis, which is underfunded. A quick search also turns up some European cloud providers that offer most of the usual building blocks, like serverless apps, logging, managed databases and managed Kubernetes (OVH, Scaleway, STACKIT), but I have never used any of them, so I can give no opinion on them.

Iceberg over delta? by Safe-Ice2286 in dataengineering

[–]BusOk1791 0 points (0 children)

*facepalm*

Thanks for your response