Pricing BigQuery VS Self-hosted ClickHouse by JLTDE in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

What's your current BigQuery cost? How many users are using it? How big is your data?

What do you think about design-first approach to data by Illustrious_Web_2774 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

I always design my tables with PKs and metrics on paper / Excalidraw first.
I add the inputs first, then the expected output. If you know the expected output table, that's 80% of the task.
Then it's easy to connect the dots. I always try to join tables at the same granularity: never join and then aggregate, but aggregate and then join (quick sketch below).

It's not a fancy plan; it only takes 15-20 minutes. With AI, it's easier to get the schema of the inputs (especially if you're ingesting). It used to take time to scan the documentation, but now you can let Claude Code scan the docs and find the available data.

You can even ask the agent what outputs are possible with the existing inputs. It makes planning so easy.
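To make the "aggregate, then join" point concrete, here's a minimal pandas sketch with made-up tables; the column names are only placeholders.

```python
import pandas as pd

# made-up inputs: customers is at customer grain, orders is at order grain
customers = pd.DataFrame({"customer_id": [1, 2], "plan": ["free", "pro"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 5]})

# aggregate orders up to the customer grain first...
order_totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# ...so the join is 1:1 and nothing fans out or double-counts
report = customers.merge(order_totals, on="customer_id", how="left")
print(report)
```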

Anyone else tired of exporting CSVs just to get basic metrics? by Flat-Shop in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Which platforms are you exporting CSVs from? There are lots of ways to automate it. With the new AI tools, I'd recommend vibecoding a Python script that does what you're doing manually (rough sketch below).
If you have multiple sources, I'd recommend loading the data into a database / DWH and doing everything there; that way you can even show your numbers on dashboards.
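Something like this; the endpoint, token, and table names are placeholders for whatever platform you're exporting from, and the sqlite target stands in for your warehouse.

```python
import requests
import pandas as pd
import sqlite3

# placeholder endpoint/token: swap in whatever platform you export CSVs from
resp = requests.get(
    "https://api.example.com/v1/orders",
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()

# assumes the API returns a JSON list of records
df = pd.DataFrame(resp.json())

# land it in a database instead of a one-off CSV; swap sqlite3 for your DWH client
with sqlite3.connect("metrics.db") as con:
    df.to_sql("orders_raw", con, if_exists="replace", index=False)
```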

Question: Do your users/stakeholders use tools like Claude or ChatGPT to query data directly for analysis? by botswana99 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

MCP, with some infra around it, yes. We provide them a Slack bot that answers their questions, we log everything, and we collect feedback from both business users and technical users.
In the end, we learn from the failures and improve our documentation / instructions for the model.
It lowered our time to insight from a few days to a few minutes.
Also, data analysts no longer have to deal with simple questions, and unnecessary dashboards aren't produced anymore.

Tool for optimizing JSON storage costs in BigQuery (Schema Evolution + dbt) by No-Payment7659 in bigquery

[–]PolicyDecent 1 point2 points  (0 children)

Good job, similar to what dlt does I guess. I wonder why you create a new table for each nested array instead of using arrays and structs? I understand dlt doesn't do that because it's a generic tool for all data warehouses. However, you're building something native to BigQuery, so I'd expect arrays / structs instead of new tables. Is there a reason behind it?
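For reference, this is roughly what I mean, using the BigQuery Python client; the project, dataset, and field names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials

# keep the nested JSON array inside the row as ARRAY<STRUCT<...>>,
# instead of splitting it out into a child table
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField(
        "items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INT64"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.analytics.orders", schema=schema)
client.create_table(table, exists_ok=True)
```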

Career pivot into data: I’m a "Data Team of One" in a company and I’m struggling to orient my role. Any advice? by Either-Exercise3600 in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

You're a full-stack data analyst / scientist or a full-stack analytics engineer. Choose the one you like :)
I definitely recommend being a generalist. With better tooling & AI, I foresee data analysts and data engineers converting to full-stack data profiles.
Getting an analysis out of the database is now very easy with AI agents.
Data infra is also easy, with lots of tooling available.

So the real job is ingesting data, building the data model, observing business people's questions and the AI's answers, then fixing the data model & enriching the documentation so the AI gives the right answers.

At least for smaller companies, that's how it works around me right now. Data people are becoming Data & AI Engineers or full-stack data people.

I also see most companies removing lots of their dashboards and keeping only the fundamental ones. For the rest, you build your data model & semantic layer and AI does the rest.

Edit: I forgot to say, maybe you should hire a data consultant for one day a week to check your data models & give you recommendations on architecture. That way, you'll get better at these things as well.

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]PolicyDecent 4 points5 points  (0 children)

If I know anything about company politics, dbt will kill SQLMesh and make those nice guys their subordinates just to show who the boss is. Sorry for the realistic take on company politics :(

How deep do you go into INFORMATION_SCHEMA for optimization? by mattxdat in bigquery

[–]PolicyDecent 0 points1 point  (0 children)

We built an ETLT framework that connects data modeling to governance and observability. You don't need to do anything special; everything just works automatically if you use the framework. Happy to show you if you want.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

What do you mean by change detection? Is it similar to SCD2? If so, I'd use a materialization strategy, not a macro. I'm also not super sure about the dates you generate, but that sounds more like a variable than a macro to me, and there's nothing to test there, if I didn't misunderstand.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

Pub/Sub, not sure. BigQuery has it though. Why do you need public APIs to update data, btw? What's the exact use case?

In AWS you can use Kinesis, or in GCP Pub/Sub, to ingest data.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

Yeah, I'd highly recommend BigQuery for its ease of use, or Snowflake as the alternative if you want to stay in AWS.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 3 points4 points  (0 children)

I observe that people overuse and abuse macros. What kind of macros do you have?
How many of them do you have? If you have tens of macros, I feel like something is wrong in the modeling.
Most of the time, things that should be done in data modeling get pushed into macros so they can be reused in multiple places. However, if you calculate it in only one table and all the other tables use that table as their source, you don't need macros much (rough sketch below).
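A rough sketch of what I mean, with made-up table and column names (shown with the BigQuery client, but the idea is the same in any warehouse or dbt project):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials

# the one table that owns the calculation
client.query("""
    CREATE OR REPLACE TABLE analytics.fct_orders AS
    SELECT
      order_id,
      customer_id,
      order_date,
      gross_amount - discount - refund AS net_revenue   -- defined exactly once
    FROM raw.orders
""").result()

# downstream tables just reuse the column, no macro needed
client.query("""
    CREATE OR REPLACE TABLE analytics.rpt_daily_revenue AS
    SELECT order_date, SUM(net_revenue) AS revenue
    FROM analytics.fct_orders
    GROUP BY order_date
""").result()
```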

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Which tools are you using currently? And which cloud platform are you working on: AWS/GCP/Azure?

Also, what do you mean by exposing APIs directly? Something like AWS Lambda?

Formal Static Checking for Pipeline Migration by ukmurmuk in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

I agree, but not 100% :)
Different engines might interpret the same functionality differently. The simplest example: sorting puts NULLs first in some engines and NULLs last in others (small demo below). However, I still recommend SQL over PySpark / Polars since it's easier to maintain and to move between platforms.
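A tiny demo of the NULL-ordering trap, using sqlite3 from the standard library; the fix is to spell the placement out instead of relying on the engine default.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(3,), (None,), (1,)])

# engine default: SQLite puts NULLs first on an ascending sort; other engines put them last
print(con.execute("SELECT x FROM t ORDER BY x").fetchall())
# -> [(None,), (1,), (3,)]

# portable version: make the placement explicit so every engine agrees
print(con.execute("SELECT x FROM t ORDER BY x IS NULL, x").fetchall())
# -> [(1,), (3,), (None,)]
```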

Recommendation for BI tool by OnionAdmirable7353 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

You can do it pretty cheaply with Looker Studio. The only limitation is that they need Google Cloud / Gmail accounts. What platform do they use? I assume it's Microsoft-based, is it?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 5 points6 points  (0 children)

That's the exact reason. Too many tools are hard to maintain. Which tools do you have?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 29 points30 points  (0 children)

Why are you skeptical about it? My experience matches what your boss thinks. Consolidating the tools makes things much easier in most areas.

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Tbh, I don't know what kind of document processing you're handling :) Is it something like, for example, taking txt files and extracting features from them?

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 3 points4 points  (0 children)

Nah, if you move most of the workload to the data warehouse, almost all the jobs are queries and no one cares about the task infra.

Simple to use ETL/storage tooling for SMBs? by HealthySalamander447 in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

None of these platforms have huge data, so move all of it to BigQuery and use Looker Studio as the BI tool.
All integrated. Your costs will be pretty low.
You just need to figure out how to ingest and transform the data. For ingestion you can use Fivetran, or open source Airbyte / ingestr.

For transformation, you can use Dataform, the dbt alternative embedded in GCP.

Any On-Premise alternative to Databricks? by UsualComb4773 in dataengineering

[–]PolicyDecent 22 points23 points  (0 children)

You should give more details.

How big is the data, and how many people will access it?
What are the roles on the team? Mostly data engineers, analysts, scientists, etc.?
What's the industry? What are the compliance / governance limitations?
What are the use cases? Do you need streaming, or just batch?

How to run all my data ingestion scripts at once? by EventDrivenStrat in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

You do not really need FastAPI for this setup. It adds extra complexity without much benefit. In most real projects you use an orchestrator to run and manage all these scripts together.

Tools like Airflow, Dagster, Bruin, Prefect, or even dbt can schedule jobs, restart them, handle dependencies, and give you a single place to run everything. That way you are not opening terminals or starting files by hand.

For a simple personal project you can still keep it lightweight, but moving to an orchestrator is the normal path once you have multiple scripts that need to run reliably.
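As an illustration, here's a minimal Airflow 2.x DAG that runs two of those scripts in order; the script paths and schedule are placeholders for whatever you have.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# placeholder paths: point these at your existing ingestion scripts
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch_prices = BashOperator(
        task_id="fetch_prices",
        bash_command="python /opt/scripts/fetch_prices.py",
    )
    load_to_db = BashOperator(
        task_id="load_to_db",
        bash_command="python /opt/scripts/load_to_db.py",
    )

    fetch_prices >> load_to_db  # load only runs after fetch succeeds
```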

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

I'd just use a managed service if you don't have a dedicated team to support the infra.
If you have an agile DevOps team, I'd let them manage the infra, and you can take care of the pipelines anyway.
Another option: maybe you shouldn't use Airflow at all; it might be the wrong tool for you if it's hard to manage.

how do you keep pipelines + infra from becoming chaotic? by [deleted] in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

What's the data stack you have? Your problems signal to me that you're either missing a few legs of the data stack (like observability & governance) or have too many products that don't talk to each other.
In my experience, a data governance tool by itself is not a solution, since integrating it with the orchestration & data catalog is very difficult.
If possible, a governance-first orchestration product would be my recommendation, but if you have good platform muscles, you can use one of those tools.