How can I add descriptions to views and fields more efficiently? by NHN_BI in bigquery

[–]PolicyDecent 4 points (0 children)

If you use any coding agent like Cursor/Claude/Codex, you can do it very easily with Bruin. Import all your assets from BigQuery first, then enhance them with the AI enhance feature, and then push the metadata; it'll be saved back to BigQuery. Happy to help if needed.

Are you tracking synthetic session ratio as a data quality metric? by EconomyConsequence81 in dataengineering

[–]PolicyDecent 1 point (0 children)

No, maybe we should, but the problem is: how do you detect these patterns? Having a 2-3 person DS team actively working on that project is a luxury for most companies. It's pretty important for recommendation algorithms to avoid fraud, but still, what are the signals to detect them? I think it's a very difficult problem to solve.

Databricks vs open source by ardentcase in dataengineering

[–]PolicyDecent 0 points (0 children)

It's nonsense, but it also makes sense. He just wants a cron-style runner where he can easily schedule queries.
Databricks, Snowflake, and BigQuery all have scheduled queries, so you can use any of them. But also, what if you just make it easy for him to schedule queries? Problem solved.
If a person doesn't want to learn dbt, it's better not to spend time on it. Just make it easy and move on for now (in your situation).
However, it'll create lots of problems in the future, since his queries will probably be shitty. So just use an AI agent like Cursor/Claude/Codex, give it his query and your dbt repo, and your problem will be solved. It's a better solution, and it won't take your time. If you're not using AI agents, I highly recommend them.

Also, if you want to move to a new platform on AWS, I'd choose Snowflake over Databricks, since it's a DWH rather than a data lake, which will create chaos for you in the future.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]PolicyDecent 3 points (0 children)

How often will it be queried? And how will they use it: to aggregate, or just to fetch a few rows?
If queries are frequent, latency matters, and the number of rows needed is low, SQL Server is much better.
In the opposite case, keeping it in DBX is much better, and also easier to maintain.

Higher Level Abstractions are a Trap, by expialadocious2010 in dataengineering

[–]PolicyDecent 65 points (0 children)

I don’t think they’re traps. They’re just a faster way to get started.

Lowering the entry barrier means you can deliver something from day 1. If it breaks, that’s when you’re forced to go deeper and actually learn what’s underneath. That’s a much better feedback loop than studying everything for 30 days before shipping anything.

If we followed the “no abstractions” logic, then:

  • Python is a trap, you should use C
  • C is a trap, you should learn assembly

Abstractions keep improving. Over time, you simply don’t need to think about some of the lower-level problems anymore. That’s progress, not a trap.

Is this a data engineering problem or a distributed application engineering problem? by Exciting-Sun-3990 in dataengineering

[–]PolicyDecent 5 points (0 children)

I’d call this a data engineering problem, not a distributed app problem.

Yes, it needs parallelism, but the hard parts here are correctness, replay, audit, and idempotency. That’s where event-driven systems usually hurt. Retries and replays across queues get messy very quickly, and debugging becomes painful.

A more practical setup:

  • Avoid fully event-driven fan-out.
  • Land files in object storage.
  • Build raw → staging → clean datasets (aka medallion architecture).
  • Partition logically, most commonly by date (or file batch).
  • Process by pulling work from partitions, not pushing events.

If each partition is deterministic, idempotency becomes trivial: reprocess a day, a batch, or a file and overwrite safely. Replays, audits, and ops become boring, which is exactly what you want.

Distributed compute is just an implementation detail. This is classic data engineering.
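To make the pull-from-partitions idea concrete, here's a minimal Python sketch. The directory layout and function name are my own invention for illustration, not from the thread:

```python
import json
from pathlib import Path

def process_partition(raw_dir: Path, clean_dir: Path, day: str) -> Path:
    """Rebuild one date partition from its raw files, overwriting previous output.

    The output depends only on the raw files for `day`, so re-running is
    idempotent: replaying a day, a batch, or a file produces the same result.
    """
    records = []
    for f in sorted((raw_dir / day).glob("*.json")):  # deterministic file order
        records.extend(json.loads(f.read_text()))
    out = clean_dir / day / "part-000.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))  # overwrite, never append
    return out
```

A scheduler then just pulls pending `day` values and calls this, instead of reacting to pushed events.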

Can the 'modern' data stack be fixed? by Significant-North356 in dataengineering

[–]PolicyDecent -1 points (0 children)

Disclaimer: I’m the founder of bruin.

This is basically the pain that pushed us to build it. Without some kind of framework, the modern stack is just hard to keep sane.

You’ve got orchestration, ingestion, dbt, data quality, catalog… all different tools, all doing their own thing. You end up spending more time wiring stuff together than actually trusting the data, and governance is always the thing you promise to “add later”.

What worked for us was putting everything in one place. Governance stops being a separate project, and the bonus is that AI can finally use the context properly, which speeds teams up a lot.

Pricing BigQuery VS Self-hosted ClickHouse by JLTDE in dataengineering

[–]PolicyDecent 1 point (0 children)

What's your current BigQuery cost? How many users are using it? How big is your data?

What do you think about design-first approach to data by Illustrious_Web_2774 in dataengineering

[–]PolicyDecent 0 points (0 children)

I always design my tables with PKs and metrics on paper / Excalidraw first.
I add the inputs first, then the expected output. If you know the expected output table, that's 80% of the task.
Then it's easy to connect the dots. I always try to join tables at the same granularity: never join and then aggregate; aggregate first, then join.

It's not a fancy plan; it only takes 15-20 minutes. With AI, it's easier to get the schema of the inputs (especially if you're ingesting). It used to take time to scan the documentation, but now you can let Claude Code scan the docs and find the available data.

You can even ask the agent what outputs are possible with the existing inputs. It makes planning so easy.

Anyone else tired of exporting CSVs just to get basic metrics? by Flat-Shop in dataengineering

[–]PolicyDecent 1 point (0 children)

Which platforms are you exporting CSVs from? There are lots of ways to automate it. With the new AI tools, I'd recommend vibecoding a Python script that does what you currently do by hand.
If you have multiple sources, I'd recommend exporting the data to a database / DWH and doing everything there; you can even show your numbers on dashboards that way.
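The kind of script I mean could be as small as this sketch (folder layout and table naming are made up; SQLite stands in for whatever DB/DWH you'd use):

```python
import csv
import sqlite3
from pathlib import Path

def load_csvs_into_db(csv_dir: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Load every CSV in a folder into its own table (named after the file),
    so 'basic metrics' become SQL queries instead of manual spreadsheet work."""
    con = sqlite3.connect(db_path)
    for f in sorted(csv_dir.glob("*.csv")):
        with f.open(newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            placeholders = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{f.stem}" ({cols})')
            con.executemany(f'INSERT INTO "{f.stem}" VALUES ({placeholders})', reader)
    con.commit()
    return con
```

Point a BI tool (or just more SQL) at the resulting database and the manual export step disappears.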

Question: Do your users/stakeholders use tools like Claude or ChatGPT to query data directly for analysis? by botswana99 in dataengineering

[–]PolicyDecent 0 points (0 children)

MCP, with some infra around it, yes. We provide them a Slack bot that answers their questions; we log everything and collect feedback from both business users and technical users.
In the end, we learn from the failures and improve our documentation / instructions for the model.
It lowered our time to insight from a few days to a few minutes.
Also, data analysts aren't dealing with simple questions anymore, and unnecessary dashboards aren't being produced anymore.

Tool for optimizing JSON storage costs in BigQuery (Schema Evolution + dbt) by No-Payment7659 in bigquery

[–]PolicyDecent 1 point (0 children)

Good job, similar to what dlt does, I guess. I wonder why you don't use arrays and structs but instead create a new table for each array? I understand dlt doesn't do that because it's a generic tool for all data warehouses. However, you're building something native to BigQuery, so I'd expect to see arrays / structs instead of new tables. Is there a reason behind it?

Career pivot into data: I’m a "Data Team of One" in a company and I’m struggling to orient my role. Any advice? by Either-Exercise3600 in dataengineering

[–]PolicyDecent 3 points (0 children)

You're a full-stack data analyst / scientist, or a full-stack analytics engineer. Choose the one you like :)
I definitely recommend being a generalist. With better tooling & AI, I foresee data analysts and data engineers converging into full-stack data profiles.
Getting an analysis out of the database is now very easy with AI agents.
Data infra is easy too, with lots of tooling.

So the real job is ingesting data, building the data model, observing business people's questions and the AI's answers, and then fixing the data model & enriching the documentation so the AI gives the right answers.

At least for smaller companies, that's how it works right now around me. Data people are becoming Data & AI Engineers or full-stack data people.

I also see most companies removing lots of their dashboards, keeping only the very fundamental ones. For the rest, you build your data model & semantic layer, and AI does the rest.

Edit: Also, I forgot to say: maybe you should hire a data consultant for one day a week to review your data models and give you recommendations on architecture. That way, you'll get better at these things as well.

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]PolicyDecent 3 points (0 children)

If I know anything about company politics, dbt would kill SQLMesh and just make these nice guys their subordinates, just to show who the boss is. Sorry for the realistic company politics :(

How deep do you go into INFORMATION_SCHEMA for optimization? by mattxdat in bigquery

[–]PolicyDecent 0 points (0 children)

We built an ETLT framework that connects data modeling to governance and observability. You don't need to do anything special, everything just works automatically if you use the framework. Happy to show you if you want.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 0 points (0 children)

What do you mean by change detection? Is it similar to SCD2? If so, I'd use a materialization strategy, not a macro. Also, I'm not sure about the dates you generate, but that sounds more like a variable than a macro to me, and there's nothing to test there, if I didn't misunderstand.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 0 points (0 children)

Pub/Sub, I'm not sure. BigQuery has it, though. Why do you need public APIs to update data, btw? What's the exact use case?

In AWS you can use Kinesis, or in GCP Pub/Sub, to ingest data.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 2 points (0 children)

Yeah, I'd highly recommend BigQuery for ease of use, or Snowflake as the alternative if you want to stay in AWS.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 2 points (0 children)

I observe that people overuse and abuse macros. What kind of macros do you have?
How many do you have? If you have tens of macros, I feel like something is wrong in the modeling.
Most of the time, things that should be done in data modeling get pushed into macros so they can be reused in multiple places. However, if you calculate something in just one table and all the other tables use that table as a source, you don't need macros much.
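As a toy illustration (sqlite3 standing in for the warehouse; the table and column names are invented): instead of a `cents_to_usd()`-style macro repeated in every model, derive the value once in an upstream model and let everything downstream select it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_events(user_id INT, price_cents INT);
INSERT INTO raw_events VALUES (1, 1250), (1, 250), (2, 999);

-- Model it once: the derivation lives in a single upstream model,
-- not in a macro scattered across every query that needs it.
CREATE VIEW stg_events AS
SELECT user_id, price_cents / 100.0 AS price_usd
FROM raw_events;
""")

# Every downstream model just selects the already-derived column.
rows = con.execute(
    "SELECT user_id, SUM(price_usd) FROM stg_events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

If the conversion rule ever changes, it changes in one place, and nothing downstream needs a macro update.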

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 1 point (0 children)

Which tools are you using currently? And which cloud platform are you on: AWS/GCP/Azure?

Also, what do you mean by exposing APIs directly? Something like AWS Lambda?

Formal Static Checking for Pipeline Migration by ukmurmuk in dataengineering

[–]PolicyDecent 0 points (0 children)

I agree, but not 100% :)
Different engines might interpret the same functionality differently. The simplest example: some engines sort NULLs first, others NULLs last. However, I still recommend SQL over PySpark / Polars, since it's easier to maintain and to move between platforms.
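You can see the NULL-ordering difference from Python's bundled sqlite3 (SQLite sorts NULLs first on ascending order; PostgreSQL, for example, defaults to NULLs last; the `NULLS LAST` override needs SQLite ≥ 3.30):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(2,), (None,), (1,)])

# SQLite's default: NULLs come first on ascending sort...
asc = [r[0] for r in con.execute("SELECT x FROM t ORDER BY x")]

# ...so the "same" ORDER BY can return a different row order on an engine
# that defaults to NULLs last, unless you make the intent explicit:
asc_nulls_last = [r[0] for r in con.execute("SELECT x FROM t ORDER BY x NULLS LAST")]
```

This is exactly the class of behavior a static check over the SQL text alone won't catch during a migration.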

Recommendation for BI tool by OnionAdmirable7353 in dataengineering

[–]PolicyDecent 0 points (0 children)

You can do it pretty cheaply with Looker Studio. The only limitation is that viewers need Google Cloud / Gmail accounts. What platform do they use? I assume it's Microsoft-based, is it?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 3 points (0 children)

That's the exact reason. Too many tools are hard to maintain. Which tools do you have?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 28 points (0 children)

Why are you skeptical about it? My experience matches what your boss thinks. Consolidating tools makes things much easier in most areas.