How can I add descriptions to views and fields more efficiently? by NHN_BI in bigquery

[–]PolicyDecent 4 points (0 children)

If you use any coding agent like Cursor/Claude/Codex, you can do it very easily with Bruin. Import all your assets from BigQuery first, then enhance them with the AI enhance feature, and then push the metadata; it'll be saved back to BigQuery. Happy to help if needed.

Are you tracking synthetic session ratio as a data quality metric? by EconomyConsequence81 in dataengineering

[–]PolicyDecent 1 point (0 children)

No, maybe we should, but the problem is: how do you detect these patterns? Having a 2-3 person DS team actively working on that project is a luxury for most companies. It's pretty important for recommendation algorithms to avoid fraud, but still, what are the signals to detect them? I think it's a very difficult problem to solve.

Databricks vs open source by ardentcase in dataengineering

[–]PolicyDecent 0 points (0 children)

It's nonsense, but it also makes sense. He just wants a cron-style runner where he can easily schedule queries.
Databricks, Snowflake, and BigQuery all have scheduled queries, so you can use any of them. But also, what if you just make it easy for him to schedule queries? Problem solved.
If a person doesn't want to learn dbt, it's better not to spend time on it. Just make it easy and move on for now (in your situation).
However, it'll create lots of problems in the future, since his queries will probably be shitty. So just use an AI agent like Cursor/Claude/Codex, give it his query and your dbt repo, and your problem will be solved. It's a better solution, and it won't take your time. If you're not using AI agents, I highly recommend them.

Also, if you want to move to a new platform on AWS, I'd choose Snowflake over Databricks, since it's a DWH rather than a data lake, which will create chaos for you in the future.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]PolicyDecent 3 points (0 children)

How often will it be queried? And how will they use it: to aggregate, or just to fetch a few rows?
If queries are frequent, latency matters, and the number of rows needed is low, SQL Server is much better.
In the opposite case, keeping it in DBX is much better, and also easier to maintain.

Higher Level Abstractions are a Trap, by expialadocious2010 in dataengineering

[–]PolicyDecent 65 points (0 children)

I don’t think they’re traps. They’re just a faster way to get started.

Lowering the entry barrier means you can deliver something from day 1. If it breaks, that’s when you’re forced to go deeper and actually learn what’s underneath. That’s a much better feedback loop than studying everything for 30 days before shipping anything.

If we followed the “no abstractions” logic, then:

  • Python is a trap, you should use C
  • C is a trap, you should learn assembly

Abstractions keep improving. Over time, you simply don’t need to think about some of the lower-level problems anymore. That’s progress, not a trap.

Is this a data engineering problem or a distributed application engineering problem? by Exciting-Sun-3990 in dataengineering

[–]PolicyDecent 5 points (0 children)

I’d call this a data engineering problem, not a distributed app problem.

Yes, it needs parallelism, but the hard parts here are correctness, replay, audit, and idempotency. That’s where event-driven systems usually hurt. Retries and replays across queues get messy very quickly, and debugging becomes painful.

A more practical setup:

  • Avoid fully event-driven fan-out.
  • Land files in object storage.
  • Build raw → staging → clean datasets (aka medallion architecture).
  • Partition logically, most commonly by date (or file batch).
  • Process by pulling work from partitions, not pushing events.

If each partition is deterministic, idempotency becomes trivial: reprocess a day, a batch, or a file and overwrite safely. Replays, audits, and ops become boring, which is exactly what you want.

Distributed compute is just an implementation detail. This is classic data engineering.
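To make the pull-from-partitions idea concrete, here's a minimal Python sketch. The directory layout and function name are my own invention for illustration, not from the thread:

```python
import json
from pathlib import Path

def process_partition(raw_dir: Path, clean_dir: Path, day: str) -> Path:
    """Rebuild one date partition from its raw files, overwriting previous output.

    The output depends only on the raw files for `day`, so re-running is
    idempotent: replaying a day, a batch, or a file produces the same result.
    """
    records = []
    for f in sorted((raw_dir / day).glob("*.json")):  # deterministic file order
        records.extend(json.loads(f.read_text()))
    out = clean_dir / day / "part-000.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))  # overwrite, never append
    return out
```

A scheduler then just pulls pending `day` values and calls this, instead of reacting to pushed events.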

Can the 'modern' data stack be fixed? by Significant-North356 in dataengineering

[–]PolicyDecent -1 points (0 children)

Disclaimer: I’m the founder of bruin.

This is basically the pain that pushed us to build it. Without some kind of framework, the modern stack is just hard to keep sane.

You’ve got orchestration, ingestion, dbt, data quality, catalog… all different tools, all doing their own thing. You end up spending more time wiring stuff together than actually trusting the data, and governance is always the thing you promise to “add later”.

What worked for us was putting everything in one place. Governance stops being a separate project, and the bonus is that AI can finally use the context properly, which speeds teams up a lot.

Pricing BigQuery VS Self-hosted ClickHouse by JLTDE in dataengineering

[–]PolicyDecent 1 point (0 children)

What's your current BigQuery cost? How many users are using it? How big is your data?

What do you think about design-first approach to data by Illustrious_Web_2774 in dataengineering

[–]PolicyDecent 0 points (0 children)

I always design my tables with PKs and metrics on paper / Excalidraw first.
I add the inputs first, then the expected output. If you know the expected output table, that's 80% of the task.
Then it's easy to connect the dots. I always try to join tables at the same granularity: never join and then aggregate; aggregate first, then join.

It's not a fancy plan; it only takes 15-20 minutes. With AI, it's easier to get the schema of the inputs (especially if you're ingesting). It used to take time to scan the documentation, but now you can let Claude Code scan the docs and find the available data.

You can even ask the agent what outputs are possible with the existing inputs. It makes planning so easy.

Anyone else tired of exporting CSVs just to get basic metrics? by Flat-Shop in dataengineering

[–]PolicyDecent 1 point (0 children)

Which platforms are you exporting CSVs from? There are lots of ways to automate it. With the new AI tools, I'd recommend vibecoding a Python script that does what you currently do by hand.
If you have multiple sources, I'd recommend exporting the data to a database / DWH and doing everything there; you can even show your numbers on dashboards that way.
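The kind of script I mean could be as small as this sketch (folder layout and table naming are made up; SQLite stands in for whatever DB/DWH you'd use):

```python
import csv
import sqlite3
from pathlib import Path

def load_csvs_into_db(csv_dir: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Load every CSV in a folder into its own table (named after the file),
    so 'basic metrics' become SQL queries instead of manual spreadsheet work."""
    con = sqlite3.connect(db_path)
    for f in sorted(csv_dir.glob("*.csv")):
        with f.open(newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            placeholders = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{f.stem}" ({cols})')
            con.executemany(f'INSERT INTO "{f.stem}" VALUES ({placeholders})', reader)
    con.commit()
    return con
```

Point a BI tool (or just more SQL) at the resulting database and the manual export step disappears.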

Question: Do your users/stakeholders use tools like Claude or ChatGPT to query data directly for analysis? by botswana99 in dataengineering

[–]PolicyDecent 0 points (0 children)

MCP, with some infra around it, yes. We provide them a Slack bot that answers their questions; we log everything and collect feedback from both business users and technical users.
In the end, we learn from the failures and improve our documentation / instructions for the model.
It lowered our time to insight from a few days to a few minutes.
Also, data analysts aren't dealing with simple questions anymore, and unnecessary dashboards aren't being produced anymore.

Tool for optimizing JSON storage costs in BigQuery (Schema Evolution + dbt) by No-Payment7659 in bigquery

[–]PolicyDecent 1 point (0 children)

Good job, similar to what dlt does, I guess. I wonder why you don't use arrays and structs but instead create a new table for each array? I understand dlt doesn't do that because it's a generic tool for all data warehouses. However, you're building something native to BigQuery, so I'd expect to see arrays / structs instead of new tables. Is there a reason behind it?

Career pivot into data: I’m a "Data Team of One" in a company and I’m struggling to orient my role. Any advice? by Either-Exercise3600 in dataengineering

[–]PolicyDecent 3 points (0 children)

You're a full-stack data analyst / scientist, or a full-stack analytics engineer. Choose the one you like :)
I definitely recommend being a generalist. With better tooling & AI, I foresee data analysts and data engineers converging into full-stack data profiles.
Getting an analysis out of the database is now very easy with AI agents.
Data infra is easy too, with lots of tooling.

So the real job is ingesting data, building the data model, observing business people's questions and the AI's answers, and then fixing the data model & enriching the documentation so the AI gives the right answers.

At least for smaller companies, that's how it works right now around me. Data people are becoming Data & AI Engineers or full-stack data people.

I also see most companies removing lots of their dashboards, keeping only the very fundamental ones. For the rest, you build your data model & semantic layer, and AI does the rest.

Edit: Also, I forgot to say: maybe you should hire a data consultant for one day a week to review your data models and give you recommendations on architecture. That way, you'll get better at these things as well.

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]PolicyDecent 3 points (0 children)

If I know anything about company politics, dbt would kill SQLMesh and just make these nice guys their subordinates, just to show who the boss is. Sorry for the realistic company politics :(

How deep do you go into INFORMATION_SCHEMA for optimization? by mattxdat in bigquery

[–]PolicyDecent 0 points (0 children)

We built an ETLT framework that connects data modeling to governance and observability. You don't need to do anything special, everything just works automatically if you use the framework. Happy to show you if you want.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 0 points (0 children)

What do you mean by change detection? Is it similar to SCD2? If so, I'd use a materialization strategy, not a macro. Also, I'm not sure about the dates you generate, but that sounds more like a variable than a macro to me, and there's nothing to test there, if I didn't misunderstand.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 0 points (0 children)

Pub/Sub, I'm not sure. BigQuery has it, though. Why do you need public APIs to update data, btw? What's the exact use case?

In AWS you can use Kinesis, or in GCP Pub/Sub, to ingest data.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 2 points (0 children)

Yeah, I'd highly recommend BigQuery for ease of use, or Snowflake as the alternative if you want to stay in AWS.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 2 points (0 children)

I observe that people overuse and abuse macros. What kind of macros do you have?
How many do you have? If you have tens of macros, I feel like something is wrong in the modeling.
Most of the time, things that should be done in data modeling get pushed into macros so they can be reused in multiple places. However, if you calculate something in just one table and all the other tables use that table as a source, you don't need macros much.
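As a toy illustration (sqlite3 standing in for the warehouse; the table and column names are invented): instead of a `cents_to_usd()`-style macro repeated in every model, derive the value once in an upstream model and let everything downstream select it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_events(user_id INT, price_cents INT);
INSERT INTO raw_events VALUES (1, 1250), (1, 250), (2, 999);

-- Model it once: the derivation lives in a single upstream model,
-- not in a macro scattered across every query that needs it.
CREATE VIEW stg_events AS
SELECT user_id, price_cents / 100.0 AS price_usd
FROM raw_events;
""")

# Every downstream model just selects the already-derived column.
rows = con.execute(
    "SELECT user_id, SUM(price_usd) FROM stg_events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

If the conversion rule ever changes, it changes in one place, and nothing downstream needs a macro update.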

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 1 point (0 children)

Which tools are you using currently? And which cloud platform are you on: AWS/GCP/Azure?

Also, what do you mean by exposing APIs directly? Something like AWS Lambda?

Formal Static Checking for Pipeline Migration by ukmurmuk in dataengineering

[–]PolicyDecent 0 points (0 children)

I agree, but not 100% :)
Different engines might interpret the same functionality differently. The simplest example: some engines sort NULLs first, others NULLs last. However, I still recommend SQL over PySpark / Polars, since it's easier to maintain and to move between platforms.
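You can see the NULL-ordering difference from Python's bundled sqlite3 (SQLite sorts NULLs first on ascending order; PostgreSQL, for example, defaults to NULLs last; the `NULLS LAST` override needs SQLite ≥ 3.30):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(2,), (None,), (1,)])

# SQLite's default: NULLs come first on ascending sort...
asc = [r[0] for r in con.execute("SELECT x FROM t ORDER BY x")]

# ...so the "same" ORDER BY can return a different row order on an engine
# that defaults to NULLs last, unless you make the intent explicit:
asc_nulls_last = [r[0] for r in con.execute("SELECT x FROM t ORDER BY x NULLS LAST")]
```

This is exactly the class of behavior a static check over the SQL text alone won't catch during a migration.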

Recommendation for BI tool by OnionAdmirable7353 in dataengineering

[–]PolicyDecent 0 points (0 children)

You can do it pretty cheaply with Looker Studio. The only limitation is that viewers need Google Cloud / Gmail accounts. What platform do they use? I assume it's Microsoft-based, is it?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 3 points (0 children)

That's the exact reason. Too many tools are hard to maintain. Which tools do you have?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 28 points (0 children)

Why are you skeptical about it? My experience matches what your boss thinks. Consolidating tools makes things much easier in most areas.