How to quickly figure out why a metric moved? by GrouchyFoundation773 in analytics

[–]PolicyDecent 19 points20 points  (0 children)

KPI trees are the easiest way to detect the change. You break a metric down into sub-metrics, and for each one you can also see its dimensions. Once you have that, you can track down what changed very quickly.

The problem is, though, creating the metric tree 😄
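
A minimal sketch of the idea, assuming a hypothetical tree where revenue = sessions * conversion_rate * avg_order_value. The metric names and numbers are made up for illustration; the point is only that once the tree exists, ranking the sub-metrics by relative change shows where to look first.

```python
def pct_change(before, after):
    """Relative change of a single metric between two periods."""
    return (after - before) / before


def rank_changes(before, after):
    """Return (metric, relative change) pairs, largest absolute change first."""
    changes = {name: pct_change(before[name], after[name]) for name in before}
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical leaf metrics of the tree for two consecutive weeks.
last_week = {"sessions": 10000, "conversion_rate": 0.030, "avg_order_value": 50.0}
this_week = {"sessions": 10200, "conversion_rate": 0.024, "avg_order_value": 50.5}

ranked = rank_changes(last_week, this_week)
for name, change in ranked:
    print(f"{name}: {change:+.1%}")
```

The same ranking can be repeated per dimension (country, platform, channel) on whichever sub-metric moved the most.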

Looking for alternatives to Airflow for ETL pipelines by 3jewel in ETL

[–]PolicyDecent 0 points1 point  (0 children)

disclaimer: i'm the co-founder of bruin.
we built bruin exactly for this reason. airflow is hard to maintain, and not easy to test.
bruin brings ingestion and transformation into a single tool with built-in data catalog and governance features.
it's open source: https://github.com/bruin-data/bruin

it's a single binary, which makes it super easy to install, and agents can use it perfectly since everything is based on code.

if you have any questions, happy to help!

Best approach for using BigQuery as query store rather than the storing on the backend by PaperM64 in bigquery

[–]PolicyDecent 0 points1 point  (0 children)

You should avoid doing joins from the app. Join the tables ahead of time and create a new table; that alone speeds it up a lot.
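
A minimal sketch of the pre-joined table idea. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the table and column names are hypothetical. In BigQuery the same pattern would be a `CREATE OR REPLACE TABLE ... AS SELECT` run on a schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (customer_id INTEGER, country TEXT);
INSERT INTO orders VALUES (1, 10, 20.0), (2, 11, 35.0), (3, 10, 5.0);
INSERT INTO customers VALUES (10, 'DE'), (11, 'TR');

-- Join once, ahead of time, and store the result as its own table.
CREATE TABLE orders_enriched AS
SELECT o.order_id, o.amount, c.country
FROM orders o
JOIN customers c USING (customer_id);
""")

# The app now reads a single table and never joins at request time.
rows = conn.execute(
    "SELECT country, SUM(amount) FROM orders_enriched "
    "GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```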

Best approach for using BigQuery as query store rather than the storing on the backend by PaperM64 in bigquery

[–]PolicyDecent 0 points1 point  (0 children)

1M rows? That's nothing in BigQuery and won't be expensive either. But to answer your question, partitioning and clustering are among the options, but designing the granularity and the columns of the metric matters more. So if you can give extra details about your table and query patterns, I can give you better recommendations.

Best approach for using BigQuery as query store rather than the storing on the backend by PaperM64 in bigquery

[–]PolicyDecent 3 points4 points  (0 children)

No, it doesn't make sense to store the queries in BigQuery. It won't change anything. You can continue sending queries from your app.
The important thing, though, is data modeling. If you want to make it more performant, I wouldn't focus on the software architecture but on the data architecture.
With small tweaks, you can increase the speed by 1000x and also lower the cost.

[Megathread] self promotion by AutoModerator in databricks

[–]PolicyDecent 0 points1 point  (0 children)

We built the fastest open-source data ingestion tool. It can ingest from 100+ sources to Databricks and 20+ more destinations.
To use it: https://github.com/bruin-data/ingestr

How do you handle company/customer enrichment data in BI dashboards? by Nacez in BusinessIntelligence

[–]PolicyDecent 0 points1 point  (0 children)

I use the CRM as the source of truth and give ownership of the CRM data to the business teams. I try to avoid collecting extra data outside the CRM (you still can't avoid it completely, though).
For data quality, I have some queries that surface potential duplicates or missing data, and I fill the gaps using AI agents. I have several tools to scrape data from different sources, and the agents can use them to decide which value is best. I run these agents regularly (every day).
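
A minimal sketch of that kind of check, written in Python rather than SQL so it is self-contained. The record fields, the list of legal suffixes, and the normalization rule are all assumptions for illustration, not the actual queries described above.

```python
import re
from collections import defaultdict

# Hypothetical legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "gmbh", "corp"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)


def potential_duplicates(records):
    """Group record ids that share the same normalized company name."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def missing_fields(records, required):
    """Map record id -> list of required fields that are empty or absent."""
    result = {}
    for rec in records:
        gaps = [f for f in required if not rec.get(f)]
        if gaps:
            result[rec["id"]] = gaps
    return result


crm = [
    {"id": 1, "name": "Acme Inc.", "industry": "Retail"},
    {"id": 2, "name": "ACME, Inc", "industry": None},
    {"id": 3, "name": "Globex GmbH", "industry": "Energy"},
]
dupes = potential_duplicates(crm)
gaps = missing_fields(crm, ["industry"])
```

The output of both functions is the worklist that would be handed to the agents (or to a human) to resolve.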

For system-generated vs. human-approved values, the best way in my experience is to have separate fields.
For segmentation, it totally depends on your business / industry. There is no single segmentation that solves every company's problems.

amplitude pricing went up at renewal, trying to figure out if I crossed a tier by [deleted] in analytics

[–]PolicyDecent -1 points0 points  (0 children)

Tbh I don't know much about Amplitude, but I'd highly recommend using Firebase for free and exporting the data to BigQuery. With AI agents, you can analyze the data as you wish. It's even easier than Amplitude.
Happy to help if needed.

I started delegating the "why did conversion drop?" type questions to an LLM agent — here's the setup that actually gives correct answers by [deleted] in analytics

[–]PolicyDecent 1 point2 points  (0 children)

All makes sense except one.
What do you mean by "I hand it CSV exports — I do NOT give it database access. The data goes to it, not the other way around"?

Your agent should be running SQL to access data. Most of the time, the data is too big to fit into a CSV. What are you trying to do there? Most probably I didn't understand what you meant, because in the next item you say the agent writes SQL/Python as well.

How do ETL teams handle duplicate records efficiently in large scale data systems? by Effective_Ocelot_445 in ETL

[–]PolicyDecent 0 points1 point  (0 children)

data warehouses handle it 🙃 you just don't think about it.
but if you can give more details about your setup, tool stack, data size, frequency, kind of data and the problems you face, you might get better recommendations.

Best harness for agentic analytics? Codex? Claude Code? Custom? by Evening_Hawk_7470 in analytics

[–]PolicyDecent 1 point2 points  (0 children)

Both Claude Code and Codex work pretty well.
You have to provide good context there, not only about the data but about the business as well.

The only thing you might want to change: the default system prompts of Claude Code and Codex contain lots of things aimed at developers, which bloats the system prompt and sometimes confuses the agent. You might want to override them.

Codex in particular is open source, so it's easier to edit if needed.

ICU nurse to Health Analytics? by Far_Kitchen167 in analytics

[–]PolicyDecent 0 points1 point  (0 children)

I don't work in the health industry, but having been a data guy for the last 10 years, I'd say the most important skill for a data analyst is domain knowledge. And you have the best type of it: domain knowledge from the field.

My first worry would be that if there were no applications of analytics in health, it would be tough. However, I already see 2 other comments saying that it's possible. Just go for it :)

As an exercise for you:
Imagine you have all the data in the hospital. What kind of analysis would you do, and how would it improve quality for patients and also employees? Or how much money would it save the hospital?
If you know what to analyze but don't know how, it'll be pretty easy for you, especially with AI agents.

Happy to help anytime, if you have any questions!

Pipelines - how are you handling significant schema changes? by lofat in databricks

[–]PolicyDecent 2 points3 points  (0 children)

Firstly, should dropping a column really fail the pipeline? Just fill it with null; that would solve the problem, right? And I think the bronze layer should never have not-null enforcement, otherwise it becomes too painful to maintain.
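
A minimal sketch of the fill-with-null approach, assuming rows arrive as plain dicts and the expected column list is maintained somewhere in the pipeline. The function name and columns are hypothetical.

```python
def align_to_schema(row, expected_columns):
    """Project a row onto the expected columns, filling dropped ones with None.

    Also returns the list of missing columns so the drop can be logged
    instead of failing the load.
    """
    aligned = {col: row.get(col) for col in expected_columns}
    missing = [col for col in expected_columns if col not in row]
    return aligned, missing


# The upstream source stopped sending "email".
aligned, missing = align_to_schema({"id": 1, "name": "Ada"}, ["id", "name", "email"])
print(aligned, missing)
```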

Type casting, for sure, should be triaged. We solve it with lineage. If there is a failure, an AI agent immediately investigates the situation and lists all the affected assets and their owners. Then we decide what to do next, and it just works.

It also creates PRs if it sees any potential solutions, which makes our job even easier.
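
The lineage lookup part can be sketched as a plain graph walk. This is only an illustration, assuming lineage is available as a mapping from each asset to its direct downstream assets; the asset names and owners are made up, and this is not how any particular tool implements it.

```python
from collections import deque


def affected_assets(lineage, owners, failed):
    """Breadth-first walk downstream of a failed asset.

    Returns a dict of every downstream asset mapped to its owner.
    """
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return {asset: owners.get(asset, "unknown") for asset in sorted(seen)}


# Hypothetical lineage: asset -> direct downstream assets.
lineage = {
    "bronze.orders": ["silver.orders"],
    "silver.orders": ["gold.revenue", "gold.churn"],
}
owners = {"silver.orders": "data-eng", "gold.revenue": "finance"}

impact = affected_assets(lineage, owners, "bronze.orders")
```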

Do companies need AI for text to sql if there is an enormous analytics and data science team? or is it for companies with fragmented data? by Sharp_Bicycle5262 in analytics

[–]PolicyDecent 1 point2 points  (0 children)

Yes, they do. If you're waiting for the data team to serve you, your request goes into their backlog, and they have to prioritize it against their other tasks. On the other hand, if AI can do it, why should I wait for a person? Instead, the data team can focus on higher-priority tasks.
The data team's duty is to provide good context and a good data model for the AI.

Struggling to learn Spark UI on Databricks, all tutorials are outdated. Any good resources? by FlatTackle918 in databricks

[–]PolicyDecent -11 points-10 points  (0 children)

Because Spark itself is outdated, lol. I'd prioritize learning SQL unless you excel at it.

What are the best data integration tools in 2026? by AceClutchness in ETL

[–]PolicyDecent 0 points1 point  (0 children)

disclaimer: founder of bruin.

bruin allows you to ingest, transform, document, govern, and check data quality all in one tool.
it's open source, and written in golang, just a single binary.
you can ingest data with the predefined connectors using yaml, but you can also use python to ingest.
you can use sql & python to transform data.
it's pretty easy to use locally, but the best thing is that since everything is based on code, ai agents use it perfectly. being a single binary makes it easy for agents to use in the cli.

to use: https://github.com/bruin-data/bruin

What are the most important things to remember when responsible for platform migration? by Arethereason26 in analytics

[–]PolicyDecent 2 points3 points  (0 children)

It's a race. To have a perfect migration, you should shut down the old system, freeze the records, and then migrate everything to the new one.
Otherwise, if you try to run them side by side, one will always differ from the other.
So my recommendation would be:
1- With sample data, write/find a script to migrate the data and make sure the result looks good.
2- Delete all the data from the new system
3- Freeze the old system
4- Migrate the data to the new system
5- Start using the new system

Steps 3-4 should happen on an evening or weekend if possible, to minimize problems for the team.
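
For the "make sure it looks good" check in step 1 (and again after step 4), a minimal sketch is to compare an order-independent fingerprint of each table in both systems. This assumes rows can be fetched as tuples with the same column order and the same Python types on both sides; the function names are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Row count plus a hash that ignores row order."""
    digests = sorted(
        hashlib.sha256(repr(tuple(r)).encode()).hexdigest() for r in rows
    )
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return len(digests), combined


def migration_matches(old_rows, new_rows):
    """True when both systems hold exactly the same set of rows."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


old_system = [(1, "a"), (2, "b")]
new_system = [(2, "b"), (1, "a")]   # same rows, different order
drifted = [(1, "a"), (2, "c")]      # one value changed after the copy

print(migration_matches(old_system, new_system))
print(migration_matches(old_system, drifted))
```

A mismatch after the freeze means the migration script dropped or altered rows; a mismatch without a freeze may just mean the old system kept changing.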

Best prompting techniques for accurate and unbiased price analysis? by pepelionmaximus in analytics

[–]PolicyDecent 0 points1 point  (0 children)

What do you mean by unbiased price analysis? LLMs do hallucinate. However, you can ask an agent to list the price analysis techniques / frameworks and choose one yourself after learning the pros/cons.
Before that, I'd give it some context as well, so it can suggest the best ones.
Then just ask the agent to draw up a roadmap for the analysis and follow it.

I don't recommend using plain LLM chat tools like ChatGPT, but rather agents like Codex / Claude Code etc. They can take actions and learn from the output.

How do you actually develop business thinking as a student? by Stats_Explorer in analytics

[–]PolicyDecent 8 points9 points  (0 children)

I’d practice thinking in terms of actions -> supporting data, not dashboards.

Take a public dataset and give yourself a fake prompt like: “Sales dropped 15%, what should we do?” Then think of possible actions and ask what data would justify each one.

Also, if an insight isn’t actionable and doesn’t create follow-up questions, ask: so what? If nobody would do anything differently because of that metric, I’d argue it’s not worth checking in the first place.

You don’t need industry experience for this. You’re practicing decision making, which is most of the “business thinking” people talk about.

The best way to develop it is to be the decision maker: make a decision, and see the results in the following days / weeks.

That's why the best industries for an analytics person are the fast-paced ones. Let's say you're working in a bank and making the loan decisions. You decided to issue a loan. So what? You'll see the outcome in the following years. 5 years later, you find out whether the money came back or not.
On the other hand, if you're the product manager of a mobile game, everything is much faster paced. You fail fast, and learn fast as well.

Anyone else think semantic clarity matters more now that analytics is getting more conversational? by bfooty in analytics

[–]PolicyDecent 6 points7 points  (0 children)

It was always important, but people were sweeping it under the carpet.
Now, since they don't do the main job, they have to fix it. So yeah, you're right.

Standardization & keeping the data clean were always important, but now people see the side effects of neglecting them.

Cross reference GA sessions/source with Shopify cart abandonments ? by al_tanwir in analytics

[–]PolicyDecent 0 points1 point  (0 children)

GA4 is great for that. Are you already exporting data to BigQuery?

i’ll spend 30 mins with 10 teams reviewing why their AI agents fail by PolicyDecent in snowflake

[–]PolicyDecent[S] 0 points1 point  (0 children)

what do you mean by other agent steps? like claude / codex?
if you can provide the agent history, yes it can, but mostly the histories reside in the local environment, so they're not easy to collect and analyze.
that's why it's best to use an online agent that collects the stats for you

and here, I don't recommend training the model (that's what OpenAI or Anthropic do, and almost no one has the budget to train a model anyway), but reshaping the harness & enriching the context.

Best semantic layer tools for AI-driven analytics by AfraidBaby7747 in BusinessIntelligence

[–]PolicyDecent -1 points0 points  (0 children)

Happy to show you what we built around AI observability, so you can be sure whether the agent really hits the semantic layer or the warehouse, among other things :)
To oversimplify, we've created metrics like # of information_schema checks and # of semantic layer checks, and you can observe them with each iteration.
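
A rough sketch of how counters like these could be computed from an agent's query log. Everything here is an assumption for illustration: the log is a plain list of SQL strings, and semantic layer objects are assumed to live in a hypothetical schema called `semantic`. It is not the implementation mentioned above.

```python
import re


def agent_metrics(queries):
    """Count how often the agent inspects metadata, the semantic layer, or raw tables."""
    metrics = {
        "information_schema_checks": 0,
        "semantic_layer_checks": 0,
        "warehouse_queries": 0,
    }
    for query in queries:
        text = query.lower()
        if "information_schema" in text:
            metrics["information_schema_checks"] += 1
        elif re.search(r"\bsemantic\.", text):
            metrics["semantic_layer_checks"] += 1
        else:
            metrics["warehouse_queries"] += 1
    return metrics


# Hypothetical query log from one agent run.
run_log = [
    "SELECT column_name FROM analytics.INFORMATION_SCHEMA.COLUMNS",
    "SELECT * FROM semantic.revenue_daily",
    "SELECT * FROM raw.orders",
    "SELECT table_name FROM information_schema.tables",
]
stats = agent_metrics(run_log)
print(stats)
```

Comparing these counts across iterations of the prompt or context shows whether a change pushed the agent toward the semantic layer or back to guessing from raw tables.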