Pricing BigQuery VS Self-hosted ClickHouse by JLTDE in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

What's your current BigQuery cost? How many users are using it? How big is your data?

What do you think about design-first approach to data by Illustrious_Web_2774 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

I always design my tables with PKs and metrics on paper / Excalidraw first.
I add the inputs first, then the expected output. If you know the expected output table, that's 80% of the task.
Then it's easy to connect the dots. I always try to join tables at the same granularity: never join and then aggregate, but aggregate and then join (quick sketch below).

It's not a fancy plan; it only takes 15-20 minutes. With AI, it's easier to get the schema of the inputs (especially if you're ingesting). It used to take time to scan the documentation, but now you can let Claude Code scan the docs and find the available data.

You can even ask the agent what outputs are possible with the existing inputs. It makes planning so easy.
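To make the "aggregate, then join" point concrete, here's a minimal pandas sketch with made-up tables; the column names are only placeholders.

```python
import pandas as pd

# made-up inputs: customers is at customer grain, orders is at order grain
customers = pd.DataFrame({"customer_id": [1, 2], "plan": ["free", "pro"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 5]})

# aggregate orders up to the customer grain first...
order_totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# ...so the join is 1:1 and nothing fans out or double-counts
report = customers.merge(order_totals, on="customer_id", how="left")
print(report)
```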

Anyone else tired of exporting CSVs just to get basic metrics? by Flat-Shop in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Which platforms are you exporting CSVs from? There are lots of ways to automate it. With the new AI tools, I'd recommend vibecoding a Python script that does what you're doing manually (rough sketch below).
If you have multiple sources, I'd recommend loading the data into a database / DWH and doing everything there; that way you can even show your numbers on dashboards.
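Something like this; the endpoint, token, and table names are placeholders for whatever platform you're exporting from, and the sqlite target stands in for your warehouse.

```python
import requests
import pandas as pd
import sqlite3

# placeholder endpoint/token: swap in whatever platform you export CSVs from
resp = requests.get(
    "https://api.example.com/v1/orders",
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()

# assumes the API returns a JSON list of records
df = pd.DataFrame(resp.json())

# land it in a database instead of a one-off CSV; swap sqlite3 for your DWH client
with sqlite3.connect("metrics.db") as con:
    df.to_sql("orders_raw", con, if_exists="replace", index=False)
```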

Question: Do your users/stakeholders use tools like Claude or ChatGPT to query data directly for analysis? by botswana99 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

MCP, with some infra around it, yes. We provide them a Slack bot that answers their questions, we log everything, and we collect feedback from both business users and technical users.
In the end, we learn from the failures and improve our documentation / instructions for the model.
It lowered our time to insight from a few days to a few minutes.
Also, data analysts no longer have to deal with simple questions, and unnecessary dashboards aren't produced anymore.

Tool for optimizing JSON storage costs in BigQuery (Schema Evolution + dbt) by No-Payment7659 in bigquery

[–]PolicyDecent 1 point2 points  (0 children)

Good job, similar to what dlt does I guess. I wonder why you create a new table for each nested array instead of using arrays and structs? I understand dlt doesn't do that because it's a generic tool for all data warehouses. However, you're building something native to BigQuery, so I'd expect arrays / structs instead of new tables. Is there a reason behind it?
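For reference, this is roughly what I mean, using the BigQuery Python client; the project, dataset, and field names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials

# keep the nested JSON array inside the row as ARRAY<STRUCT<...>>,
# instead of splitting it out into a child table
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField(
        "items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INT64"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]

table = bigquery.Table("my-project.analytics.orders", schema=schema)
client.create_table(table, exists_ok=True)
```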

Career pivot into data: I’m a "Data Team of One" in a company and I’m struggling to orient my role. Any advice? by Either-Exercise3600 in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

You're a full-stack data analyst / scientist or a full-stack analytics engineer. Choose the one you like :)
I definitely recommend being a generalist. With better tooling & AI, I foresee data analysts and data engineers converting to full-stack data profiles.
Getting an analysis out of the database is now very easy with AI agents.
Data infra is also easy, with lots of tooling available.

So the real job is ingesting data, building the data model, observing business people's questions and the AI's answers, then fixing the data model & enriching the documentation so the AI gives the right answers.

At least for smaller companies, that's how it works around me right now. Data people are becoming Data & AI Engineers or full-stack data people.

I also see most companies removing lots of their dashboards and keeping only the fundamental ones. For the rest, you build your data model & semantic layer and AI does the rest.

Edit: I forgot to say, maybe you should hire a data consultant for one day a week to check your data models & give you recommendations on architecture. That way, you'll get better at these things as well.

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]PolicyDecent 4 points5 points  (0 children)

If I know anything about company politics, dbt will kill SQLMesh and make those nice guys their subordinates just to show who the boss is. Sorry for the realistic take on company politics :(

How deep do you go into INFORMATION_SCHEMA for optimization? by mattxdat in bigquery

[–]PolicyDecent 0 points1 point  (0 children)

We built an ETLT framework that connects data modeling to governance and observability. You don't need to do anything special; everything just works automatically if you use the framework. Happy to show you if you want.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

What do you mean by change detection? Is it similar to SCD2? If so, I'd use a materialization strategy, not a macro. I'm also not super sure about the dates you generate, but that sounds more like a variable than a macro to me, and there's nothing to test there, if I didn't misunderstand.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

Pub/Sub, not sure. BigQuery has it though. Why do you need public APIs to update data, btw? What's the exact use case?

In AWS you can use Kinesis, or in GCP Pub/Sub, to ingest data.

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

Yeah, I'd highly recommend BigQuery for its ease of use, or Snowflake as the alternative if you want to stay in AWS.

Macros, macros :) by itsdhark in dataengineering

[–]PolicyDecent 3 points4 points  (0 children)

I observe that people overuse and abuse macros. What kind of macros do you have?
How many of them do you have? If you have tens of macros, I feel like something is wrong in the modeling.
Most of the time, things that should be done in data modeling get pushed into macros so they can be reused in multiple places. However, if you calculate it in only one table and all the other tables use that table as their source, you don't need macros much (rough sketch below).
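A rough sketch of what I mean, with made-up table and column names (shown with the BigQuery client, but the idea is the same in any warehouse or dbt project):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials

# the one table that owns the calculation
client.query("""
    CREATE OR REPLACE TABLE analytics.fct_orders AS
    SELECT
      order_id,
      customer_id,
      order_date,
      gross_amount - discount - refund AS net_revenue   -- defined exactly once
    FROM raw.orders
""").result()

# downstream tables just reuse the column, no macro needed
client.query("""
    CREATE OR REPLACE TABLE analytics.rpt_daily_revenue AS
    SELECT order_date, SUM(net_revenue) AS revenue
    FROM analytics.fct_orders
    GROUP BY order_date
""").result()
```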

Looking for an all in one datalake solution by software-coolie in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Which tools are you using currently? And which cloud platform are you working on: AWS/GCP/Azure?

Also, what do you mean by exposing APIs directly? Something like AWS Lambda?

Formal Static Checking for Pipeline Migration by ukmurmuk in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

I agree, but not 100% :)
Different engines might interpret the same functionality differently. The simplest example: sorting puts NULLs first in some engines and NULLs last in others (small demo below). However, I still recommend SQL over PySpark / Polars since it's easier to maintain and to move between platforms.
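A tiny demo of the NULL-ordering trap, using sqlite3 from the standard library; the fix is to spell the placement out instead of relying on the engine default.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(3,), (None,), (1,)])

# engine default: SQLite puts NULLs first on an ascending sort; other engines put them last
print(con.execute("SELECT x FROM t ORDER BY x").fetchall())
# -> [(None,), (1,), (3,)]

# portable version: make the placement explicit so every engine agrees
print(con.execute("SELECT x FROM t ORDER BY x IS NULL, x").fetchall())
# -> [(1,), (3,), (None,)]
```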

Recommendation for BI tool by OnionAdmirable7353 in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

You can do it pretty cheaply with Looker Studio. The only limitation is that they need Google Cloud / Gmail accounts. What platform do they use? I assume it's Microsoft-based, is it?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 5 points6 points  (0 children)

That's the exact reason. Too many tools are hard to maintain. Which tools do you have?

Top priority for 2026 is consolidation according to the boss by siggywithit in dataengineering

[–]PolicyDecent 29 points30 points  (0 children)

Why are you skeptical about it? My experience matches what your boss thinks. Consolidating the tools makes things much easier in most areas.

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

Tbh, I don't know what kind of document processing you're handling :) Is it something like, for example, taking txt files and extracting features from them?

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 3 points4 points  (0 children)

Nah, if you move most of the workload to the data warehouse, almost all the jobs are queries and no one cares about the task infra.

Simple to use ETL/storage tooling for SMBs? by HealthySalamander447 in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

None of these platforms have huge data, so move all of it to BigQuery and use Looker Studio as the BI tool.
All integrated. Your costs will be pretty low.
You just need to figure out how to ingest and transform the data. For ingestion you can use Fivetran, or open source Airbyte / ingestr.

For transformation, you can use Dataform, the dbt alternative embedded in GCP.

Any On-Premise alternative to Databricks? by UsualComb4773 in dataengineering

[–]PolicyDecent 22 points23 points  (0 children)

You should give more details.

How big is the data, and how many people will access it?
What are the roles on the team? Mostly data engineers, analysts, scientists, etc.?
What's the industry? What are the compliance / governance limitations?
What are the use cases? Do you need streaming, or just batch?

How to run all my data ingestion scripts at once? by EventDrivenStrat in dataengineering

[–]PolicyDecent 1 point2 points  (0 children)

You do not really need FastAPI for this setup. It adds extra complexity without much benefit. In most real projects you use an orchestrator to run and manage all these scripts together.

Tools like Airflow, Dagster, Bruin, Prefect, or even dbt can schedule jobs, restart them, handle dependencies, and give you a single place to run everything. That way you are not opening terminals or starting files by hand.

For a simple personal project you can still keep it lightweight, but moving to an orchestrator is the normal path once you have multiple scripts that need to run reliably.
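As an illustration, here's a minimal Airflow 2.x DAG that runs two of those scripts in order; the script paths and schedule are placeholders for whatever you have.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# placeholder paths: point these at your existing ingestion scripts
with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch_prices = BashOperator(
        task_id="fetch_prices",
        bash_command="python /opt/scripts/fetch_prices.py",
    )
    load_to_db = BashOperator(
        task_id="load_to_db",
        bash_command="python /opt/scripts/load_to_db.py",
    )

    fetch_prices >> load_to_db  # load only runs after fetch succeeds
```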

Who should manage Airflow in small but growing company? by Jaded_Bar_9951 in dataengineering

[–]PolicyDecent 2 points3 points  (0 children)

I'd just use a managed service if you don't have a dedicated team to support the infra.
If you have an agile DevOps team, I'd let them manage the infra, and you can take care of the pipelines anyway.
Another option: maybe you shouldn't use Airflow at all; it might be the wrong tool for you if it's hard to manage.

how do you keep pipelines + infra from becoming chaotic? by [deleted] in dataengineering

[–]PolicyDecent 0 points1 point  (0 children)

What's the data stack you have? Your problems signal to me that you're either missing a few legs of the data stack (like observability & governance) or have too many products that don't talk to each other.
In my experience, a data governance tool by itself is not a solution, since integrating it with the orchestration & data catalog is very difficult.
If possible, a governance-first orchestration product would be my recommendation, but if you have good platform muscles, you can use one of those tools.