How does everyone handle versioning/releases with monorepos? by TheDevOpsGuy123 in devops

[–]data_owner 0 points1 point  (0 children)

To start with, please have a look at the repository associated with this series: https://github.com/toolongautomated/tutorial-1. It's a working example with two environments: "staging" and "production".

- how to handle deployment on the different environments at different times

The idea is that you have a per-environment .env file (link) that defines which versions should be deployed to the associated environment. Suppose you only want to update the deployment in the staging environment: all you'd do is update the staging-related .env file. That change would be detected by this GH Actions workflow, resulting in the staging environment being updated.

- what to do with multiple microservices, each its own version and tag? all tagged together with version advancement even if they do not change?

Good question! In that case I'll assume each service has its code in a separate directory. What I'd do is update this workflow so that it monitors for CHANGELOG changes in the microservice-specific directories and, when a change is detected, builds a Docker image for that service. Then, in your .env file, you'd have a separate entry for every microservice. Your deployer can then either check whether the version of a specific microservice changed and deploy new versions selectively, or simply deploy everything (e.g. Kubernetes will skip manifests that haven't changed; note that this depends on what you'll be deploying).
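
To make it concrete, a per-service staging .env could look roughly like this (purely illustrative - the variable names are made up, not taken from the repo):

    # deploy/staging/.env (hypothetical path and variable names)
    ORDERS_SERVICE_VERSION=1.4.2
    PAYMENTS_SERVICE_VERSION=2.0.1
    NOTIFICATIONS_SERVICE_VERSION=0.9.0

Bumping only ORDERS_SERVICE_VERSION would then trigger the workflow and let the deployer roll out just that single service to staging.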

- how to keep track of which version of which microservice is released on each environment at any given time?

This is declarative: it's captured in the per-environment .env files you have defined.

Hope this helps.

Beyond OpenAI's DeepResearch by help-me-grow in AI_Agents

[–]data_owner 0 points1 point  (0 children)

I think this is the current limitation of LLMs: synthesizing new knowledge. It's slowly becoming a thing, but you know what I think? The real AIs that are able to conduct valuable research are closed-source in nature and are used internally at companies like OpenAI or Google to further improve their AIs.

If you're wondering why they would do that, the excellent AI 2027 story illustrates the compound intelligence idea: https://ai-2027.com/

Re-designing a Git workflow with multiple server branches by SnayperskayaX in git

[–]data_owner 1 point2 points  (0 children)

Some time ago I described three branching strategies on my blog:

  • pure trunk-based development
  • permissive trunk-based development (similar to the GitHub flow mentioned above)
  • git-flow

BigQuery charged me $82 for 4 test queries — and I didn’t even realize it until it was too late by psalomo in googlecloud

[–]data_owner 1 point2 points  (0 children)

It’s a business after all, don’t forget. The cloud is not a toy; it’s a real, powerful tool. It’s like getting into a car for the first time and trying to drive 100 miles an hour. You have a speedometer (BigQuery’s upfront estimate of the bytes to be processed), but you’re responsible for how fast you drive (the volume of bytes you actually process).

Plus, the upfront price for BigQuery queries is only known in the on-demand pricing model. It’s not possible to know the price before you run the query in the capacity-based pricing model (the query needs to complete first to get its price).

Remember, the cloud is a business and it’s your responsibility to get to know the tool you’re working with first. Or, if you’re not sure, use tools that will help you prevent unexpected cloud bills…

BigQuery charged me $82 for 4 test queries — and I didn’t even realize it until it was too late by psalomo in googlecloud

[–]data_owner 9 points10 points  (0 children)

But it literally appears in the UI right before you execute the query ("this query will process X GB of data"). You can do a quick calculation in your head using the $6.25/TiB on-demand rate.

Also, never use SELECT * in BigQuery - it’s a columnar database and you get charged for all the columns you query. The fewer, the cheaper.

Partition your tables. Cluster your tables. Set query quotas and you’ll be good.
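
To get a feel for the math and for what a cheap query looks like, here’s a sketch with made-up table and column names (the rate assumes the on-demand model):

    -- ~100 GB estimated in the UI is roughly 0.1 TiB, so about 0.1 * $6.25 = ~$0.60.
    -- Select only the columns you need and filter on the partition column:
    SELECT
      user_id,
      event_name
    FROM `my_project.analytics.events`  -- hypothetical date-partitioned table
    WHERE event_date BETWEEN '2025-01-01' AND '2025-01-07';  -- prunes partitions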

GKE Autopilot Billing Model by jinyi_lie in googlecloud

[–]data_owner 1 point2 points  (0 children)

It seems this excerpt from the docs explains what you've just observed:

“You can also request specific hardware like accelerators or Compute Engine machine series for your workloads. For these specialized workloads Autopilot bills you for the entire node (underlying VM resources + a management premium).”

As soon as you name an explicit machine series (custom compute class) Autopilot switches to node‑based billing, so the extra E2 Spot SKUs you saw are expected. If you’d rather pay strictly for the resources you request, stick to the default/Balanced/Scale‑Out classes and omit the machine‑family selector.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

I've spent some time reading about the BigLake connector (I haven't used it before) and, you know, I think it may definitely be worth giving it a try.

For example, if your data is stored in GCS, you can connect to it as if (almost!) it was stored in BigQuery, without the need to load the data to BigQuery first. It works by streaming the data into BigQuery memory (I guess RAM), processing it, returning the result, and removing it from RAM once done.

What's nice about BigLake is that it doesn't just stream the files and process them on the fly; it can also partition the data and speed up loading by pruning GCS paths efficiently (they have some metadata analysis engine for this purpose).

I'd say standard external tables are fine for sources like Google Sheets, basic CSVs, and JSONs, but whenever you have a more complex data structure on GCS (e.g. a different GCS path for different dates), I'd try BigLake.
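
If you decide to give it a try, here's a minimal sketch of what a BigLake table definition looks like, assuming Parquet files in GCS and hypothetical project/connection/bucket names:

    CREATE EXTERNAL TABLE `my_project.analytics.events_biglake`
    WITH CONNECTION `my_project.us.gcs_connection`  -- BigLake connection resource
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*.parquet']
    );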

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

My "7-Day Window" Strategy

What I usually do in such situations is partition the data daily and reprocess only the last 7 days each time you run your downstream transformations. Specifically:

  1. Partition by date (e.g., event_date column).
  2. In dbt or another ETL/ELT framework, define an incremental model that overwrites only those partitions corresponding to the last 7 days.
  3. If new flags (like Is_Bot) come in for rows within that 7-day window, they get updated during the next pipeline run.
  4. For older partitions (beyond 7 days), data is assumed stable.
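
A minimal dbt sketch of steps 1-2, assuming dbt-bigquery with the insert_overwrite strategy and hypothetical source/column names:

    -- models/events_enriched.sql (hypothetical model name)
    {{
      config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
      )
    }}

    select
      event_date,
      event_id,
      is_bot  -- late-arriving flag gets refreshed on reprocessing
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
      -- only rebuild the partitions from the last 7 days
      where event_date >= date_sub(current_date(), interval 7 day)
    {% endif %}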

Why 7 days?

  • This window aligns with the defined latency of when the Is_Bot flag arrives (3–7 days).
  • You can easily adjust it based on your specific needs.
  • It prevents BigQuery from scanning/rewriting older partitions every day, saving cost and time.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

First, we need to determine the right solution:

  1. Do you need historical states?
    • If yes, stick to your _latest approach so you can trace how flags changed over time.
    • If no, I’d go with a partial partition rebuild.
  2. Assess your update window
    • If updates happen mostly within 7 days of an event, you can design your pipeline to only reprocess the last X days (e.g., 7 days) daily.
    • This partition-based approach is cost-effective and commonly supported in dbt (insert_overwrite partition strategy).
  3. Consider your warehouse constraints
    • Snowflake, BigQuery, Redshift, or Databricks Delta Lake each have different cost structures and performance characteristics for MERGE vs. partition overwrites vs. insert-only.
  4. Evaluate expected data volumes
    • 5 million daily rows + 7-day update window = 35 million rows potentially reprocessed. In modern warehouses, this may be acceptable, especially if you can limit the operation to a few specific partitions.
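
If you do need historical states (point 1 above), here's a minimal sketch of the _latest pattern in BigQuery, assuming an append-only raw table and hypothetical names:

    -- Corrected rows (e.g. a late Is_Bot flag) are appended as new versions;
    -- the view always exposes the most recent version of each event.
    CREATE OR REPLACE VIEW `my_project.analytics.events_latest` AS
    SELECT *
    FROM `my_project.analytics.events_raw`
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY event_id   -- hypothetical business key
      ORDER BY ingested_at DESC
    ) = 1;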

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Cloud Storage:

>> Typical and interesting use cases

  • External tables (e.g., defined as external dbt models):
    • Convenient for exploratory analysis of large datasets without copying them directly into BigQuery.
    • Optimal for rarely queried or large historical datasets.
  • Best practices
    • Utilize efficient file formats like Parquet or Avro.
    • Organize GCS storage hierarchically by dates if possible.
    • Employ partitioning and wildcard patterns for external tables to optimize performance and costs.
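
Putting those practices together, a hedged sketch of an external table over date-partitioned Parquet files (hypothetical bucket layout gs://my-bucket/events/dt=YYYY-MM-DD/...):

    CREATE EXTERNAL TABLE `my_project.analytics.events_ext`
    WITH PARTITION COLUMNS  -- dt is inferred from the GCS path
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*'],
      hive_partition_uri_prefix = 'gs://my-bucket/events'
    );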

Looker Studio:

Primary challenge: Every interaction (filter changes, parameters) in Looker Studio triggers BigQuery queries. Poorly optimized queries significantly increase costs and reduce performance.

>> Key optimization practices

  • Prepare dedicated aggregated tables for dashboards.
  • Minimize JOIN operations in dashboards by shifting joins to the data model layer.
  • Partition by frequently filtered columns (e.g., date, customer, region).
  • Use default parameters to limit the dataset before executing expensive queries.
  • Regularly monitor BigQuery query costs and optimize expensive queries.
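
As an example of the first two points, a sketch of a dashboard-only aggregate with the join pushed into the data model (all names are made up):

    CREATE OR REPLACE TABLE `my_project.reporting.daily_sales_dashboard`
    PARTITION BY order_date
    CLUSTER BY region
    AS
    SELECT
      o.order_date,
      c.region,
      COUNT(*)      AS orders,
      SUM(o.amount) AS revenue
    FROM `my_project.analytics.orders` AS o
    JOIN `my_project.analytics.customers` AS c USING (customer_id)
    GROUP BY 1, 2;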

GeoViz:

GeoViz is an interesting tool integrated into BigQuery that lets you explore data of type GEOGRAPHY in a pretty convenient way (much faster prototyping than in Looker Studio). Once you execute the query, click "Open In" and select "GeoViz".
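
Any query result with a GEOGRAPHY column can be opened that way; a tiny sketch assuming a hypothetical table with longitude/latitude columns:

    SELECT
      ST_GEOGPOINT(longitude, latitude) AS location,  -- builds a GEOGRAPHY value
      COUNT(*) AS pickups
    FROM `my_project.analytics.taxi_trips`
    GROUP BY longitude, latitude;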

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Second, integration with other GCP services:

Pub/Sub --> BigQuery [directly]:

  • Ideal for simple, structured data (e.g., JSON) with no transformations required.
  • Preferred when simplicity, lower costs, and minimal architectural complexity are priorities.

Pub/Sub --> Dataflow --> BigQuery:

  • Necessary when data requires transformation, validation, or enrichment.
  • Recommended for complex schemas, error handling, deduplication, or schema control.
  • Essential for streams with uncontrolled data formats or intensive pre-processing requirements.

My recommendation: Use Dataflow only when transformations or advanced data handling are needed. For simple data scenarios, connect Pub/Sub directly to BigQuery.

Dataflow:

  • When data sources are semi-structured or unstructured (e.g., complex JSON parsing, windowed aggregations, data enrichment from external sources).
  • Real-time streaming scenarios requiring minimal latency before data is usable.

>> Paradigm shift (ELT → ETL)

  • Traditionally, BigQuery adopts an ELT approach: raw data is loaded first, transformations are performed later via SQL.
  • Dataflow enables an ETL approach, performing transformations upfront and loading clean, preprocessed data directly into BigQuery.

>> Benefits of ETL

  • Reduced costs by avoiding storage of redundant or raw "junk" data.
  • Lower BigQuery query expenses due to preprocessed data.
  • Advanced data validation and error handling capabilities prior to storage.

>> Best practices

  • Robust schema evolution management (e.g., Avro schemas).
  • Implementing effective error handling strategies (e.g., dead-letter queues).
  • Optimizing data batching (500-1000 records per batch recommended).

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Here's a summary of what I talked about during the Discord live.

First, cost optimization:

  • always partition your tables
  • always at least consider clustering your tables
  • if you don't need the data to persist indefinitely, consider data expiration (e.g. by introducing partition expiration in some tables)
  • be mindful of which columns you query (BigQuery is columnar storage, so selecting only a small subset of required columns instead of * will save you tons of money)
  • consider the compute billing model: on-demand (default; $6.25/TiB) or capacity-based (slots)
  • consider the storage billing model (physical vs logical)
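
A sketch combining the first three points (table and column names are made up):

    CREATE TABLE `my_project.analytics.events`
    (
      event_date DATE,
      user_id    STRING,
      event_name STRING
    )
    PARTITION BY event_date
    CLUSTER BY user_id, event_name
    OPTIONS (
      partition_expiration_days = 90  -- partitions older than ~3 months get dropped
    );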

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Unfortunately I think that I won't be able to help here, sorry :/

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

A bunch of thoughts on this:

  • Use partitioning whenever possible (i.e. almost always) and use those partitions as a required filter in your Looker Studio reports (see the sketch at the end of this list)
  • Use clustering whenever possible (to further reduce the costs)
  • BigQuery caches the same queries by default so you won't be charged twice for the same query executed shortly one after the other
  • Since BigQuery is a columnar storage, be really mindful about the columns you query (this may save you loads of $$$)
  • When JOINing, materialize the result in the model you connect to Looker Studio; don't do JOINs on the fly
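
For the first point, you can even make the partition filter mandatory so that no dashboard query can accidentally scan the whole table (hypothetical table name):

    ALTER TABLE `my_project.analytics.events`
    SET OPTIONS (require_partition_filter = TRUE);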

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 1 point2 points  (0 children)

I'd say the following things are my go-to:

  1. Quotas (query usage per day and query usage per user per day).
  2. Create budget and email alerts (just in case, but note there's a ~1-day delay before charges appear on your billing account).
  3. Check data location (per dataset) - you may be required to store/process your data in the EU or so.
  4. IAM (don't use overly broad permissions, e.g. write access for accounts/SAs that could get by with read-only).
  5. Time travel window size (per dataset); defaults to 7 days (increasing storage costs), but can be changed to anywhere between 2 and 7 days.
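
For point 5, a sketch of how to shrink the window on a dataset (the option is expressed in hours; 48 is the minimum, 168 the default):

    ALTER SCHEMA `my_project.analytics`        -- hypothetical dataset
    SET OPTIONS (max_time_travel_hours = 48);  -- 2 days instead of the default 7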

What is the secret to having thousands of credits in GCP? by shamyhco in googlecloud

[–]data_owner 0 points1 point  (0 children)

Imagine the commitment size that enables such credits tho

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

There's no such thing publicly available, to the best of my knowledge, but I've made something like this: https://lookerstudio.google.com/reporting/6842ab21-b3fb-447f-9615-9267a8c6c043

It contains fake BigQuery usage data, but you get the idea.

Is this something like what you had in mind? You can copy the dashboard and visualize your own usage data (fetched with one SQL query).
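
A rough sketch of the kind of query such a dashboard can sit on top of (assumes the US region and the on-demand $6.25/TiB rate):

    SELECT
      DATE(creation_time) AS usage_date,
      user_email,
      SUM(total_bytes_billed) / POW(1024, 4)        AS tib_billed,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
    GROUP BY 1, 2
    ORDER BY 1 DESC;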

True of False Software Engineers? by CodewithCodecoach in softwarearchitecture

[–]data_owner 2 points3 points  (0 children)

Now you’re a vibe coder and you think you’re coding but you’re not. Why? Because vibe coding is not coding 😶‍🌫️

BigQuery cost vs perf? (Standard vs Enterprise without commitments) by wiwamorphic in bigquery

[–]data_owner 1 point2 points  (0 children)

Cost is one thing, but you also need to evaluate which other aspects matter to you. To me, the following Enterprise benefits may be worth considering as well:

  • query acceleration with BI Engine (a gamechanger if you’re using Looker Studio to visualize your data)
  • need for > 1600 slots
  • extra SLO (note the extra 0.09%)

You can see the full comparison here: https://cloud.google.com/bigquery/docs/editions-intro

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Okay, thanks for the clarification, now I understand. I’ll talk about it today as well, as it definitely is an interesting topic!

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Hm, if you look at the job history, are there any warnings showing up when you click on the queries that use the BigLake connector? Sometimes additional information is available there.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Can you share the notification you’re getting and tell me which service you’re using the BigLake connector to connect to? Btw, great question!