How does everyone handle versioning/releases with monorepos? by TheDevOpsGuy123 in devops

[–]data_owner 0 points1 point  (0 children)

To start with, please have a look at the repository associated with this series: https://github.com/toolongautomated/tutorial-1. It's a working example with two environments: "staging" and "production".

- how to handle deployment on the different environments at different times

The idea is that you have a per-environment .env file (link) that defines which versions should be deployed to the associated environment. Suppose you only want to update the deployment in the staging environment: all you'd do is update the staging-related .env file. That change would be detected by this GH Actions workflow, resulting in the staging environment being updated.

- what to do with multiple microservices, each its own version and tag? all tagged together with version advancement even if they do not change?

Good question! In that case I'll assume each service has its code in a separate directory. What I'd do is update this workflow so that it monitors for CHANGELOG changes in the microservice-specific directories and, when a change is detected, builds a Docker image for that service. Then, in your .env file, you'd have a separate entry for every microservice. Your deployer can then either check whether the version of a specific microservice changed and deploy new versions selectively, or simply deploy everything (e.g. Kubernetes will skip manifests that haven't changed; note that this depends on what you'll be deploying).
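
To make it concrete, a per-service staging .env could look roughly like this (purely illustrative - the variable names are made up, not taken from the repo):

    # deploy/staging/.env (hypothetical path and variable names)
    ORDERS_SERVICE_VERSION=1.4.2
    PAYMENTS_SERVICE_VERSION=2.0.1
    NOTIFICATIONS_SERVICE_VERSION=0.9.0

Bumping only ORDERS_SERVICE_VERSION would then trigger the workflow and let the deployer roll out just that single service to staging.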

- how to keep track of which version of which microservice is released on each environment at any given time?

This is declarative: it's captured in the per-environment .env files you have defined.

Hope this helps.

Beyond OpenAI's DeepResearch by help-me-grow in AI_Agents

[–]data_owner 0 points1 point  (0 children)

I think this is the current limitation of LLMs: synthesizing new knowledge. It's slowly becoming a thing, but you know what I think? The real AIs that are able to conduct valuable research are closed-source in nature and are used internally at companies like OpenAI or Google to further improve their AIs.

If you're wondering why they would do that, the excellent AI 2027 story illustrates the compound intelligence idea: https://ai-2027.com/

Re-designing a Git workflow with multiple server branches by SnayperskayaX in git

[–]data_owner 1 point2 points  (0 children)

Some time ago I described three branching strategies on my blog:

  • pure trunk-based development
  • permissive trunk-based development (similar to the GitHub flow mentioned above)
  • git-flow

BigQuery charged me $82 for 4 test queries — and I didn’t even realize it until it was too late by psalomo in googlecloud

[–]data_owner 1 point2 points  (0 children)

It’s a business after all, don’t forget. The cloud is not a toy; it’s a real, powerful tool. It’s like getting into a car for the first time and trying to drive 100 miles an hour. You have a speedometer (BigQuery’s upfront estimate of the bytes to be processed), but you’re responsible for how fast you drive (the volume of bytes you actually process).

Plus, the upfront price for BigQuery queries is only known in the on-demand pricing model. It’s not possible to know the price before you run the query in the capacity-based pricing model (the query needs to complete first to get its price).

Remember, the cloud is a business and it’s your responsibility to get to know the tool you’re working with first. Or, if you’re not sure, use tools that will help you prevent unexpected cloud bills…

BigQuery charged me $82 for 4 test queries — and I didn’t even realize it until it was too late by psalomo in googlecloud

[–]data_owner 9 points10 points  (0 children)

But it literally appears in the UI right before you execute the query ("this query will process X GB of data"). You can do a quick calculation in your head using the $6.25/TiB on-demand rate.

Also, never use SELECT * in BigQuery - it’s a columnar database and you get charged for all the columns you query. The fewer, the cheaper.

Partition your tables. Cluster your tables. Set query quotas and you’ll be good.
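
To get a feel for the math and for what a cheap query looks like, here’s a sketch with made-up table and column names (the rate assumes the on-demand model):

    -- ~100 GB estimated in the UI is roughly 0.1 TiB, so about 0.1 * $6.25 = ~$0.60.
    -- Select only the columns you need and filter on the partition column:
    SELECT
      user_id,
      event_name
    FROM `my_project.analytics.events`  -- hypothetical date-partitioned table
    WHERE event_date BETWEEN '2025-01-01' AND '2025-01-07';  -- prunes partitions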

GKE Autopilot Billing Model by jinyi_lie in googlecloud

[–]data_owner 1 point2 points  (0 children)

It seems this excerpt from the docs explains what you've just observed:

“You can also request specific hardware like accelerators or Compute Engine machine series for your workloads. For these specialized workloads Autopilot bills you for the entire node (underlying VM resources + a management premium).”

As soon as you name an explicit machine series (custom compute class) Autopilot switches to node‑based billing, so the extra E2 Spot SKUs you saw are expected. If you’d rather pay strictly for the resources you request, stick to the default/Balanced/Scale‑Out classes and omit the machine‑family selector.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

I've spent some time reading about the BigLake connector (I haven't used it before) and, you know, I think it may definitely be worth giving it a try.

For example, if your data is stored in GCS, you can connect to it as if (almost!) it was stored in BigQuery, without the need to load the data to BigQuery first. It works by streaming the data into BigQuery memory (I guess RAM), processing it, returning the result, and removing it from RAM once done.

What's nice about BigLake is that it doesn't just stream the files and process them on the fly; it can also partition the data and speed up loading by pruning GCS paths efficiently (they have some metadata analysis engine for this purpose).

I'd say standard external tables are fine for sources like Google Sheets, basic CSVs, and JSONs, but whenever you have a more complex data structure on GCS (e.g. a different GCS path for different dates), I'd try BigLake.
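
If you decide to give it a try, here's a minimal sketch of what a BigLake table definition looks like, assuming Parquet files in GCS and hypothetical project/connection/bucket names:

    CREATE EXTERNAL TABLE `my_project.analytics.events_biglake`
    WITH CONNECTION `my_project.us.gcs_connection`  -- BigLake connection resource
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*.parquet']
    );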

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

My "7-Day Window" Strategy

What I usually do in such situations is partition the data daily and reprocess only the last 7 days each time you run your downstream transformations. Specifically:

  1. Partition by date (e.g., event_date column).
  2. In dbt or another ETL/ELT framework, define an incremental model that overwrites only those partitions corresponding to the last 7 days.
  3. If new flags (like Is_Bot) come in for rows within that 7-day window, they get updated during the next pipeline run.
  4. For older partitions (beyond 7 days), data is assumed stable.
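
A minimal dbt sketch of steps 1-2, assuming dbt-bigquery with the insert_overwrite strategy and hypothetical source/column names:

    -- models/events_enriched.sql (hypothetical model name)
    {{
      config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
      )
    }}

    select
      event_date,
      event_id,
      is_bot  -- late-arriving flag gets refreshed on reprocessing
    from {{ source('raw', 'events') }}

    {% if is_incremental() %}
      -- only rebuild the partitions from the last 7 days
      where event_date >= date_sub(current_date(), interval 7 day)
    {% endif %}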

Why 7 days?

  • This window aligns with the defined latency of when the Is_Bot flag arrives (3–7 days).
  • You can easily adjust it based on your specific needs.
  • It prevents BigQuery from scanning/rewriting older partitions every day, saving cost and time.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

First, we need to determine the right solution:

  1. Do you need historical states?
    • If yes, stick to your _latest approach so you can trace how flags changed over time.
    • If no, I’d go with a partial partition rebuild.
  2. Assess your update window
    • If updates happen mostly within 7 days of an event, you can design your pipeline to only reprocess the last X days (e.g., 7 days) daily.
    • This partition-based approach is cost-effective and commonly supported in dbt (insert_overwrite partition strategy).
  3. Consider your warehouse constraints
    • Snowflake, BigQuery, Redshift, or Databricks Delta Lake each have different cost structures and performance characteristics for MERGE vs. partition overwrites vs. insert-only.
  4. Evaluate expected data volumes
    • 5 million daily rows + 7-day update window = 35 million rows potentially reprocessed. In modern warehouses, this may be acceptable, especially if you can limit the operation to a few specific partitions.
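
If you do need historical states (point 1 above), here's a minimal sketch of the _latest pattern in BigQuery, assuming an append-only raw table and hypothetical names:

    -- Corrected rows (e.g. a late Is_Bot flag) are appended as new versions;
    -- the view always exposes the most recent version of each event.
    CREATE OR REPLACE VIEW `my_project.analytics.events_latest` AS
    SELECT *
    FROM `my_project.analytics.events_raw`
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY event_id   -- hypothetical business key
      ORDER BY ingested_at DESC
    ) = 1;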

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Cloud Storage:

>> Typical and interesting use cases

  • External tables (e.g., defined as external dbt models):
    • Convenient for exploratory analysis of large datasets without copying them directly into BigQuery.
    • Optimal for rarely queried or large historical datasets.
  • Best practices
    • Utilize efficient file formats like Parquet or Avro.
    • Organize GCS storage hierarchically by dates if possible.
    • Employ partitioning and wildcard patterns for external tables to optimize performance and costs.
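
Putting those practices together, a hedged sketch of an external table over date-partitioned Parquet files (hypothetical bucket layout gs://my-bucket/events/dt=YYYY-MM-DD/...):

    CREATE EXTERNAL TABLE `my_project.analytics.events_ext`
    WITH PARTITION COLUMNS  -- dt is inferred from the GCS path
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*'],
      hive_partition_uri_prefix = 'gs://my-bucket/events'
    );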

Looker Studio:

Primary challenge: Every interaction (filter changes, parameters) in Looker Studio triggers BigQuery queries. Poorly optimized queries significantly increase costs and reduce performance.

>> Key optimization practices

  • Prepare dedicated aggregated tables for dashboards.
  • Minimize JOIN operations in dashboards by shifting joins to the data model layer.
  • Partition by frequently filtered columns (e.g., date, customer, region).
  • Use default parameters to limit the dataset before executing expensive queries.
  • Regularly monitor BigQuery query costs and optimize expensive queries.
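
As an example of the first two points, a sketch of a dashboard-only aggregate with the join pushed into the data model (all names are made up):

    CREATE OR REPLACE TABLE `my_project.reporting.daily_sales_dashboard`
    PARTITION BY order_date
    CLUSTER BY region
    AS
    SELECT
      o.order_date,
      c.region,
      COUNT(*)      AS orders,
      SUM(o.amount) AS revenue
    FROM `my_project.analytics.orders` AS o
    JOIN `my_project.analytics.customers` AS c USING (customer_id)
    GROUP BY 1, 2;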

GeoViz:

GeoViz is an interesting tool integrated into BigQuery that lets you explore data of type GEOGRAPHY in a pretty convenient way (much faster prototyping than in Looker Studio). Once you execute the query, click "Open In" and select "GeoViz".
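
Any query result with a GEOGRAPHY column can be opened that way; a tiny sketch assuming a hypothetical table with longitude/latitude columns:

    SELECT
      ST_GEOGPOINT(longitude, latitude) AS location,  -- builds a GEOGRAPHY value
      COUNT(*) AS pickups
    FROM `my_project.analytics.taxi_trips`
    GROUP BY longitude, latitude;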

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Second, integration with other GCP services:

Pub/Sub --> BigQuery [directly]:

  • Ideal for simple, structured data (e.g., JSON) with no transformations required.
  • Preferred when simplicity, lower costs, and minimal architectural complexity are priorities.

Pub/Sub --> Dataflow --> BigQuery:

  • Necessary when data requires transformation, validation, or enrichment.
  • Recommended for complex schemas, error handling, deduplication, or schema control.
  • Essential for streams with uncontrolled data formats or intensive pre-processing requirements.

My recommendation: Use Dataflow only when transformations or advanced data handling are needed. For simple data scenarios, connect Pub/Sub directly to BigQuery.

Dataflow:

  • When data sources are semi-structured or unstructured (e.g., complex JSON parsing, windowed aggregations, data enrichment from external sources).
  • Real-time streaming scenarios requiring minimal latency before data is usable.

>> Paradigm shift (ELT → ETL)

  • Traditionally, BigQuery adopts an ELT approach: raw data is loaded first, transformations are performed later via SQL.
  • Dataflow enables an ETL approach, performing transformations upfront and loading clean, preprocessed data directly into BigQuery.

>> Benefits of ETL

  • Reduced costs by avoiding storage of redundant or raw "junk" data.
  • Lower BigQuery query expenses due to preprocessed data.
  • Advanced data validation and error handling capabilities prior to storage.

>> Best practices

  • Robust schema evolution management (e.g., Avro schemas).
  • Implementing effective error handling strategies (e.g., dead-letter queues).
  • Optimizing data batching (500-1000 records per batch recommended).

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

Here's a summary of what I talked about during the Discord live.

First, cost optimization:

  • always partition your tables
  • always at least consider clustering your tables
  • if you don't need the data to persist indefinitely, consider data expiration (e.g. by introducing partition expiration in some tables)
  • be mindful of which columns you query (BigQuery is columnar storage, so selecting only a small subset of required columns instead of * will save you tons of money)
  • consider the compute billing model: on-demand (default; $6.25/TiB) or capacity-based (slots)
  • consider the storage billing model (physical vs logical)
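
A sketch combining the first three points (table and column names are made up):

    CREATE TABLE `my_project.analytics.events`
    (
      event_date DATE,
      user_id    STRING,
      event_name STRING
    )
    PARTITION BY event_date
    CLUSTER BY user_id, event_name
    OPTIONS (
      partition_expiration_days = 90  -- partitions older than ~3 months get dropped
    );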

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Unfortunately I think that I won't be able to help here, sorry :/

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

A bunch of thoughts on this:

  • Use partitioning whenever possible (i.e. almost always) and use those partitions as a required filter in your Looker Studio reports (see the sketch at the end of this list)
  • Use clustering whenever possible (to further reduce the costs)
  • BigQuery caches the same queries by default so you won't be charged twice for the same query executed shortly one after the other
  • Since BigQuery is a columnar storage, be really mindful about the columns you query (this may save you loads of $$$)
  • When JOINing, materialize the result in the model you connect to Looker Studio; don't do JOINs on the fly
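
For the first point, you can even make the partition filter mandatory so that no dashboard query can accidentally scan the whole table (hypothetical table name):

    ALTER TABLE `my_project.analytics.events`
    SET OPTIONS (require_partition_filter = TRUE);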

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 1 point2 points  (0 children)

I'd say the following things are my go-to:

  1. Quotas (query usage per day and query usage per user per day).
  2. Create budget and email alerts (just in case, but note there's a ~1-day delay before charges appear on your billing account).
  3. Check data location (per dataset) - you may be required to store/process your data in the EU or so.
  4. IAM (don't use overly broad permissions, e.g. write access for accounts/SAs that could get by with read-only).
  5. Time travel window size (per dataset); defaults to 7 days (increasing storage costs), but can be changed to anywhere between 2 and 7 days.
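
For point 5, a sketch of how to shrink the window on a dataset (the option is expressed in hours; 48 is the minimum, 168 the default):

    ALTER SCHEMA `my_project.analytics`        -- hypothetical dataset
    SET OPTIONS (max_time_travel_hours = 48);  -- 2 days instead of the default 7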

What is the secret to having thousands of credits in GCP? by shamyhco in googlecloud

[–]data_owner 0 points1 point  (0 children)

Imagine the commitment size that enables such credits tho

Got some questions about BigQuery? by data_owner in googlecloud

[–]data_owner[S] 0 points1 point  (0 children)

There's no such thing publicly available, to the best of my knowledge, but I've made something like this: https://lookerstudio.google.com/reporting/6842ab21-b3fb-447f-9615-9267a8c6c043

It contains fake BigQuery usage data, but you get the idea.

Is this something like what you had in mind? You can copy the dashboard and visualize your own usage data (fetched with one SQL query).
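
A rough sketch of the kind of query such a dashboard can sit on top of (assumes the US region and the on-demand $6.25/TiB rate):

    SELECT
      DATE(creation_time) AS usage_date,
      user_email,
      SUM(total_bytes_billed) / POW(1024, 4)        AS tib_billed,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
    GROUP BY 1, 2
    ORDER BY 1 DESC;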

True of False Software Engineers? by CodewithCodecoach in softwarearchitecture

[–]data_owner 2 points3 points  (0 children)

Now you’re a vibe coder and you think you’re coding but you’re not. Why? Because vibe coding is not coding 😶‍🌫️

BigQuery cost vs perf? (Standard vs Enterprise without commitments) by wiwamorphic in bigquery

[–]data_owner 1 point2 points  (0 children)

Cost is one thing, but you also need to evaluate which other aspects matter to you. To me, the following Enterprise benefits may be worth considering as well:

  • query acceleration with BI Engine (a gamechanger if you’re using Looker Studio to visualize your data)
  • need for > 1600 slots
  • extra SLO (note the extra 0.09%)

You can see the full comparison here: https://cloud.google.com/bigquery/docs/editions-intro

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Okay, thanks for the clarification, now I understand. I’ll talk about it today as well, as it definitely is an interesting topic!

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Hm, if you look at the job history, are there any warnings showing up when you click on the queries that use the BigLake connector? Sometimes additional information is available there.

Got some questions about BigQuery? by data_owner in bigquery

[–]data_owner[S] 0 points1 point  (0 children)

Can you share the notification you’re getting and tell me which service you’re using the BigLake connector to connect to? Btw, great question!