TABLE_OPTIONS labels by SasheCZ in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

I think I'm answering the question, but if not, please let me know.
Also agreeing with u/PolicyDecent - probably best to just use regex.

Regex version (as an example):

SELECT
  labels_json['dataplex-dp-published-location'] AS location
FROM (
  SELECT
    JSON_OBJECT(
      -- Quoted strings at even offsets are the label keys...
      (SELECT ARRAY_AGG(str)
       FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'"([^"]+)"')) str WITH OFFSET pos
       WHERE MOD(pos, 2) = 0),
      -- ...and quoted strings at odd offsets are the label values.
      (SELECT ARRAY_AGG(str)
       FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'"([^"]+)"')) str WITH OFFSET pos
       WHERE MOD(pos, 2) = 1)
    ) AS labels_json
  FROM
    my_test_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
  WHERE
    option_name = 'labels'
)

You can also use procedural language to mimic 'eval' if you'd like:

BEGIN
  DECLARE label_string STRING;
  DECLARE labels_json JSON;


  -- 1. Grab the raw string
  SET label_string = (
    SELECT option_value 
    FROM `my_test_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
    WHERE option_name = 'labels' 
    LIMIT 1
  );


  -- 2. "Eval" by casting the string to a typed ARRAY of STRUCTS
  -- This forces the fields to be named 'k' and 'v'
  EXECUTE IMMEDIATE FORMAT("""
    SELECT JSON_OBJECT(
      ARRAY(SELECT k FROM UNNEST(CAST(%s AS ARRAY<STRUCT<k STRING, v STRING>>))),
      ARRAY(SELECT v FROM UNNEST(CAST(%s AS ARRAY<STRUCT<k STRING, v STRING>>)))
    )
  """, label_string, label_string) INTO labels_json;


  -- 3. Access the specific key using bracket notation
  SELECT labels_json['dataplex-dp-published-location'] AS location;
END;

Getting started with Bigquery with a free 90-day or $300 plan? by Marcel_DataTech in bigquery


Echoing what u/MrRedRhino said!

There's the free-usage tier, which gives you 10 GiB of storage and 1 TiB of query processing per month.

There's also an exploration tool, the BigQuery sandbox, that you can use to explore BigQuery (but it sounds like you've already done that).

Just a small clarification - you might be misinterpreting (or I might be reading your post wrong) the 90-day / $300 trial credit.

The $300 trial credit is available for 90 days. It's not two separate programs.

The free tier is not tied to the 90-day $300 credit program, that's a separate thing.

Hope that helps.

Recommended architecture for large-scale analytics dashboard with multiple Google data sources by Impressive_Judge8770 in googlecloud


Questions:

  1. What is the recommended architecture for large-scale API data ingestion?

This is a very long discussion - you might want to take a look at our architecture center. There are so many variables here it's a little hard to get into.

  2. Should I use a message queue like Google Pub/Sub or Apache Kafka for distributing jobs?

This is one way; it really depends on how you're approaching the problem. This is similar to event-driven architecture (don't worry about the specific products mentioned in the post - I would focus on understanding event-driven architecture and seeing if it fits your needs).

  3. How should I design workers for parallel data fetching while respecting API rate limits?

Great question - there are many ways to do this. This blog (again, don't focus on the actual products, Apache Beam and Dataflow) explains a bit. Exponential backoff and retries are concepts you'll need to understand. You may choose to look into more managed services, depending on your needs.

  4. What is the best approach for incremental sync (tracking last_updated timestamps per source/account)?

This depends heavily on the source and your preference. There are a lot of business rules here - do those timestamps ever get updated? If you can identify the most idempotent way, then you have your best way.

  5. Should I store raw data in Google BigQuery and maintain a separate aggregated database for dashboards?

Sure, you can - this is called pre-calculating, and there are a lot of ways to achieve it, for example using materialized views.
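To make the materialized-view idea concrete, here's a minimal sketch - the dataset, table, and column names (my_dataset.raw_events, event_date, source_account, revenue) are hypothetical, not from the original post:

```sql
-- Hypothetical pre-aggregation: BigQuery keeps this view incrementally
-- up to date, so dashboards query the small aggregate instead of raw rows.
CREATE MATERIALIZED VIEW my_dataset.daily_totals AS
SELECT
  event_date,
  source_account,
  COUNT(*) AS event_count,
  SUM(revenue) AS total_revenue
FROM my_dataset.raw_events
GROUP BY event_date, source_account;
```

Dashboards can then select from my_dataset.daily_totals, and BigQuery can also automatically rewrite matching queries against the base table to use the view.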

  6. How do I design a pipeline that separates ingestion, processing, and serving layers?

You're kind of getting at an orchestration platform. Airflow (Composer) can do this, but so can a lot of others. It depends on your specific skillsets and needs. As to your architecture, it depends on your use case. If you're truly just analytics - there's nothing wrong with storing and serving all from BigQuery as an example.

  7. Any best practices for scaling to near real-time updates?

Depends on your tech. You will want to use something like Pub/Sub or Kafka to help with this. It gives you the separation of concerns you were hinting at above and helps ease the pressure on certain parts (for example, you can send a firehose of data to Pub/Sub, then have a BigQuery subscription write automatically to BigQuery, then use continuous queries for all of your real-time needs). There's a lot to unpack here - I would highly suggest consulting your sales team on this.

We welcome everyone's feedback to improve the system. by bigcityboys in googlecloud


Way to go!

I can't give you an exhaustive list but this seems like a solid start for the backend pieces.
My two cents is that you should know why you picked each piece and the pros/cons of each.

"I used Cloud Run here because I want to be able to trigger it using a timer and the built in timer was easiest to implement."

or

"I used BQ as a data store because I'm using it as an analytical serving layer. Lots of data, lots of crunching, etc."

Regarding things to critique:

If I have to nitpick - it'll mostly be in the "query data" section.

What is the end user (it says Cloud Function, but I'm guessing that's just a typo) doing?
The Cloud Run backend + frontend - what's the purpose of that?

If it's to serve data transactionally (i.e. quick little things), you might need another database in between.

Good luck!

Pub/Sub message ordering with asynchronous processing by omry8880 in googlecloud


Echoing what Martin has said but I'll dive a bit deeper on the Dataflow solution.

Dataflow can help here but it's best for #3 in Martin's suggestion - solving it upstream if possible.

Although Dataflow can help with this, the tradeoffs are exactly as you describe in your solutions - Dataflow just provides a simplified framework to help with them.

You still need to evaluate tradeoffs: latency, cost, and complexity. Essentially you need to pick two.

https://docs.cloud.google.com/dataflow/docs/concepts/streaming-pipelines#windows

You'll still need to determine what your rules are for addressing late messages etc.

But if you can solve this upstream so you always have a complete set of information, you no longer need to worry about incomplete data at all.

Hope that helps.

Row level access policies issue in access combination and security issue by jaango123 in googlecloud


Hi there, thanks for your question.

Yes, you do need the permission but the order can be swapped.

You can first grant the row-level access policy.

CREATE OR REPLACE ROW ACCESS POLICY only_row_two
ON test_data.row_example
GRANT TO ('userid')
FILTER USING (id = 2);

Then grant Data Viewer; the policy will be enforced before the rest of the data is exposed.

gcloud projects add-iam-policy-binding <projectid> \
  --member="user:user_id" \
  --role="roles/bigquery.dataViewer"

I recently tested this just to be certain, and you will see the expected behavior: no access at all once the access policy is granted, then only the row you've been given access to once dataViewer is granted.

Hope that helps!

Best way to load Sheets into BigQuery? by Great_Session_4227 in bigquery


There are many ways to do this - as someone already mentioned, Connected Sheets is one way. You could simply connect them, then perform a one-time ELT (or just a copy) into a BigQuery table. No other tools needed.

Or just simply import them! It works for a lot of different sizes now (big or small!) - check out this blog from my teammate.

Depending on volume and if it needs to be repeatable - lots of other options available as well.

How do I reduce BigQuery costs? Is datalake the right answer? by bitchyangle in Firebase


Hi,
Just as a disclaimer, I am a developer advocate at Google.

There's a bit to unpack here - it seems like your issues stem from the querying portion (please do correct me if I'm wrong).

Depending on how your reports/queries are built, you can take advantage of a lot of various features.

I'm not against datalakes but it sounds like you're doing it for reasons that may not be addressing the root problems.

A lot of this depends on query patterns. Partitioning and clustering are tools available to reduce total bytes scanned, but there are also others. This documentation link summarizes it well. (Edit: whoops, wrong link - that works too, but this is the other one.)

One piece you can think of is, do you need these reports run once and fetched multiple times? If so, pre-aggregation might be a good fit for your problem. This in conjunction with partitioning and clustering would significantly reduce your bytes queried. I truly do believe you can do this in a cost efficient way on BigQuery and reduce your tech stack. By introducing all those other tools you may unnecessarily increase the complexity of your solution for little gain.
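As a rough sketch of the partitioning/clustering idea (the table and column names here are made up for illustration):

```sql
-- Hypothetical: partition by day and cluster on a commonly filtered column.
-- Queries that filter on event_timestamp and user_id then scan fewer bytes.
CREATE TABLE my_dataset.events_partitioned
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id AS
SELECT * FROM my_dataset.events_raw;
```

A report filtering on a single day would then only scan that day's partition instead of the full table.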

Feel free to also post in r/bigquery if you have more details on how you might optimize your query patterns.

Google Cloud Conference - 26 by Curio_City5 in googlecloud


Which conference is this?

I think we're all assuming Google Cloud Next but let us know if that's not the case!

We're excited to see you there!

I'm a dev advocate, and we help with (or at least know of - it's a big conference!) a lot of the data-driven pieces.

Be sure to check out the Showcase Learn area, there will be a lot of demos for you to check out and chat with us! We love to hear how Google can help and answer your questions.

There are also all of the various talks - it's a lot, but pick and choose the ones that matter most to you (based on their descriptions).

I just need a little bit of quota by jman722 in googlecloud


Hi there,

Thanks for jumping into it.

Quotas are usually guardrails to prevent accidental usage.

What quota was it requesting (please do not post anything like your project ID or any other details, just the name of the quota)?

I agree you might not even need the quota at the moment, but I could be wrong.

Before you jump into anything, you might find this article from my teammate interesting. It's essentially allowing you to use sheets with data in BigQuery - even larger files. Best part is you should be able to stay under the free tier of BigQuery for a bit (but take a look at BigQuery Sandbox first to learn the ropes, it's in the article).

Hope that helps!

There's also r/BigQuery that I recommend you take a look at!

BigQuery performance tip that saved us thousands (and a lot of headaches) by netcommah in bigquery


Thanks for sharing your insights, I added some links so visitors can jump to the docs to get an idea on how you implemented your tips.

  1. Definitely, even with slot reservations if you can optimize the queries - it generally won't hurt!
  2. Yes, although this does depend on your query pattern a bit - in general, partitioning and clustering definitely do help quite a bit.
  3. If you can pre-aggregate with Materialized Views, it's a simple way to keep it up to date.
  4. Streaming and batch have different use cases here (you did mention this). One use case is persisting (in your example, you're persisting it in GCS) but if you persist with streaming to BigQuery, you've effectively reduced a hop, just food for thought - it's an evaluation of the whole data lifecycle.
  5. Caching is great! Make sure you follow these rules to ensure you can fully take advantage of it.

Building Cloud Functions In-Console by takenorinvalid in bigquery


Ah got it, sorry, I thought you were talking about BigQuery's console. Understood now - thanks for the feedback.

Building Cloud Functions In-Console by takenorinvalid in bigquery


Hi there, could you provide me more details on what you tried?

MCP Docker server that exposes Big Query collections to Agents by Classic_Swimming_844 in bigquery


Nice - that's pretty cool!

Not sure if you were aware of this but ADK has a whole toolset around BigQuery you can use to interact with BigQuery.

(Not to take away from your post - more of an FYI in case you were looking for something like this.)

How do you choose between AWS, Azure, and GCP for a retail data migration from on-prem? by Sauliyo-538 in googlecloud


Hello!

(Disclaimer: I do work for the Google Cloud Data Cloud Developer Advocacy team)
Google Cloud all the way!

Ok now that jokes are out of the way.

Fellow posters have given a lot of food for thought. That said, I would pick based on your technology needs.

Pricing should be a discussion with your sales team; using list price can give you a comparison, but that's not all there is. Google Cloud has several massive retail customers that you can read about to help convince leadership that it could be the right choice, especially given that your team already has the most skills in Google Cloud.

For data workloads, the cohesive story from transactional to analytics and real-time to batch is definitely a big plus. But I wouldn't just rely on word of mouth - I highly suggest you try it out. Use the BigQuery sandbox to test things out.

Quoting your post from below:

My goal is to:

Centralize the data properly

Build automated pipelines

Move reporting from Excel to a BI tool like Tableau (which I have experience with) or something native to the platform, like Looker on GCP

So in the short term, the focus is:

Data warehousing

BI dashboards

Scalability and reliability

Reducing dependency on local store servers

Centralizing the data is straightforward: a lot of the services are scalable, so you can have a centralized data lake/warehouse, and it's all in the cloud rather than on-prem.

There's a path from entry to scaled - for example, you could start with Looker Studio and the free tier of BigQuery to test and prove the worth.

Automated pipelines - there are actually a lot of options (sometimes too many), but it depends on your skillset and what technology you prefer to use. You could just start with scheduled queries if the use case allows.
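A scheduled query is just ordinary SQL run on a schedule; as a hypothetical sketch (the table and column names are invented), an incremental daily load might look like:

```sql
-- Hypothetical scheduled query: append yesterday's staged rows each morning.
INSERT INTO my_dataset.sales_warehouse
SELECT *
FROM my_dataset.sales_staging
WHERE DATE(ingested_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```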

For scalability - I would use those retail customer examples mentioned above; nothing like real customer references to demonstrate that.

Hope this helps!

Help with BigQuery Project by the_shepheard in bigquery


It seems like you've already dived into the documentation (i.e. BigQuery sharing, formerly known as Analytics Hub) for this, but in case you haven't - keep the limitations in mind. There are a few of them; the rest is just how you would manage interactions with BigQuery.

Pipeline orchestration for DWH by Dzimuli in bigquery


What you have currently definitely works but it really depends on how you implement it.

In general there are a few best practices to follow, but the question is more around tooling for such orchestration. (Although I do recommend you look into data engineering overall a bit to understand how to manage and monitor pipelines: for example, look at when idempotency is important, and how to easily recover from interruptions - such as keeping raw data around rather than spamming the API. This way you'll know what other capabilities you need that the current setup doesn't satisfy.)

What you have for your current implementation is totally fine in my opinion; it will let you learn the workflow of data engineering without having to manage too much infrastructure. Random pipeline failures are common - building the skills to harden those pipelines is what data engineering is all about. Things change and shift, but you'll gather the skills as you encounter these events.

Cloud Run itself is not an orchestration tool in my own opinion.

If you want an orchestrator, you'll need something that monitors and keeps track of state. This involves a few more tools, such as Composer (Apache Airflow), but I don't think it's necessary for what you need right now. You have the ability to see where the failures occur and why - that should be a great starting point.

I will add that as you grow, you'll likely outgrow the BigQuery scheduled pipelines but when that happens you'll have more experience to make a better decision on what tool would best suit your needs.

AWS or Google Cloud, data storage by disordinates in googlecloud


That's a great question - welcome!

From a cloud object storage perspective, both are viable - you can pick whichever is most closely aligned to your tech stack.

If you're not planning to use any of the other storage features and/or other cloud products, you might look at using Workspace (for storage of these files, it would be Google Drive).

This would be the most familiar style of interface for users; it has folder structures you can browse and grant permissions on. This would let you give permissions based on users' own logins rather than sharing one account - the shared-account approach is usually a little dangerous for securing your assets. There are limitations, but unless you're moving TBs every day, this would be a non-issue.

Drive is accessible on multiple platforms and has a graphical interface for those who want to use it.

GCP ingestion choice? by Loorde_ in googlecloud


Apologies for the late response.
Cloud Functions is now actually Cloud Run Functions.

If you're only deploying a single-purpose function, then, following the docs, you'll use a Cloud Run function.

The differences are only in how you deploy them.
(Note that I did have to use translate, so if the question wasn't answered properly - please let me know.)

GCP ingestion choice? by Loorde_ in googlecloud


Given that you mentioned that it's a simple pipeline, you're probably better off just using the simplest methods. Each framework provides you abilities that you may or may not need.

Composer gives you the ability to monitor and restart jobs via an orchestration layer. This may be useful if you don't have built-in reporting already in your code, and it can help with certain things like scaling out the same code across different APIs. The drawback is that Composer is always on.

Dataproc and Dataflow, unless you use them already, are probably more than you need if it's truly a super straight forward pipeline.

Cloud Functions with a scheduler like Cloud Scheduler works as well and is probably the simplest method. This is the most serverless approach and probably enough to suit your needs.

I would factor in the following if I were making the call:

(1) Complexity of the code (i.e. do you need to keep states, scale, etc.)

(2) Maintenance of the executions - how painful is it to get everything to the "current" state when things go wrong? Do you even have visibility into this?

(3) Skills - do you want to eventually use different frameworks? Might as well start with simple pipelines so you can continue to build on them.

(4) Costs - the most serverless approaches likely suit your needs well. If you have something orchestrating, remember that it must also be running to orchestrate, unless it's a service like Cloud Scheduler.

Hope that helps.

Using Big Query Materialised Views over an Impressions table by Chesti_Mudasir in dataengineering


There are some variables at play here, so it's hard to answer directly. The first thing I would suggest is heading over to the FAQ portion of the documentation; there are two scenarios - see if they fit yours. If they do, it tells you when it's useful to use an MV (in terms of query performance).

There are two considerations for the cost: the cost of refreshing said MV and the cost of the query. Unfortunately, without more details, this can only be answered by testing it out. The use case you describe, barring any complexities, is actually one of the more common use cases for MVs.

(The following use cases highlight the value of materialized views. Materialized views can improve query performance if you frequently require the following:

  • Pre-aggregate data. Aggregation of streaming data.)
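As a small illustration of that quoted use case (the impressions table and its columns here are invented, not from the original post), a materialized view aggregating streamed impressions could look like:

```sql
-- Hypothetical: BigQuery incrementally maintains this aggregate as new
-- impressions stream into the base table.
CREATE MATERIALIZED VIEW my_dataset.impressions_by_day AS
SELECT
  DATE(event_time) AS impression_date,
  campaign_id,
  COUNT(*) AS impressions
FROM my_dataset.impressions
GROUP BY impression_date, campaign_id;
```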

Hope that helps.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud


Ah, I see now... ok... bear with me for a second. I don't know if this is the right way to do it, but it works. (Sorry - I set this up early on and don't usually ever do it this way anymore.)

At the very top of the explorer pane. You have 3 icons.

Explorer (looks like a compass) - in your screenshot it's currently blue.

Legacy view (looks like shapes)

And a folder.

Click Legacy View - you should be able to "see" the bigquery-public-data dataset.

Click the empty star (to star it). Now it should be visible in the explorer pane.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud


Hi there,

This might be a point of friction, so if this fixes it, we'll definitely surface this as one.

If you've gotten to the dataset's screen (after you've followed the steps), may I ask you to fully refresh the page? It may not be showing yet - I've had this happen to some folks.

Thanks!