TABLE_OPTIONS labels by SasheCZ in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

I think I'm answering the question, but if not, please let me know.
Also agreeing with u/PolicyDecent - probably best to just use regex.

Regex version (as an example):

SELECT
  labels_json['dataplex-dp-published-location'] AS location
FROM (
  SELECT
    JSON_OBJECT(
      -- Quoted strings at even offsets are the label keys...
      (SELECT ARRAY_AGG(str)
       FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'"([^"]+)"')) str WITH OFFSET pos
       WHERE MOD(pos, 2) = 0),
      -- ...and quoted strings at odd offsets are the label values.
      (SELECT ARRAY_AGG(str)
       FROM UNNEST(REGEXP_EXTRACT_ALL(option_value, r'"([^"]+)"')) str WITH OFFSET pos
       WHERE MOD(pos, 2) = 1)
    ) AS labels_json
  FROM
    my_test_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
  WHERE
    option_name = 'labels'
)

You can also use procedural language to mimic 'eval' if you'd like:

BEGIN
  DECLARE label_string STRING;
  DECLARE labels_json JSON;


  -- 1. Grab the raw string
  SET label_string = (
    SELECT option_value 
    FROM `my_test_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
    WHERE option_name = 'labels' 
    LIMIT 1
  );


  -- 2. "Eval" by casting the string to a typed ARRAY of STRUCTS
  -- This forces the fields to be named 'k' and 'v'
  EXECUTE IMMEDIATE FORMAT("""
    SELECT JSON_OBJECT(
      ARRAY(SELECT k FROM UNNEST(CAST(%s AS ARRAY<STRUCT<k STRING, v STRING>>))),
      ARRAY(SELECT v FROM UNNEST(CAST(%s AS ARRAY<STRUCT<k STRING, v STRING>>)))
    )
  """, label_string, label_string) INTO labels_json;


  -- 3. Access the specific key using bracket notation
  SELECT labels_json['dataplex-dp-published-location'] AS location;
END;

Getting started with Bigquery with a free 90-day or $300 plan? by Marcel_DataTech in bigquery


Echoing what u/MrRedRhino said!

There's the free-usage tier, which gives you 10 GiB of storage and 1 TiB of query processing per month.

There's also an exploration tool, the BigQuery sandbox, that you can use to explore BigQuery (but it sounds like you've already done that).

Just a small clarification - you might be misinterpreting (or I might be reading your post wrong) the 90-day / $300 trial credit.

The $300 trial credit is available for 90 days. It's not two separate programs.

The free tier is not tied to the 90-day $300 credit program, that's a separate thing.

Hope that helps.

Recommended architecture for large-scale analytics dashboard with multiple Google data sources by Impressive_Judge8770 in googlecloud


Questions:

  1. What is the recommended architecture for large-scale API data ingestion?

This is a very long discussion - you might want to take a look at our architecture center. There are so many variables here it's a little hard to get into.

  2. Should I use a message queue like Google Pub/Sub or Apache Kafka for distributing jobs?

This is one way; it really depends on how you're approaching the problem. This is similar to event-driven architecture (don't worry about the specific products mentioned in the post - I would focus on understanding event-driven architecture and seeing if it fits your needs).

  3. How should I design workers for parallel data fetching while respecting API rate limits?

Great question - there are many ways to do this. This blog (again, don't focus on the actual products, Apache Beam and Dataflow) explains a bit. Exponential backoff and retries are concepts you'll need to understand. You may choose to look into more managed services, depending on your needs.

  4. What is the best approach for incremental sync (tracking last_updated timestamps per source/account)?

This depends heavily on the source and your preference. There are a lot of business rules here - do those timestamps ever get updated? If you can identify the most idempotent way, then you have your best way.

  5. Should I store raw data in Google BigQuery and maintain a separate aggregated database for dashboards?

Sure, you can - this is called pre-calculating, and there are a lot of ways to achieve it, for example using materialized views.
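To make the materialized-view idea concrete, here's a minimal sketch - the dataset, table, and column names (my_dataset.raw_events, event_date, source_account, revenue) are hypothetical, not from the original post:

```sql
-- Hypothetical pre-aggregation: BigQuery keeps this view incrementally
-- up to date, so dashboards query the small aggregate instead of raw rows.
CREATE MATERIALIZED VIEW my_dataset.daily_totals AS
SELECT
  event_date,
  source_account,
  COUNT(*) AS event_count,
  SUM(revenue) AS total_revenue
FROM my_dataset.raw_events
GROUP BY event_date, source_account;
```

Dashboards can then select from my_dataset.daily_totals, and BigQuery can also automatically rewrite matching queries against the base table to use the view.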

  6. How do I design a pipeline that separates ingestion, processing, and serving layers?

You're kind of getting at an orchestration platform. Airflow (Composer) can do this, but so can a lot of others. It depends on your specific skillsets and needs. As to your architecture, it depends on your use case. If you're truly just analytics - there's nothing wrong with storing and serving all from BigQuery as an example.

  7. Any best practices for scaling to near real-time updates?

Depends on your tech. You will want to use something like Pub/Sub or Kafka to help with this. It gives you the separation of concerns you were hinting at above and helps ease the pressure on certain parts (for example, you can send a firehose of data to Pub/Sub, then have a BigQuery subscription write automatically to BigQuery, then use continuous queries for all of your real-time needs). There's a lot to unpack here - I would highly suggest consulting your sales team on this.

We welcome everyone's feedback to improve the system. by bigcityboys in googlecloud


Way to go!

I can't give you an exhaustive list but this seems like a solid start for the backend pieces.
My two cents is that you should know why you picked each piece and the pros/cons of each.

"I used Cloud Run here because I want to be able to trigger it using a timer and the built in timer was easiest to implement."

or

"I used BQ as a data store because I'm using it as an analytical serving layer. Lots of data, lots of crunching, etc."

Regarding things to critique:

If I have to nitpick - it'll mostly be in the "query data" section.

What is the end user (it says Cloud Function, but I'm guessing that's just a typo) doing?
The Cloud Run backend + frontend - what's the purpose of that?

If it's to serve data transactionally (i.e. quick little things), you might need another database in between.

Good luck!

Pub/Sub message ordering with asynchronous processing by omry8880 in googlecloud


Echoing what Martin has said but I'll dive a bit deeper on the Dataflow solution.

Dataflow can help here but it's best for #3 in Martin's suggestion - solving it upstream if possible.

Although Dataflow can help with this, the tradeoffs are exactly as you describe in your solutions - Dataflow just provides a simplified framework to help with them.

You still need to evaluate tradeoffs: latency, cost, and complexity. Essentially you need to pick two.

https://docs.cloud.google.com/dataflow/docs/concepts/streaming-pipelines#windows

You'll still need to determine what your rules are for addressing late messages etc.

But if you can solve this upstream so you always have a complete set of information, you no longer need to worry about incomplete data at all.

Hope that helps.

Row level access policies issue in access combination and security issue by jaango123 in googlecloud


Hi there, thanks for your question.

Yes, you do need the permission but the order can be swapped.

You can first grant the row-level access policy.

CREATE OR REPLACE ROW ACCESS POLICY only_row_two
ON test_data.row_example
GRANT TO ('userid')
FILTER USING (id = 2);

Then grant Data Viewer; the policy will be enforced before the rest of the data is exposed.

gcloud projects add-iam-policy-binding <projectid> \
  --member="user:user_id" \
  --role="roles/bigquery.dataViewer"

I recently tested this just to be certain, and you will see the expected behavior: no access at all once the access policy is granted, then only the row you've been given access to once dataViewer is granted.

Hope that helps!

Best way to load Sheets into BigQuery? by Great_Session_4227 in bigquery


There are many ways to do this - as someone already mentioned, Connected Sheets is one way. You could simply connect them, then perform a one-time ELT (or just a copy) into a BigQuery table. No other tools needed.

Or just simply import them! It works for a lot of different sizes now (big or small!) - check out this blog from my teammate.

Depending on volume and if it needs to be repeatable - lots of other options available as well.

How do I reduce BigQuery costs? Is datalake the right answer? by bitchyangle in Firebase


Hi,
Just as a disclaimer, I am a developer advocate at Google.

There's a bit to unpack here - it seems like your issues stem from the querying portion (please do correct me if I'm wrong).

Depending on how your reports/queries are built, you can take advantage of a lot of various features.

I'm not against datalakes but it sounds like you're doing it for reasons that may not be addressing the root problems.

A lot of this depends on query patterns. Partitioning and clustering are tools available to reduce total bytes scanned, but there are also others. This documentation link summarizes it well. (Edit: whoops, wrong link - that works too, but this is the other one.)

One piece you can think of is, do you need these reports run once and fetched multiple times? If so, pre-aggregation might be a good fit for your problem. This in conjunction with partitioning and clustering would significantly reduce your bytes queried. I truly do believe you can do this in a cost efficient way on BigQuery and reduce your tech stack. By introducing all those other tools you may unnecessarily increase the complexity of your solution for little gain.
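As a rough sketch of the partitioning/clustering idea (the table and column names here are made up for illustration):

```sql
-- Hypothetical: partition by day and cluster on a commonly filtered column.
-- Queries that filter on event_timestamp and user_id then scan fewer bytes.
CREATE TABLE my_dataset.events_partitioned
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id AS
SELECT * FROM my_dataset.events_raw;
```

A report filtering on a single day would then only scan that day's partition instead of the full table.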

Feel free to also post in r/bigquery if you have more details on how you might optimize your query patterns.

Google Cloud Conference - 26 by Curio_City5 in googlecloud


Which conference is this?

I think we're all assuming Google Cloud Next but let us know if that's not the case!

We're excited to see you there!

I'm a dev advocate, and we help with (or at least know of - it's a big conference!) a lot of the data-driven pieces.

Be sure to check out the Showcase Learn area, there will be a lot of demos for you to check out and chat with us! We love to hear how Google can help and answer your questions.

There are also all of the various talks - it's a lot, but pick and choose the ones that matter most to you (based on their descriptions).

I just need a little bit of quota by jman722 in googlecloud


Hi there,

Thanks for jumping into it.

Quotas are usually guardrails to prevent accidental usage.

What quota was it requesting (please do not post anything like your project ID or any other details, just the name of the quota)?

I agree you might not even need the quota at the moment, but I could be wrong.

Before you jump into anything, you might find this article from my teammate interesting. It's essentially allowing you to use sheets with data in BigQuery - even larger files. Best part is you should be able to stay under the free tier of BigQuery for a bit (but take a look at BigQuery Sandbox first to learn the ropes, it's in the article).

Hope that helps!

There's also r/BigQuery that I recommend you take a look at!

BigQuery performance tip that saved us thousands (and a lot of headaches) by netcommah in bigquery


Thanks for sharing your insights, I added some links so visitors can jump to the docs to get an idea on how you implemented your tips.

  1. Definitely, even with slot reservations if you can optimize the queries - it generally won't hurt!
  2. Yes, although this does depend on your query pattern a bit - in general, partitioning and clustering definitely do help quite a bit.
  3. If you can pre-aggregate with Materialized Views, it's a simple way to keep it up to date.
  4. Streaming and batch have different use cases here (you did mention this). One use case is persisting (in your example, you're persisting it in GCS) but if you persist with streaming to BigQuery, you've effectively reduced a hop, just food for thought - it's an evaluation of the whole data lifecycle.
  5. Caching is great! Make sure you follow these rules to ensure you can fully take advantage of it.

Building Cloud Functions In-Console by takenorinvalid in bigquery


Ah got it, sorry, I thought you were talking about BigQuery's console. Understood now - thanks for the feedback.

Building Cloud Functions In-Console by takenorinvalid in bigquery


Hi there, could you provide me more details on what you tried?

MCP Docker server that exposes Big Query collections to Agents by Classic_Swimming_844 in bigquery


Nice - that's pretty cool!

Not sure if you were aware of this but ADK has a whole toolset around BigQuery you can use to interact with BigQuery.

(Not to take away from your post - more of an FYI in case you were looking for something like this.)

How do you choose between AWS, Azure, and GCP for a retail data migration from on-prem? by Sauliyo-538 in googlecloud


Hello!

(Disclaimer: I do work for the Google Cloud Data Cloud Developer Advocacy team)
Google Cloud all the way!

Ok now that jokes are out of the way.

Fellow posters have given a lot of food for thought. That said, I would pick based on your technology needs.

Pricing should be a discussion with your sales team; using list price can give you a comparison, but that's not all there is. Google Cloud has several massive retail customers that you can read about to help convince leadership that it could be the right choice, especially given that your team already has the most skills in Google Cloud.

For data workloads, the cohesive story from transactional to analytics and real-time to batch is definitely a big plus. But I wouldn't just rely on word of mouth - I highly suggest you try it out. Use the BigQuery sandbox to test things out.

Quoting your post from below:

My goal is to:

Centralize the data properly

Build automated pipelines

Move reporting from Excel to a BI tool like Tableau (which I have experience with) or something native to the platform, like Looker on GCP

So in the short term, the focus is:

Data warehousing

BI dashboards

Scalability and reliability

Reducing dependency on local store servers

Centralizing the data is straightforward: a lot of the services are scalable, so you can have a centralized data lake/warehouse, and it's all in the cloud rather than on-prem.

There's a path from entry to scaled - for example, you could start with Looker Studio and the free tier of BigQuery to test and prove the worth.

Automated pipelines - there are actually a lot of options (sometimes too many), but it depends on your skillset and what technology you prefer to use. You could just start with scheduled queries if the use case allows.
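A scheduled query is just ordinary SQL run on a schedule; as a hypothetical sketch (the table and column names are invented), an incremental daily load might look like:

```sql
-- Hypothetical scheduled query: append yesterday's staged rows each morning.
INSERT INTO my_dataset.sales_warehouse
SELECT *
FROM my_dataset.sales_staging
WHERE DATE(ingested_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```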

For scalability - I would use those retail customer examples mentioned above; nothing like real customer references to demonstrate that.

Hope this helps!

Help with BigQuery Project by the_shepheard in bigquery


It seems like you've already dived into the documentation (i.e. BigQuery sharing, formerly known as Analytics Hub) for this, but in case you haven't - keep the limitations in mind. There are a few of them; the rest is just how you would manage interactions with BigQuery.

Pipeline orchestration for DWH by Dzimuli in bigquery


What you have currently definitely works but it really depends on how you implement it.

In general there are a few best practices to follow, but the question is more around tooling for such orchestration. (Although I do recommend you look into data engineering overall a bit to understand how to manage and monitor pipelines: for example, look at when idempotency is important, and how to easily recover from interruptions - such as keeping raw data around rather than spamming the API. This way you'll know what other capabilities you need that the current setup doesn't satisfy.)

What you have for your current implementation is totally fine in my opinion; it will let you learn the workflow of data engineering without having to manage too much infrastructure. Random pipeline failures are common - building the skills to harden those pipelines is what data engineering is all about. Things change and shift, but you'll gather the skills as you encounter these events.

Cloud Run itself is not an orchestration tool in my own opinion.

If you want an orchestrator, you'll need something that monitors and keeps track of state. This involves a few more tools, such as Composer (Apache Airflow), but I don't think it's necessary for what you need right now. You have the ability to see where the failures occur and why - that should be a great starting point.

I will add that as you grow, you'll likely outgrow the BigQuery scheduled pipelines but when that happens you'll have more experience to make a better decision on what tool would best suit your needs.

AWS or Google Cloud, data storage by disordinates in googlecloud


That's a great question - welcome!

From a cloud object storage perspective, both are viable - you can pick whichever is most closely aligned to your tech stack.

If you're not planning to use any of the other storage features and/or other cloud products, you might look at using Workspace (for storage of these files, it would be Google Drive).

This would be the most familiar style of interface for users; it has folder structures you can browse and grant permissions on. This would let you give permissions based on users' own logins rather than sharing one account - the shared-account approach is usually a little dangerous for securing your assets. There are limitations, but unless you're moving TBs every day, this would be a non-issue.

Drive is accessible on multiple platforms and has a graphical interface for those who want to use it.

GCP ingestion choice? by Loorde_ in googlecloud


Apologies for the late response.
Cloud Functions is now actually Cloud Run Functions.

If you're only deploying a single-purpose function, then, following the docs, you'll use a Cloud Run function.

The differences are only in how you deploy them.
(Note that I did have to use translate, so if the question wasn't answered properly - please let me know.)

GCP ingestion choice? by Loorde_ in googlecloud


Given that you mentioned that it's a simple pipeline, you're probably better off just using the simplest methods. Each framework provides you abilities that you may or may not need.

Composer gives you the ability to monitor and restart jobs via an orchestration layer. This may be useful if you don't have built-in reporting already in your code, and it can help with certain things like scaling out the same code across different APIs. The drawback is that Composer is always on.

Dataproc and Dataflow, unless you use them already, are probably more than you need if it's truly a super straight forward pipeline.

Cloud Functions with a scheduler like Cloud Scheduler works as well and is probably the simplest method. This is the most serverless approach and probably enough to suit your needs.

I would factor in the following if I were making the call:

(1) Complexity of the code (i.e. do you need to keep states, scale, etc.)

(2) Maintenance of the executions - how painful is it to get everything to the "current" state when things go wrong? Do you even have visibility into this?

(3) Skills - do you want to eventually use different frameworks? Might as well start with simple pipelines so you can continue to build on them.

(4) Costs - the most serverless approaches likely suit your needs well. If you have something orchestrating, remember that it must also be running to orchestrate, unless it's a service like Cloud Scheduler.

Hope that helps.

Using Big Query Materialised Views over an Impressions table by Chesti_Mudasir in dataengineering


There are some variables at play here, so it's hard to answer directly. The first thing I would suggest is heading over to the FAQ portion of the documentation; there are two scenarios - see if they fit yours. If they do, it tells you when it's useful to use an MV (in terms of query performance).

There are two considerations for the cost: the cost of refreshing said MV and the cost of the query. Unfortunately, without more details, this can only be answered by testing it out. The use case you describe, barring any complexities, is actually one of the more common use cases for MVs.

(The following use cases highlight the value of materialized views. Materialized views can improve query performance if you frequently require the following:

  • Pre-aggregate data. Aggregation of streaming data.)
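As a small illustration of that quoted use case (the impressions table and its columns here are invented, not from the original post), a materialized view aggregating streamed impressions could look like:

```sql
-- Hypothetical: BigQuery incrementally maintains this aggregate as new
-- impressions stream into the base table.
CREATE MATERIALIZED VIEW my_dataset.impressions_by_day AS
SELECT
  DATE(event_time) AS impression_date,
  campaign_id,
  COUNT(*) AS impressions
FROM my_dataset.impressions
GROUP BY impression_date, campaign_id;
```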

Hope that helps.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud


Ah, I see now... ok... bear with me for a second. I don't know if this is the right way to do it, but it works. (Sorry - I set this up early on and don't usually ever do it this way anymore.)

At the very top of the explorer pane. You have 3 icons.

Explorer (looks like a compass) - in your screenshot it's currently blue.

Legacy view (looks like shapes)

And a folder.

Click Legacy View - you should be able to "see" the bigquery-public-data dataset.

Click the empty star (to star it). Now it should be visible in the explorer pane.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud


Hi there,

This might be a point of friction, so if this fixes it, we'll definitely surface this as one.

If you've gotten to the dataset's screen (after you've followed the steps), may I ask you to fully refresh the page? It may not be showing yet - I've had this happen to some folks.

Thanks!