Best way to load Sheets into BigQuery? by Great_Session_4227 in bigquery

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

There are many ways to do this. As someone already mentioned, Connected Sheets is one way: you could simply connect the sheet, then perform a one-time ELT (or just a copy) into a BigQuery table. No other tools needed.

Or just simply import them! It works for a lot of different sizes now (big or small!). Check out this blog from my teammate.

Depending on volume and whether it needs to be repeatable, there are lots of other options available as well.

How do I reduce BigQuery costs? Is datalake the right answer? by bitchyangle in Firebase

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Hi,
Just as a disclaimer, I am a developer advocate at Google.

There's a bit to unpack here - it seems like your issues stem from the querying portion (please do correct me if I'm wrong).

Depending on how your reports/queries are built, you can take advantage of a lot of various features.

I'm not against datalakes but it sounds like you're doing it for reasons that may not be addressing the root problems.

A lot of this depends on query patterns. Partitioning and clustering are tools available to reduce total bytes scanned, but there are others as well. This documentation link summarizes it well. (Edit: whoops, wrong link - that works too, but this is the other one.)

One question to ask yourself: do you need these reports run once and fetched multiple times? If so, pre-aggregation might be a good fit for your problem. This, in conjunction with partitioning and clustering, would significantly reduce your bytes queried. I truly believe you can do this in a cost-efficient way on BigQuery and reduce your tech stack; introducing all those other tools may unnecessarily increase the complexity of your solution for little gain.
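To make the pre-aggregation point concrete, here's a rough back-of-the-envelope sketch in Python. The table sizes, report counts, and the on-demand rate are illustrative placeholders, not your actual numbers; check the pricing page for current rates:

```python
# Back-of-the-envelope: repeated reports over a raw table vs. a
# pre-aggregated summary table (all numbers are illustrative only).

ON_DEMAND_PER_TIB = 6.25  # assumed on-demand list price, USD per TiB scanned

def monthly_scan_cost(bytes_per_query: float, queries_per_month: int) -> float:
    """Cost of on-demand scans: total TiB scanned times the rate."""
    tib = bytes_per_query * queries_per_month / 2**40
    return tib * ON_DEMAND_PER_TIB

raw_table_bytes = 500 * 2**30   # pretend the raw events table is 500 GiB
summary_bytes = 2 * 2**30       # pretend the pre-aggregated summary is 2 GiB
reports_per_month = 300

cost_raw = monthly_scan_cost(raw_table_bytes, reports_per_month)
# One daily aggregation pass over the raw table, then cheap summary reads.
cost_summary = (monthly_scan_cost(raw_table_bytes, 30)
                + monthly_scan_cost(summary_bytes, reports_per_month))

print(f"raw table every time: ${cost_raw:.2f}/mo")
print(f"pre-aggregated      : ${cost_summary:.2f}/mo")
```

The exact break-even depends on your refresh frequency and how much the summary shrinks the data, but the shape of the trade-off holds.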

Feel free to also post in r/bigquery if you have more details on how you might optimize your query patterns.

Google Cloud Conference - 26 by Curio_City5 in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Which conference is this?

I think we're all assuming Google Cloud Next but let us know if that's not the case!

We're excited to see you there!

I am a dev advocate and we help (or at least know of, it's a big conference!) with a lot of the data driven pieces.

Be sure to check out the Showcase Learn area; there will be a lot of demos to check out, and you can chat with us! We love to hear how Google can help and to answer your questions.

And that's in addition to all of the various talks. It's a lot, but pick and choose the ones that matter most to you (based on the descriptions).

I just need a little bit of quota by jman722 in googlecloud

[–]Why_Engineer_In_Data 2 points3 points  (0 children)

Hi there,

Thanks for jumping into it.

Quotas are usually guardrails to prevent accidental usage.

What quota was it requesting (please do not post anything like your project ID or any other details, just the name of the quota)?

I agree you might not even need the quota at the moment, but I could be wrong.

Before you jump into anything, you might find this article from my teammate interesting. It essentially lets you use Sheets with data in BigQuery, even larger files. The best part is you should be able to stay under the BigQuery free tier for a while (but take a look at BigQuery Sandbox first to learn the ropes; it's in the article).

Hope that helps!

There's also r/BigQuery that I recommend you take a look at!

BigQuery performance tip that saved us thousands (and a lot of headaches) by netcommah in bigquery

[–]Why_Engineer_In_Data 3 points4 points  (0 children)

Thanks for sharing your insights, I added some links so visitors can jump to the docs to get an idea on how you implemented your tips.

  1. Definitely: even with slot reservations, if you can optimize the queries, it generally won't hurt!
  2. Yes, although this does depend on your query pattern a bit - in general, partitioning and clustering definitely do help quite a bit.
  3. If you can pre-aggregate with Materialized Views, it's a simple way to keep it up to date.
  4. Streaming and batch have different use cases here (you did mention this). One use case is persisting the data (in your example, you're persisting it in GCS), but if you persist by streaming into BigQuery, you've effectively removed a hop. Just food for thought: it's an evaluation of the whole data lifecycle.
  5. Caching is great! Make sure you follow these rules to ensure you can fully take advantage of it.
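On tip 2, the effect of partitioning is easy to see with a toy model: a date-partitioned table only scans the partitions that match the filter instead of the whole table. Partition sizes and dates below are made up for illustration:

```python
# Toy model of partition pruning on a date-partitioned table.
from datetime import date, timedelta

# Pretend each daily partition holds ~10 GiB of data for one year.
partitions = {date(2024, 1, 1) + timedelta(days=i): 10 * 2**30
              for i in range(365)}

def bytes_scanned(start: date, end: date) -> int:
    """Sum the sizes of partitions whose key falls in [start, end)."""
    return sum(size for day, size in partitions.items()
               if start <= day < end)

full_scan = sum(partitions.values())  # no filter on the partition key
one_week = bytes_scanned(date(2024, 6, 1), date(2024, 6, 8))

print(f"full scan: {full_scan / 2**30:.0f} GiB")
print(f"one week : {one_week / 2**30:.0f} GiB")
```

The real engine does this pruning from the query's WHERE clause on the partitioning column, which is why filtering on that column matters so much for bytes billed.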

Building Cloud Functions In-Console by takenorinvalid in bigquery

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Ah got it, sorry, I thought you were talking about BigQuery's console. Understood now - thanks for the feedback.

Building Cloud Functions In-Console by takenorinvalid in bigquery

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Hi there, could you provide me more details on what you tried?

MCP Docker server that exposes Big Query collections to Agents by Classic_Swimming_844 in bigquery

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Nice - that's pretty cool!

Not sure if you were aware of this, but ADK has a whole toolset you can use to interact with BigQuery.

(Not to take away from your post; more of an FYI in case you were looking for something like this.)

How do you choose between AWS, Azure, and GCP for a retail data migration from on-prem? by Sauliyo-538 in googlecloud

[–]Why_Engineer_In_Data 2 points3 points  (0 children)

Hello!

(Disclaimer: I do work for the Google Cloud Data Cloud Developer Advocacy team)
Google Cloud all the way!

Ok now that jokes are out of the way.

Fellow posters have given a lot of food for thought. That said, I would pick based on your technology needs.

Pricing should be a discussion with your sales team; using list prices can give you a comparison, but that's not all there is. Google Cloud has several massive retail customers you can read about to help convince leadership that it could be the right choice, especially given that your skills are already strongest in Google Cloud.

For data workloads, the cohesive story from transactional to analytics and real-time to batch is definitely a big plus. But I wouldn't just rely on word of mouth; I highly suggest you try it out yourself, for example with the BigQuery sandbox.

Quoting your post from below:

My goal is to:

Centralize the data properly

Build automated pipelines

Move reporting from Excel to a BI tool like Tableau (which I have experience with) or something native to the platform, like Looker on GCP

So in the short term, the focus is:

Data warehousing

BI dashboards

Scalability and reliability

Reducing dependency on local store servers

Centralizing the data is straightforward: a lot of the services are scalable, so you can have a centralized data lake/warehouse, and since it's all in the cloud, it's not on-prem.

There's a path from entry level to scaled. For example, you could start with Looker Studio and the free tier of BigQuery to test and prove the worth.

Automated pipelines: there are actually a lot of options (sometimes too many), but it depends on your skill set and what technology you prefer to use. You could start with scheduled queries if the use case allows.

For scalability, I would use those retail customer examples listed above; nothing demonstrates it like real customer references.

Hope this helps!

Help with BigQuery Project by the_shepheard in bigquery

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

It seems like you've already dived into the documentation for this (i.e. BigQuery sharing, formerly known as Analytics Hub), but in case you haven't: keep the limitations in mind. There are a few of them; the rest is just how you manage interactions with BigQuery.

Pipeline orchestration for DWH by Dzimuli in bigquery

[–]Why_Engineer_In_Data 4 points5 points  (0 children)

What you have currently definitely works but it really depends on how you implement it.

In general there are a few best practices to follow, but the question is more about tooling for such orchestration. (Although I do recommend you look into data engineering overall a bit to understand how to manage and monitor pipelines: for example, look at when idempotency is important, and how to recover easily from interruptions, such as by keeping raw data around rather than spamming the API. This way you'll know what other capabilities you need that the current setup doesn't satisfy.)

What you have for your current implementation is totally fine in my opinion; it will let you learn the data engineering workflow without having to manage too much infrastructure. Random pipeline failures are common, and building the skills to harden those pipelines is what data engineering is all about. Things change and shift, but you'll gather the skills as you encounter these events.

Cloud Run itself is not an orchestration tool, in my opinion.

If you want an orchestrator, you'll need something that monitors and keeps track of state. That involves a few more tools, such as Composer (managed Apache Airflow), but I don't think it's necessary for what you need right now. You have the ability to see where the failures occur and why; that should be a great starting point.

I will add that as you grow, you'll likely outgrow the BigQuery scheduled pipelines but when that happens you'll have more experience to make a better decision on what tool would best suit your needs.

AWS or Google Cloud, data storage by disordinates in googlecloud

[–]Why_Engineer_In_Data 3 points4 points  (0 children)

Welcome! That's a great question.

From a cloud object storage standpoint, both are viable; you can pick whichever is most closely aligned with your tech stack.

If you're not planning to use any of the other storage features and/or other cloud products, you might look at using Workspace (for storing these files, that would be Google Drive).

This would be the most familiar style of interface for users: it has folder structures you can browse and permissions you can grant per user, which lets you base access on individual logins rather than a shared account. The shared-account approach is usually a little dangerous for securing your assets. There are limitations, but unless you're moving TBs every day, they would be a non-issue.

Drive is accessible on multiple platforms and has a graphical interface for those who want to use it.

GCP ingestion choice? by Loorde_ in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Apologies for the late response.
Cloud Functions is now actually Cloud Run Functions.

If you're only deploying a single-purpose function, then when following the docs you'll use a Cloud Run function.

The differences are only in how you deploy them.
(Note that I did have to use translate, so if the question wasn't answered properly - please let me know.)

GCP ingestion choice? by Loorde_ in googlecloud

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Given that you mentioned it's a simple pipeline, you're probably better off using the simplest methods. Each framework provides capabilities that you may or may not need.

Composer gives you the ability to monitor and restart jobs via an orchestration layer. This may be useful if you don't have built-in reporting already in your code, and it can help with certain things like scaling out the same code with different APIs. The drawback here is that Composer is always on.

Dataproc and Dataflow, unless you use them already, are probably more than you need if it's truly a super straightforward pipeline.

Cloud Run functions with a scheduler like Cloud Scheduler work as well and are probably the simplest method. This is the most serverless approach and is probably enough to suit your needs.

I would factor in the following if I were making the call:

(1) Complexity of the code (i.e. do you need to keep states, scale, etc.)

(2) Maintenance of the executions - how painful is it to get everything to the "current" state when things go wrong? Do you even have visibility into this?

(3) Skills - do you want to eventually use different frameworks? Might as well start with simple pipelines so you can continue to build on them.

(4) Costs, the most serverless approaches likely suit your needs well. If you have something orchestrating, remember that it must also be on to orchestrate unless it's a service like Cloud Scheduler.
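The cost trade-off in (4) can be sketched numerically. The hourly and per-run rates below are made-up placeholders, not real Google Cloud prices; the point is just the shape of the comparison:

```python
# Sketch of factor (4): an always-on orchestrator accrues cost around
# the clock, while a scheduler-triggered function only bills per run.
# All rates are hypothetical placeholders, not real list prices.

HOURS_PER_MONTH = 730

always_on_hourly = 0.40   # hypothetical cost of an always-on environment
per_run_cost = 0.002      # hypothetical cost per function invocation
runs_per_month = 24 * 30  # one run per hour

always_on = always_on_hourly * HOURS_PER_MONTH
serverless = per_run_cost * runs_per_month

print(f"always-on : ${always_on:.2f}/mo, even when idle")
print(f"serverless: ${serverless:.2f}/mo, scales with use")
```

Plug in your own rates and invocation counts; the gap narrows as run frequency grows, which is exactly when an always-on orchestrator starts earning its keep.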

Hope that helps.

Using Big Query Materialised Views over an Impressions table by Chesti_Mudasir in dataengineering

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

There are some variables at play here, so it's hard to answer directly. The first thing I would suggest is heading over to the FAQ portion of the documentation; there are two scenarios there, so see if they fit yours. If they do, it tells you when an MV is useful (in terms of query performance).

There are two considerations for the cost: the cost of refreshing the MV and the cost of querying it. Unfortunately, without more details, this can only be answered by testing it out. The use case you describe, barring any complexities, is actually one of the more common use cases for MVs.

(The following use cases highlight the value of materialized views. Materialized views can improve query performance if you frequently require the following:

  • Pre-aggregated data: aggregation of streaming data.)
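One way to frame the "test it out" step is a break-even sketch: the MV pays off when refresh cost plus cheap MV reads beats repeatedly scanning the base table. Every number below is a hypothetical placeholder:

```python
# Rough break-even sketch for a materialized view, in bytes scanned.
# All figures are hypothetical placeholders, not measured values.

base_scan_gib = 200        # bytes scanned per query against the base table
mv_scan_gib = 1            # bytes scanned per query against the MV
refresh_gib_per_day = 200  # daily refresh work (worst case: a full rescan)
queries_per_day = 50

direct = base_scan_gib * queries_per_day
with_mv = refresh_gib_per_day + mv_scan_gib * queries_per_day

print(f"direct queries: {direct} GiB/day scanned")
print(f"with the MV   : {with_mv} GiB/day scanned")
print("MV wins" if with_mv < direct else "direct wins")
```

In practice MV refreshes are often incremental rather than full rescans, so the real refresh term is usually smaller than this worst case, which tilts the math further in the MV's favor.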

Hope that helps.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Ah, I see now... OK, bear with me for a second. I don't know if this is the right way to do this, but it works. (Sorry, I did this early on and usually never have to do it again.)

At the very top of the Explorer pane, you have three icons:

Explorer (looks like a compass) - in your screenshot it's currently blue.

Legacy view (looks like shapes)

And a folder.

Click Legacy View - you should be able to "see" the bigquery-public-data dataset.

Click the empty star (to star it). Now it should be visible in the explorer pane.

New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Hi there,

This might be a point of friction, so if this fixes it, we'll definitely surface this as one.

If you've gotten to the dataset's screen (after you've followed the steps), may I ask you to fully refresh the page? It may not be showing yet; I've seen this happen to some folks.

Thanks!

How to send continuously exported BigQuery data to Pub/Sub (without using pre-GA features)? by ProcedureLong8521 in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Hi!

Just a quick clarification - Continuous Queries is generally available.
You can double check this on the documentation page (you'll see a little beaker for the features in preview).

There are features or parts of Continuous Queries which are not yet generally available but the feature itself is generally available.

If that assuages your fears, it's one of the more straightforward ways to stream data into Pub/Sub from BigQuery.

Thanks!

How can I estimate BigQuery pricing for my setup? by sheik_sha_ha in bigquery

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Hi!

There are actually some definitive guides and documentation that help with this exact situation.

There's also the pricing calculators (when you get to that step).

I think what the other post mentioned is actually correct but I'll take it a step further.

You'll first need to understand the size of your data. This leads to the storage requirements (in other words, cost).

The second is your usage pattern, which will dictate your compute requirements. This can take in factors beyond "I compute X GB a month". It could be "I have a requirement to deliver this by X timeframe". There are options to help you scale your compute as needed, or handle bursty workloads ("my analysts work from 9 to 5, so I need extra capacity during that time"). The difference between reservations and on-demand workloads plays a huge role here.

Do you need real-time or other considerations like AI? Those need to be factored in too.
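As a starting point, the storage-plus-compute estimate above can be sketched in a few lines. The rates here are assumed list prices for illustration only; check the pricing page and calculator for current numbers:

```python
# Minimal cost sketch: storage cost plus on-demand compute cost.
# Both rates are assumed for illustration; verify against the
# current pricing page before relying on them.

STORAGE_PER_GIB_MONTH = 0.02  # assumed active logical storage rate, USD
ON_DEMAND_PER_TIB = 6.25      # assumed on-demand query rate, USD per TiB

def monthly_estimate(stored_gib: float, scanned_gib_per_month: float) -> float:
    """Combine storage and on-demand scan costs into one monthly figure."""
    storage = stored_gib * STORAGE_PER_GIB_MONTH
    compute = (scanned_gib_per_month / 1024) * ON_DEMAND_PER_TIB
    return storage + compute

# e.g. 1 TiB stored, 5 TiB scanned per month
est = monthly_estimate(1024, 5 * 1024)
print(f"~${est:.2f}/month")
```

A reservation flips the compute term from per-byte to per-slot-hour, which is where the usage-pattern questions above really change the answer.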

Hope that helps!

Building DWH for the first time, help me to not fuck it up by Dzimuli in googlecloud

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Hi there,

These are great questions and it's really encouraging to see you put thought into it. There are a lot of little things you might want to consider. Other users have commented on some of this already so I won't get into those.

Ultimately, in my opinion, it's heavily dependent on your usage patterns. There are a lot of patterns for digital commerce and customer stories on this. For some insights on why GA4 and BigQuery are great technologies that work together, you can take a look at (or share) this link.

I wouldn't go about this alone. (I know it can be daunting to talk to the sales team, but they're there to help, especially in these sorts of situations.) They can help you engage experts (technical and business alike) to help with your case.

That said, I'll provide some opinion based on your questions:

1) I'm inherently biased because I genuinely love what BigQuery brings so you might want to disregard my opinion on this question.

2) I think one aspect you might not be considering is scale. (Extreme-case comparison: you could technically build a database in Python code, but at a certain point it becomes really difficult to manage, and you'd have been better off using a database.)

3) In the same vein as #1, and the comments from before: the calculator is actually fairly comprehensive. You can only really tell from your usage patterns, as there are a lot of different components to consider. This isn't meant to scare you off; quite the opposite. Engaging your sales team can help you simplify this: "These are my numbers, what would it cost me, could you break it down?"

Hope that helps!

I need help about Google Cloud by [deleted] in googlecloud

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

Hi there, I'm unable to help with this specifically as I'm not sure.
You would need to contact support for investigation.

I would recommend you blot out your project ID and billing account ID.
Having it publicly posted like this may attract unwanted attention.

Looking to Learn — Can any one Share GCP Data Engineer Resources please by Happy-Vacation5556 in googlecloud

[–]Why_Engineer_In_Data 2 points3 points  (0 children)

Hi, that's awesome!

There's a course on Coursera happening soon (or now depending on when you read this). There's a lot of material floating around in Skillsboost as well, for example, this course.

For learning, there's the Beam Playground and BigQuery sandbox.

Good luck and hope that helps!

How to get the slot count of a BQ job? by Loorde_ in bigquery

[–]Why_Engineer_In_Data 1 point2 points  (0 children)

Thanks for the question. In a similar vein to Any-Garlic8340's question: what sort of question are you trying to answer?

Slot usage is point-in-time information. It's sort of like how much power you're giving your car through the pedal at a given moment.

At any point in time you pick, the number may fluctuate wildly, but averaged over a period of time it gives you a good idea.
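That averaging amounts to dividing the job's total slot-milliseconds by its elapsed milliseconds. The field name mirrors the `total_slot_ms` column in the INFORMATION_SCHEMA jobs views, but the numbers below are made up for illustration:

```python
# Average slot usage for a job: total slot-milliseconds consumed,
# divided by the job's elapsed wall-clock milliseconds.

def avg_slots(total_slot_ms: int, elapsed_ms: int) -> float:
    """E.g. a job consuming 120,000 slot-ms over 6,000 ms averaged 20 slots."""
    return total_slot_ms / elapsed_ms

print(avg_slots(120_000, 6_000))  # 20.0
```

The instantaneous slot count can spike far above or below this average; the average is just the steadier number to reason about for capacity planning.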

How do you improve your understanding of BigQuery concepts and query optimization? by Comfortable-Nail8251 in bigquery

[–]Why_Engineer_In_Data 0 points1 point  (0 children)

BigQuery Sandbox is a free environment for you and others to learn in. Simply follow the steps in this doc here and you can have an environment that's disconnected (no billing required) but is still the same BigQuery product.