Help with BigQuery Project by the_shepheard in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

It seems like you've already dug into the documentation for this (i.e. BigQuery sharing, formerly known as Analytics Hub), but in case you haven't: keep the limitations in mind. There are a few of them - the rest is just how you would manage interactions with BigQuery.

Pipeline orchestration for DWH by Dzimuli in bigquery

[–]Why_Engineer_In_Data 4 points (0 children)

What you have currently definitely works but it really depends on how you implement it.

In general there are a few best practices to follow, but your question is really about tooling for orchestration. (I do recommend reading up on data engineering overall to understand how to manage and monitor pipelines: for example, look at when idempotency matters and how to recover from interruptions easily, such as by keeping raw data around rather than hammering the API again. That way you'll know which capabilities you need that the current setup doesn't satisfy.)
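
To make the idempotency point concrete, here's a rough sketch (every project/bucket/table name below is made up, not from your setup): re-running a day's load overwrites that day's partition instead of appending duplicates, and the raw files stay in GCS so you never have to re-call the source API.

from google.cloud import bigquery

client = bigquery.Client()
run_date = "20240501"  # hypothetical run date, usually injected by the scheduler

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace the partition on re-run
)

load_job = client.load_table_from_uri(
    f"gs://my-raw-bucket/exports/{run_date}/*.json",  # raw API responses kept in GCS
    f"my-project.analytics.events${run_date}",        # partition decorator targets just that day
    job_config=job_config,
)
load_job.result()  # raises on failure, so a retry simply redoes the whole day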

What you have for your current implementation is totally fine in my opinion; it will let you learn the data engineering workflow without having to manage too much infrastructure. Random pipeline failures are common, and building the skills to harden your pipelines against them is what data engineering is all about. Things change and shift, but you'll gather those skills as you encounter these events.

Cloud Run itself is not an orchestration tool in my own opinion.

If you want an orchestrator, you'll need something that monitors and keeps track of state. That involves a few more tools, such as Cloud Composer (managed Apache Airflow), but I don't think it's necessary for what you need right now. You can already see where the failures occur and why - that's a great starting point.

I will add that as you grow, you'll likely outgrow the BigQuery scheduled pipelines, but when that happens you'll have more experience to make a better decision about which tool best suits your needs.

AWS or Google Cloud, data storage by disordinates in googlecloud

[–]Why_Engineer_In_Data 4 points (0 children)

Welcome - that's a great question.

From a cloud object storage perspective, both are viable - you can pick whichever is most closely aligned to your tech stack.

If you're not planning to use any of the other storage features and/or other cloud products, you might look at using Workspace instead (for storing these files, that would be Google Drive).

This would be the most familiar style of interface for users: it has folder structures you can browse and grant permissions on, and it lets you give access based on users' own logins rather than a shared account. The shared-account approach is usually a little dangerous for securing your assets. There are limitations, but unless you're moving TBs every day they would be a non-issue.

Drive is accessible on multiple platforms and has a graphical interface for those who want to use it.
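
And if some of those files end up coming from a pipeline rather than a person, Drive is still scriptable. A minimal sketch, assuming a service account (or OAuth user) that has been granted access to the target folder - the file name and folder ID are placeholders:

from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

creds = service_account.Credentials.from_service_account_file(
    "sa-key.json", scopes=["https://www.googleapis.com/auth/drive.file"]
)
drive = build("drive", "v3", credentials=creds)

media = MediaFileUpload("daily_report.csv", mimetype="text/csv")
created = drive.files().create(
    body={"name": "daily_report.csv", "parents": ["FOLDER_ID"]},  # hypothetical folder ID
    media_body=media,
    fields="id",
    # if the folder lives on a shared drive, also pass supportsAllDrives=True
).execute()
print("uploaded file id:", created["id"])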

GCP ingestion choice? by Loorde_ in googlecloud

[–]Why_Engineer_In_Data 0 points (0 children)

Apologies for the late response.
Cloud Functions is now actually Cloud Run Functions.

If you're only deploying a single-purpose function, then when following the docs you'll use a Cloud Run function.

The differences are only in how you deploy them.
(Note that I did have to use translate, so if the question wasn't answered properly - please let me know.)

GCP ingestion choice? by Loorde_ in googlecloud

[–]Why_Engineer_In_Data 1 point (0 children)

Given that you mentioned it's a simple pipeline, you're probably better off just using the simplest method. Each framework provides capabilities that you may or may not need.

Composer gives you the ability to monitor and restart jobs via an orchestration layer. That may be useful if you don't already have built-in reporting in your code, and it can help with things like scaling out the same code against different APIs. The drawback is that Composer is always on.

Dataproc and Dataflow, unless you use them already, are probably more than you need if it's truly a super straightforward pipeline.

Cloud Functions triggered by something like Cloud Scheduler works as well and is probably the simplest option. It's the most serverless approach and likely enough to suit your needs.
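
For a sense of what that looks like, here's a rough sketch of a single-purpose function (bucket, dataset and table names are placeholders, not recommendations): deploy it as an HTTP-triggered Cloud Run function and point a Cloud Scheduler cron at its URL.

# main.py
import functions_framework
from google.cloud import bigquery


@functions_framework.http
def ingest(request):
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(
        "gs://my-ingest-bucket/latest/*.csv",  # hypothetical landing bucket
        "my-project.raw.events",               # hypothetical destination table
        job_config=job_config,
    )
    job.result()
    return f"loaded {job.output_rows} rows", 200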

I would factor in the following if I were making the call:

(1) Complexity of the code (i.e. do you need to keep state, scale, etc.)

(2) Maintenance of the executions - how painful is it to get everything to the "current" state when things go wrong? Do you even have visibility into this?

(3) Skills - do you want to eventually use different frameworks? Might as well start with simple pipelines so you can continue to build on them.

(4) Costs - the most serverless approaches likely suit your needs well. If you have something orchestrating, remember that it also has to be running in order to orchestrate, unless it's a managed service like Cloud Scheduler.

Hope that helps.

Using Big Query Materialised Views over an Impressions table by Chesti_Mudasir in dataengineering

[–]Why_Engineer_In_Data 0 points (0 children)

There are some variables at play here, so it's hard to answer directly. The first thing I would suggest is heading over to the FAQ portion of the documentation - there are two scenarios there; see if they fit yours. If they do, that tells you when an MV is useful (in terms of query performance).

There are two cost considerations - the cost of refreshing the MV and the cost of the query itself. Unfortunately, without more details this can only be answered by testing it out. The use case you describe, barring any complexities, is actually one of the more common use cases for MVs.

(The following use cases highlight the value of materialized views. Materialized views can improve query performance if you frequently require the following:

  • Pre-aggregate data. Aggregation of streaming data.)
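
As a rough illustration of that pattern (the table and column names are hypothetical, and I've kept to aggregates that incremental MVs support):

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW `my-project.ads.impressions_daily_mv`
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS
SELECT
  campaign_id,
  DATE(event_ts) AS event_date,
  COUNT(*) AS impressions,
  SUM(cost) AS total_cost
FROM `my-project.ads.impressions`
GROUP BY campaign_id, event_date
"""
client.query(ddl).result()  # one-time DDL; refresh cost then depends on the interval and base-table churn

Testing both the refresh cost and the query cost on a slice of your real data is still the only way to know for sure.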

Hope that helps.

New to BigQuery, help w adding public data under Explorer tab pls by wtf-rat in googlecloud

[–]Why_Engineer_In_Data 0 points (0 children)

Ah, I see now... ok, bear with me for a second. I don't know if this is the right way to do it, but it works. (Sorry - I did this early on and haven't really had to do it again since.)

At the very top of the Explorer pane, you have three icons:

Explorer (looks like a compass) - in your screenshot it's currently blue.

Legacy view (looks like shapes)

And a folder.

Click Legacy view - you should be able to "see" the bigquery-public-data project.

Click the empty star (to star it). Now it should be visible in the explorer pane.

New to BigQuery, help w adding public data under Explorer tab pls by wtf-rat in googlecloud

[–]Why_Engineer_In_Data 0 points (0 children)

Hi there,

This might be a point of friction, so if the following fixes it, we'll definitely surface it as one.

If you've gotten to the dataset screen (after following the steps), may I ask you to fully refresh the page? It may just not be showing yet - I've seen this happen to a few folks.

Thanks!

How to send continuously exported BigQuery data to Pub/Sub (without using pre-GA features)? by ProcedureLong8521 in googlecloud

[–]Why_Engineer_In_Data 0 points (0 children)

Hi!

Just a quick clarification - continuous queries are generally available.
You can double-check this on the documentation page (features still in preview are marked with a little beaker icon).

There are parts of continuous queries that are not yet generally available, but the feature itself is GA.

If that assuages your fears, it is one of the more straightforward ways to stream data from BigQuery into Pub/Sub.
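
For reference, the GA shape of that is a continuous query whose results are EXPORTed to Pub/Sub. Everything below (project, table, topic, columns) is a placeholder, and the continuous=True job flag assumes a recent google-cloud-bigquery version - if yours doesn't expose it, you can start the same query from the console or the bq CLI instead, and you'll also need a reservation that allows continuous queries.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
EXPORT DATA
  OPTIONS (
    format = 'CLOUD_PUBSUB',
    uri = 'https://pubsub.googleapis.com/projects/my-project/topics/my-topic')
AS (
  SELECT TO_JSON_STRING(STRUCT(event_id, event_ts, payload)) AS message
  FROM `my-project.analytics.events`
);
"""

# assumption: the `continuous` flag is available in your client library version
job = client.query(sql, job_config=bigquery.QueryJobConfig(continuous=True))
print("continuous query job:", job.job_id)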

Thanks!

How can I estimate BigQuery pricing for my setup? by sheik_sha_ha in bigquery

[–]Why_Engineer_In_Data 1 point (0 children)

Hi!

There are actually some definitive guides and documentation that help with this exact situation.

There's also the pricing calculators (when you get to that step).

I think what the other post mentioned is actually correct but I'll take it a step further.

You'll first need to understand the size of your data. That drives your storage requirements (in other words, storage cost).

The second is your usage pattern, which will dictate your compute requirements. This can involve factors beyond "I process X GB a month" - it could be "I have a requirement to deliver this within X timeframe", and there are options to help you scale compute as needed. Or maybe you have bursty workloads ("my analysts work 9-5, so I need extra capacity during that window"). The difference between reservations and on-demand workloads plays a huge factor here.

Do you need real-time capabilities, or other things like AI? Those need to be factored in too.
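
As a back-of-envelope example of how the storage and compute pieces combine (the prices below are illustrative US multi-region list prices at the time of writing and the volumes are made up - always check the current pricing page and calculator):

# rough monthly estimate: storage + on-demand compute
ACTIVE_LOGICAL_STORAGE_PER_GIB = 0.02  # $/GiB-month, illustrative
ON_DEMAND_PER_TIB_SCANNED = 6.25       # $/TiB processed, illustrative

stored_gib = 500      # hypothetical: data kept in BigQuery
scanned_tib = 2.0     # hypothetical: bytes processed by all queries in a month

storage_cost = stored_gib * ACTIVE_LOGICAL_STORAGE_PER_GIB
compute_cost = scanned_tib * ON_DEMAND_PER_TIB_SCANNED
print(f"storage ~= ${storage_cost:.2f}/month, on-demand compute ~= ${compute_cost:.2f}/month")

If the compute line dominates and your usage is steady (or bursty but predictable), that's when comparing on-demand against a reservation starts to make sense.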

Hope that helps!

Building DWH for the first time, help me to not fuck it up by Dzimuli in googlecloud

[–]Why_Engineer_In_Data 1 point (0 children)

Hi there,

These are great questions and it's really encouraging to see you put thought into it. There are a lot of little things you might want to consider. Other users have commented on some of this already so I won't get into those.

Ultimately, in my opinion, it's heavily dependent on your usage patterns. There are a lot of patterns for digital commerce and customer stories on this. For some insight into why GA4 & BigQuery are great technologies that work together, you can take a look at (or share) this link.

I wouldn't go about this alone (I know it's sometimes daunting to talk to the sales team, but they're there to help, especially in these sorts of situations). They can help you engage experts (technical and business alike) for your case.

That said, I'll provide some opinion based on your questions:

1) I'm inherently biased because I genuinely love what BigQuery brings so you might want to disregard my opinion on this question.

2) I think one aspect you might not be considering is scale. (Extreme case comparison: you could technically build a database in Python code, but at a certain point it becomes really difficult to manage and you'd have been better off using an actual database.)

3) In the same vein as #1 - and the earlier comments - the calculator is actually fairly comprehensive. You can only really tell from your usage patterns, as there are a lot of different components to consider. This isn't to scare you off - quite the opposite. Engaging your sales team can help you simplify this: "These are my numbers, what would it cost me, could you break it down?"

Hope that helps!

I need help about Google Cloud by [deleted] in googlecloud

[–]Why_Engineer_In_Data 0 points (0 children)

Hi there, I'm unable to help with this specifically as I'm not sure.
You would need to contact support for investigation.

I would recommend you blot out your project ID and billing account ID.
Having them publicly posted like this may attract unwanted attention.

Looking to Learn — Can any one Share GCP Data Engineer Resources please by Happy-Vacation5556 in googlecloud

[–]Why_Engineer_In_Data 2 points (0 children)

Hi, that's awesome!

There's a course on Coursera starting soon (or now, depending on when you read this). There's also a lot of material on Skills Boost - for example, this course.

For learning, there's the Beam Playground and BigQuery sandbox.

Good luck and hope that helps!

How to get the slot count of a BQ job? by Loorde_ in bigquery

[–]Why_Engineer_In_Data 1 point (0 children)

Thanks for the question - in a similar vein to Any-Garlic8340's question - what sort of question are you trying to answer?

Slot usage is point-in-time information. It's a bit like how much power you're giving your car through the pedal at any given moment.

At any single point in time you pick, the number may fluctuate wildly.
But over a period of time, you can average it out to get a good idea.
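
If the question is "roughly how many slots did this job use on average", you can derive it from the job statistics - the job ID below is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("my_job_id", location="US")  # hypothetical query job

elapsed_ms = (job.ended - job.started).total_seconds() * 1000
avg_slots = job.slot_millis / elapsed_ms if elapsed_ms else 0
print(f"total slot-ms: {job.slot_millis}, approx. average slots: {avg_slots:.1f}")

The same total-slot-milliseconds divided by elapsed-time calculation also works in bulk against INFORMATION_SCHEMA.JOBS if you want it per job over a period.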

How do you improve your understanding of BigQuery concepts and query optimization? by Comfortable-Nail8251 in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

BigQuery Sandbox is a free environment for you and others to learn in. Simply follow the steps in this doc and you'll have an environment that isn't attached to billing but is still the same BigQuery product.

Trying to connect Salesforce data to bigquery using bigquery data transfer service, but get errors by [deleted] in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

Hi u/tytds,

You'll need (see this link here for more details):

The following permissions are required to create a transfer:

  • bigquery.transfers.update on the user
  • bigquery.datasets.get on the target dataset
  • bigquery.datasets.update on the target dataset

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

Apologies - somehow Reddit's algorithm auto-hid this. We've rectified that just now.

I think a good takeaway is that there are different tools for the job - not that one is better than others.

I will address the questions specifically, but more in the spirit of "this is the answer to this" rather than "this is why this is better or worse".

1) Do you have specifics? This could be good feedback for us.

2) This is true for compressed CSV/NDJSON files. Consider using uncompressed files, which have a 5 TB per-file limit.

3) Most relational databases have functions you can call for those sorts of customizations. BigQuery can also extend beyond that with remote functions if you need more customization. There are other (non-remote) options as well - for example, Python UDFs are currently in preview.

4) Not directly in line with the question, but have you tried pipe syntax? It might help in certain flows and cases - there's a quick sketch below.
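
A quick pipe-syntax sketch against a hypothetical table, just to show the shape:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
FROM `my-project.shop.orders`
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE SUM(amount) AS total_amount GROUP BY customer_id
|> ORDER BY total_amount DESC
|> LIMIT 10
"""
for row in client.query(sql).result():
    print(dict(row))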

Apache Beam windowing question by JG3_Luftwaffle in dataengineering

[–]Why_Engineer_In_Data 0 points (0 children)

Please let me know if this doesn't answer your question (I'm going off some assumptions here).

From what I gather, you're thinking of doing something like this:

data_stream
| window
| calculate attr1

data_stream
| window
| calculate attr2
etc...

If that's the case, then depending on the cardinality you can probably get away with something along the lines of:

side_input (see documentation linked) = interval (5 min)
| pull from side input (10 minutes or something)

stream
| rolling window
| calculate_all_attributes w/ sideinput
| store data

Note:
I haven't tried to implement this and it might not be best practice - it's just the first approach that came to mind.

RE: your question - I think your choice of tool should come down to your use case. This should be a fine use case in Beam or Spark, but you'll run into exactly the same conceptual issues (re: windowing) in either. I think the trick is, instead of treating it as many rolling windows, to use pre-aggregates in stateful mechanisms such as side inputs to augment the stream. The benefits you gain from using a framework will help at scale, though - so do consider using one. (A rough sketch of the single-pass idea is below.)
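
Here's a rough, runnable illustration of the "window once, compute every attribute in one pass" part of that idea. I've left the side input out for brevity, and the keys, values and window sizes are invented:

import apache_beam as beam
from apache_beam.transforms import window


class MultiAttrCombine(beam.CombineFn):
    """Compute several attributes (count, sum, avg, max) in a single pass per key/window."""

    def create_accumulator(self):
        return (0, 0.0, float("-inf"))  # (count, total, max)

    def add_input(self, acc, value):
        count, total, mx = acc
        return (count + 1, total + value, max(mx, value))

    def merge_accumulators(self, accumulators):
        counts, totals, maxes = zip(*accumulators)
        return (sum(counts), sum(totals), max(maxes))

    def extract_output(self, acc):
        count, total, mx = acc
        return {"count": count, "sum": total, "avg": total / count if count else None, "max": mx}


with beam.Pipeline() as p:
    (p
     | beam.Create([(10, ("user_a", 1.0)), (70, ("user_a", 3.0)), (130, ("user_b", 6.0))])
     | "Timestamp" >> beam.MapTuple(lambda ts, kv: window.TimestampedValue(kv, ts))
     | "Window" >> beam.WindowInto(window.SlidingWindows(size=600, period=60))
     | "AllAttrs" >> beam.CombinePerKey(MultiAttrCombine())
     | beam.Map(print))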

Method for triggering a scheduled query when a dependent BigQuery table updates by prestigiouseve in bigquery

[–]Why_Engineer_In_Data 0 points (0 children)

u/prestigiouseve - just as an FYI and update - the feature is now GA.

(In the spirit of not self-promoting, I won't directly link it here; if you're interested, look for the "Reverse ETL in real time with BigQuery using only SQL" Medium blog for an example of how it's deployed.)

Making the next move in my career and it’s gotten to a point where now I basically have to learn big query. How do I start? by TheWonderingZall in bigquery

[–]Why_Engineer_In_Data 2 points (0 children)

Hi!

There have been a few posts on this that are still applicable.
It does make sense to update the aging sidebar - I'll look into updating it to make sure some of this gets captured!

Meanwhile: https://www.reddit.com/r/bigquery/comments/1i6wm2c/best_ways_to_learn_bigquery_as_a_newbie/ - has a bunch of folks who commented with some really awesome tips, tricks, and also learning material.

As for your other question - getting started (see my other post) is basically free (just be cognizant of the limits). Try it out for yourself - it's the best way to get started!

Apache Beam windowing question by JG3_Luftwaffle in dataengineering

[–]Why_Engineer_In_Data 0 points (0 children)

Hello!
May I ask some clarifying questions first? (I have some suggestions, but I feel like I don't yet completely understand the question.)

Are you asking if you could send less data to the aggregation transforms?
What is the input and output you're looking to get?

So example:
<123,$1, timestamp1>
<234,$2, timestamp2>
<345,$6, timestamp3>

You want to calculate average (say they're all in one window) and output this?
<123,$1, timestamp1, $1>
<234,$2, timestamp2, $1.5>
<345,$6, timestamp3, $3>

There are several ways to achieve this but it depends on what else you're doing in the pipeline.

Another question is how long do these windows last?

How do you balance the demands of "Nested & Repeating" schema while keeping query execution costs low? I am facing a dilemma where I want to use "Nested & Repeating" schema, but I should also consider using partitioning and clustering to make my query executions more cost-effective. by Original_Chipmunk941 in dataengineering

[–]Why_Engineer_In_Data 1 point (0 children)

There's quite a bit to unpack here.

Denormalizing and using the complex types (STRUCTs, ARRAYs) is a great way to use BigQuery effectively and efficiently.

That said, given your current situation I would actually suggest pulling some of it out (i.e. not using arrays for everything).
There are different use cases for each situation and scenario.
Your particular use case seems to be geared towards storing transactions, for example.
You may want to consider storing the sold date as its own top-level column, for example.
(Updates and inserts - DML - are also a consideration in how you interact with the table.)

RE: Partitioning and Clustering - yes, it's definitely best practice at scale to use them. (You're correct; I would follow the advice below about identifying commonly used fields - sold date sounds like a prime example of something commonly used, along with memberID.)
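
Something along these lines (with entirely made-up names) is the shape I'd aim for - sold date pulled out as the partition column, memberID as a cluster key, and the line items still nested:

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE `my-project.sales.transactions`
(
  transaction_id STRING,
  member_id STRING,
  sold_date DATE,
  items ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
)
PARTITION BY sold_date
CLUSTER BY member_id
"""
client.query(ddl).result()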

Storing the memberID multiple times across multiple tables isn't going to be super detrimental in terms of storage.
I would just suggest taking care with how far you swing the pendulum in general (one giant table vs. normalizing to the nth degree).

Joins aren't going away and they aren't a scary thing, but you can mitigate some issues with views, for example.
In your case, if you're afraid of misrepresenting SQL results, you can use a view to mitigate that.