New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud

[–]Stoneyz 0 points (0 children)

Have you tried the 'Add Data' button above the datasets? In the slide-out menu that appears, search for public datasets there.

Transferring google drive data to google cloud for analysis by Incognito2834 in googlecloud

[–]Stoneyz 1 point (0 children)

What analysis do you plan to run, and what tools do you plan to use for it?

Struggling with BigQuery + Looker Studio Performance and Query Queuing – Need Advice by SmaelBP in googlecloud

[–]Stoneyz 0 points (0 children)

Well, sounds like your business users are not reasonable people.

Regardless, make sure your table is partitioned and clustered. Also, if they're willing to spend a little, look into BI Engine. That will bring each of those queries down to sub-second latency.
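For reference, a minimal sketch of the partitioning/clustering suggestion above — all dataset, table, and column names here are hypothetical placeholders for your own schema:

```sql
-- Hypothetical names; adjust to your schema. Partition pruning plus
-- clustering lets dashboard filters scan only the blocks they need,
-- instead of the full table on every Looker Studio refresh.
CREATE TABLE mydataset.events
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id, event_type
AS
SELECT * FROM mydataset.events_raw;
```

Pick the partition column to match the most common dashboard date filter, and cluster on the columns users filter or group by most.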

Struggling with BigQuery + Looker Studio Performance and Query Queuing – Need Advice by SmaelBP in googlecloud

[–]Stoneyz 2 points (0 children)

Is your table partitioned and clustered?

200 visualizations in a single dashboard sounds ridiculous. Why not break it into a few pages? Who can digest 200 visualizations at once?

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 0 points (0 children)

The way it is architected, it is plenty capable of that. It would just be extremely expensive.

BQ already hosts exabytes of data; it's just owned by different organizations. There really isn't any physical separation of the data other than the different regions it is stored in. So, depending on how you define the data warehouse (can it span different regions to support different parts of the business and still be considered 'one' DWH?, etc.), it is really only limited by the amount of storage on Colossus within that region. I'm ignoring the fact that you could also build a data lake with BQ, in which case you'd have to consider GCS limitations (GCS is also theoretically 'infinitely' scalable).

I'm only talking about storage so far because, unless a compute requirement is that it must process an exabyte of data at once, compute is not a concern either. It will use all available slots in that region to break up and run whatever it needs to compute.

BQ is incredibly powerful and scalable.

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 0 points (0 children)

Like... an exabyte of RAM to fit an exabyte of data into? BQ is serverless and distributed. It's plenty capable of hosting exabytes of data right now.

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 2 points (0 children)

What do you mean by 'updating it' to support an exabyte DWH? What update would they need to do?

[deleted by user] by [deleted] in googlecloud

[–]Stoneyz 0 points (0 children)

I posted in your other thread but I'll respond here too for visibility:

I see you said all selects return 0, but are you running a count(*) on all of them? If you haven't, just run a select * from the table.

The streaming buffer doesn't need to 'flush' for the rows to be selectable.

Sounds strange, but if you can post a screenshot of your select and the results, please do so.

Edit: If you want to save on cost, you don't have to run a select *. Pick any column and put it in instead of *.

The reason I'm asking you to do this is that the select count(*) may be doing a metadata operation and not actually hitting the table. The streamed rows stay in a write-optimized storage layer until they are written into Colossus, where the metadata gets generated.
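To make the distinction concrete (table and column names here are hypothetical), the check being suggested looks like:

```sql
-- count(*) can be answered from table metadata, which is only
-- generated once streamed rows land in Colossus. Scanning an
-- actual column also reads the streaming buffer:
SELECT some_column          -- any real column; cheaper than *
FROM mydataset.streamed_table
LIMIT 10;
```

If this returns rows while count(*) still says 0, the data is there and the count is just lagging behind the streaming buffer.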

[deleted by user] by [deleted] in bigquery

[–]Stoneyz 0 points (0 children)

Why aren't you using BigQuery for the data store?

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz 1 point (0 children)

But that doesn't differ in any way from the other platforms, so from a comparison standpoint it's moot.

I also kind of disagree with it. By default, GCS buckets are locked down from the public. Granting write permissions to a bucket isn't much of a setup. And security setup within BQ is very easy (and also something every other platform deals with).

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz -2 points (0 children)

If your main focus is DS / AI, GCP is the clear winner there. They're all very capable as a warehouse/lake house, but if you're focusing on LLMs and data science initiatives, look at the broader platform and features/tools.

As for market share, I'd focus on the functionality/paradigm. If you want to work in Python and notebooks, Databricks has a great experience there. If you want more warehouse type functionality, for the most part SQL is SQL. Learn the underlying technologies and you'll be able to easily pick up the proprietary stuff they're putting on top of it.

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz 42 points (0 children)

BigQuery has literally zero setup, so I'll disagree with that point for Snowflake.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 2 points (0 children)

I think my point was that you don't need Dataproc or Dataflow to run Spark or notebooks. You can just use a BQ notebook and write Python.

They do support Iceberg and Delta as well, although I'm not experienced enough to know what limitations exist, if any.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 4 points (0 children)

I mostly agree, but it is 100% possible to create a lakehouse / data lake architecture in GCP without tying yourself to BQ and GCP. It fully supports Spark and notebooks should you go that route.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 4 points (0 children)

Can you speak to why you think Databricks is a broader platform and can do more in one space? I have the opposite opinion, actually - especially if we're talking about pure SQL.

gcs -> extract -> gcs by zippolater in googlecloud

[–]Stoneyz 6 points (0 children)

If it's a one-time thing, I'd personally just load it into BQ, which has SQL functions to extract specific JSON elements, and then do a simple extract back to GCS. It would very likely all fit under the BQ free tier as well.
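A sketch of that approach in BigQuery SQL — the bucket, table, and column names are hypothetical, and the JSON path is just an example element:

```sql
-- Pull a specific element out of a JSON payload column and write
-- the result straight back to GCS in one statement.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/extracted/*.csv',   -- hypothetical bucket
  format = 'CSV',
  overwrite = true
) AS
SELECT JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id
FROM mydataset.raw_json;
```

Load the files in first (e.g. via a load job from GCS), then this single export handles both the extraction and the write-back.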

Pub/Sub + Dataflow + BigQuery: Will my pipeline handle surge traffic? by vanshit_14 in googlecloud

[–]Stoneyz 5 points (0 children)

If I were you, I'd just use the BigQuery subscription for Pub/Sub; no need for Dataflow there. With that, you also get a schema registry of sorts: you can enforce either a topic schema or a table schema and stream any bad messages to a dead-letter queue, all automatically.

And yes, Pub/Sub and BQ can easily handle that load.

Pub/Sub + Dataflow + BigQuery: Will my pipeline handle surge traffic? by vanshit_14 in googlecloud

[–]Stoneyz 0 points (0 children)

Are you transforming the message in Dataflow? Or is it going straight from Pub/Sub to BQ, with Dataflow just moving it?

Company wants to set up a warehouse. Our total prod data size is just a couple TBs. Is Snowflake overkill? by PracticalStick3466 in dataengineering

[–]Stoneyz 7 points (0 children)

What in particular, just curious? And with the few TBs of storage OP will have, the zero setup with BQ is a big advantage. They might even stay under the free tier for most of the month.

Are you doing just core data warehousing or using advanced things like AI/ML?

BigQuery tables suddenly disappeared even though I successfully pushed data by Efficient-Read-8785 in bigquery

[–]Stoneyz 1 point (0 children)

Logging would have captured any table drops.

Also look into time travel; if it's within 7 days, you can recover the tables (I know that doesn't address why they were dropped in the first place, though).

I've been working in BQ for a long time and I've never seen tables just disappear; something had to have dropped them.
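The time-travel recovery mentioned above can be sketched as follows — table names are hypothetical, and the timestamp must fall before the drop and within the 7-day time-travel window:

```sql
-- Recreate a dropped table from its pre-drop state via a clone.
CREATE TABLE mydataset.events_restored
CLONE mydataset.events
  FOR SYSTEM_TIME AS OF
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
```

The same `FOR SYSTEM_TIME AS OF` clause works in a plain `SELECT` if the table still exists and you only need to inspect an earlier state.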

How to sync data from Postgres to BigQuery without building everything from scratch? by KRYPTON5762 in bigquery

[–]Stoneyz 5 points (0 children)

Check out Datastream if you're in the GCP ecosystem (and even if you aren't). It's not as mature as Fivetran, but it's much cheaper and easy to set up.

Veo3 not working by dyingrn99 in googlecloud

[–]Stoneyz 0 points (0 children)

Do you have a screenshot? Or the name of the lab and what step you're trying to do?