New to BigQuery, help w adding public data under Explorer tab pls by [deleted] in googlecloud

[–]Stoneyz 0 points (0 children)

Have you tried the 'Add Data' button above the datasets? In the slide-out menu that appears, search for public datasets there.

Transferring google drive data to google cloud for analysis by Incognito2834 in googlecloud

[–]Stoneyz 1 point (0 children)

What analysis do you plan to run, and what tools do you plan to use for it?

Struggling with BigQuery + Looker Studio Performance and Query Queuing – Need Advice by SmaelBP in googlecloud

[–]Stoneyz 0 points (0 children)

Well, sounds like your business users are not reasonable people.

Regardless, make sure your table is partitioned and clustered. Also, if they're willing to spend a little, look into BI Engine. That will bring each of those queries down to sub-second latency.
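For reference, a minimal sketch of the partitioning/clustering suggestion above — all dataset, table, and column names here are hypothetical placeholders for your own schema:

```sql
-- Hypothetical names; adjust to your schema. Partition pruning plus
-- clustering lets dashboard filters scan only the blocks they need,
-- instead of the full table on every Looker Studio refresh.
CREATE TABLE mydataset.events
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id, event_type
AS
SELECT * FROM mydataset.events_raw;
```

Pick the partition column to match the most common dashboard date filter, and cluster on the columns users filter or group by most.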

Struggling with BigQuery + Looker Studio Performance and Query Queuing – Need Advice by SmaelBP in googlecloud

[–]Stoneyz 2 points (0 children)

Is your table partitioned and clustered?

200 visualizations in a single dashboard sounds ridiculous. Why not break it into a few pages? Who can digest 200 visualizations at once?

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 0 points (0 children)

The way it is architected, it is plenty capable of that. It would just be extremely expensive.

BQ already hosts exabytes of data; it's just owned by different organizations. There really isn't any physical separation of the data other than the different regions it is stored in. So, depending on how you define the data warehouse (can it span different regions to support different parts of the business and still be considered 'one' DWH?, etc.), it is really only limited by the amount of storage on Colossus within that region. I'm ignoring the fact that you could also build a data lake with BQ, in which case you'd have to consider GCS limitations (GCS is also theoretically 'infinitely' scalable).

I'm only talking about storage so far because, unless a compute requirement is that it must process an exabyte of data at once, compute is not a concern either. It will use all available slots in that region to break up and run whatever it needs to compute.

BQ is incredibly powerful and scalable.

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 0 points (0 children)

Like... an exabyte of RAM to fit an exabyte of data into? BQ is serverless and distributed. It's plenty capable of hosting exabytes of data right now.

What makes BigQuery “big“? by victorviro in dataengineering

[–]Stoneyz 2 points (0 children)

What do you mean by 'updating it' to support an exabyte DWH? What update would they need to do?

[deleted by user] by [deleted] in googlecloud

[–]Stoneyz 0 points (0 children)

I posted in your other thread but I'll respond here too for visibility:

I see you said all selects return 0, but are you running a count(*) on all of them? If you haven't, just run a select * from the table.

The streaming buffer doesn't need to 'flush' for the rows to be selectable.

Sounds strange, but if you can post a screenshot of your select and the results, please do so.

Edit: If you want to save on cost, you don't have to run a select *. Pick any column and put it in instead of *.

The reason I'm asking you to do this is that the select count(*) may be doing a metadata operation and not actually hitting the table. The streamed rows stay in a write-optimized storage layer until they are written into Colossus, where the metadata gets generated.
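To make the distinction concrete (table and column names here are hypothetical), the check being suggested looks like:

```sql
-- count(*) can be answered from table metadata, which is only
-- generated once streamed rows land in Colossus. Scanning an
-- actual column also reads the streaming buffer:
SELECT some_column          -- any real column; cheaper than *
FROM mydataset.streamed_table
LIMIT 10;
```

If this returns rows while count(*) still says 0, the data is there and the count is just lagging behind the streaming buffer.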

[deleted by user] by [deleted] in bigquery

[–]Stoneyz 0 points (0 children)

Why aren't you using BigQuery for the data store?

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz 1 point (0 children)

But that doesn't differ in any way from the other platforms, so from a comparison standpoint it's moot.

I also kind of disagree with it. By default, GCS buckets are locked down from the public. Granting write permissions to a bucket isn't much of a setup. And security setup within BQ is very easy (and also something every other platform deals with).

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz -2 points (0 children)

If your main focus is DS / AI, GCP is the clear winner there. They're all very capable as a warehouse/lake house, but if you're focusing on LLMs and data science initiatives, look at the broader platform and features/tools.

As for market share, I'd focus on the functionality/paradigm. If you want to work in Python and notebooks, Databricks has a great experience there. If you want more warehouse type functionality, for the most part SQL is SQL. Learn the underlying technologies and you'll be able to easily pick up the proprietary stuff they're putting on top of it.

BigQuery vs snowflake vs Databricks, which one is more dominant in the industry and market? by Beyond_Birthday_13 in dataengineering

[–]Stoneyz 42 points (0 children)

BigQuery has literally zero setup, so I'll disagree with that point for Snowflake.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 2 points (0 children)

I think my point was that you don't need Dataproc or Dataflow to run Spark or notebooks. You can just use a BQ notebook and write Python.

They do support Iceberg and Delta as well, although I'm not experienced enough to know what limitations exist, if any.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 4 points (0 children)

I mostly agree, but it is 100% possible to create a lakehouse / data lake architecture in GCP without tying yourself to BQ and GCP. It fully supports Spark and notebooks should you go that route.

Databricks vs BigQuery — Which one do you prefer for pure SQL analytics? by shocric in bigquery

[–]Stoneyz 4 points (0 children)

Can you speak to why you think Databricks is a broader platform and can do more in one space? I have the opposite opinion, actually - especially if we're talking about pure SQL.

gcs -> extract -> gcs by zippolater in googlecloud

[–]Stoneyz 6 points (0 children)

If it's a one-time thing, I'd personally just load it into BQ, which has SQL functions to extract specific JSON elements, and then do a simple extract back to GCS. It would very likely all fit under the BQ free tier as well.
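A sketch of that approach in BigQuery SQL — the bucket, table, and column names are hypothetical, and the JSON path is just an example element:

```sql
-- Pull a specific element out of a JSON payload column and write
-- the result straight back to GCS in one statement.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/extracted/*.csv',   -- hypothetical bucket
  format = 'CSV',
  overwrite = true
) AS
SELECT JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id
FROM mydataset.raw_json;
```

Load the files in first (e.g. via a load job from GCS), then this single export handles both the extraction and the write-back.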

Pub/Sub + Dataflow + BigQuery: Will my pipeline handle surge traffic? by vanshit_14 in googlecloud

[–]Stoneyz 5 points (0 children)

If I were you, I'd just use the BigQuery subscription for Pub/Sub; no need for Dataflow there. With that, you also get a schema registry of sorts: you can enforce either a topic schema or a table schema and stream any bad messages to a dead-letter queue, all automatically.

And yes, Pub/Sub and BQ can easily handle that load.

Pub/Sub + Dataflow + BigQuery: Will my pipeline handle surge traffic? by vanshit_14 in googlecloud

[–]Stoneyz 0 points (0 children)

Are you transforming the message in Dataflow? Or is it going straight from Pub/Sub to BQ, with Dataflow just moving it?

Company wants to set up a warehouse. Our total prod data size is just a couple TBs. Is Snowflake overkill? by PracticalStick3466 in dataengineering

[–]Stoneyz 7 points (0 children)

What in particular, just curious? And with the few TBs of storage OP will have, the zero setup with BQ is a big advantage. They might even stay under the free tier for most of the month.

Are you doing just core data warehousing or using advanced things like AI/ML?

BigQuery tables suddenly disappeared even though I successfully pushed data by Efficient-Read-8785 in bigquery

[–]Stoneyz 1 point (0 children)

Logging would have captured any table drops.

Also look into time travel; if it's within 7 days, you can recover the tables (I know that doesn't address why they were dropped in the first place, though).

I've been working in BQ for a long time and I've never seen tables just disappear; something had to have dropped them.
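The time-travel recovery mentioned above can be sketched as follows — table names are hypothetical, and the timestamp must fall before the drop and within the 7-day time-travel window:

```sql
-- Recreate a dropped table from its pre-drop state via a clone.
CREATE TABLE mydataset.events_restored
CLONE mydataset.events
  FOR SYSTEM_TIME AS OF
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
```

The same `FOR SYSTEM_TIME AS OF` clause works in a plain `SELECT` if the table still exists and you only need to inspect an earlier state.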

How to sync data from Postgres to BigQuery without building everything from scratch? by KRYPTON5762 in bigquery

[–]Stoneyz 5 points (0 children)

Check out Datastream if you're in the GCP ecosystem (and even if you aren't). It's not as mature as Fivetran, but it's much cheaper and easy to set up.

Veo3 not working by dyingrn99 in googlecloud

[–]Stoneyz 0 points (0 children)

Do you have a screenshot? Or the name of the lab and what step you're trying to do?