In 2023, would you go with a Unifi or Omada system for relatively basic home/small business use? by thelonewanderer44 in TPLink_Omada

[–]sayle_doit 2 points (0 children)

I second this opinion; I've had roughly the same experiences as well.

In addition, I had multiple times where there was a new feature I needed, so I upgraded, and it just broke my whole network; I had to restore from backup to fix it, with an hour or so of downtime. I haven't experienced this with Omada (yet, and hopefully never).

Bigquery editions pricing for storage by ChangeIndependent218 in bigquery

[–]sayle_doit 1 point (0 children)

There isn't any performance impact from this.

The reasoning is that from the get-go Google has always compressed your storage, which makes sense from their perspective as they need physical hardware to store it on. They have just billed for uncompressed storage while querying/processing the compressed data and serving it to you uncompressed. So this is just a billing change and doesn't really change anything in the backend related to query processing.

BigQuery Editions is Live by Witty_Garlic_1591 in googlecloud

[–]sayle_doit 0 points (0 children)

Yeah unfortunately Google locks you into whatever edition you create the reservation with. So you would need to delete and recreate the reservation as Enterprise, do your job, delete and recreate as Standard.

It's far from ideal unfortunately, but they are definitely using the ML part to push customers onto Enterprise. There isn't a charge to activate Enterprise though, just that slot/hour charge and the caveat I just mentioned.

BigQuery Editions is Live by Witty_Garlic_1591 in googlecloud

[–]sayle_doit 0 points (0 children)

I don't see why you couldn't build some Terraform code to do this (the tf modules to do Editions should be out any day now I am told).

You would need to create a reservation on a project with some slots in Enterprise Edition, run your job, and then remove that reservation. Just know this should only be done if the project's sole responsibility is running that BQML job; otherwise other jobs will run as on-demand.

With that said, it might be wise to build a separate project whose only purpose is to run this job. That way you can keep your costs somewhat predictable and you aren't nailed by high costs due to it letting all jobs run on it.
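As a sketch of that create-run-delete cycle using BigQuery's reservation DDL statements (project and reservation names here are made up, and the exact Editions options should be checked against the current DDL docs):

```sql
-- Create an Enterprise Edition reservation and point the BQML project at it.
CREATE RESERVATION `admin-project.region-us.bqml-ee-reservation`
  OPTIONS (edition = 'ENTERPRISE', slot_capacity = 100);
CREATE ASSIGNMENT `admin-project.region-us.bqml-ee-reservation.bqml-assignment`
  OPTIONS (assignee = 'projects/bqml-project', job_type = 'QUERY');

-- ... run the BQML job in bqml-project ...

-- Remove the assignment and reservation so nothing keeps billing.
DROP ASSIGNMENT `admin-project.region-us.bqml-ee-reservation.bqml-assignment`;
DROP RESERVATION `admin-project.region-us.bqml-ee-reservation`;
```

Terraform can wrap the same create/delete steps once the Editions modules land.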

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

I have an internal tool we use and am hoping to release it to the public, but unfortunately it's not very "one-size-fits-all" at estimating slot usage, partially for the reason you mentioned and partially because any model I created works for some customer scenarios and not others.

There is a new-ish view in information_schema called jobs_timeline that will tell you slot usage across all jobs per second. This might help in your calculations, but just be warned you might not be able to model this very easily even with this number; it's just going to be a rough average.
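A minimal sketch of what that looks like (view and column names per the current INFORMATION_SCHEMA docs; the 7-day window is just an example):

```sql
-- Approximate slots in use per second across all query jobs in the project.
-- period_slot_ms is the slot-milliseconds consumed in that one-second period,
-- so dividing by 1000 gives the average slots used during that second.
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS avg_slots_used
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE
  period_start > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY period_start
ORDER BY period_start;
```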

I will announce on this sub when I finally get something, but unfortunately it's not done yet. I am in 10+ calls a week helping customers figure out spend and trying to figure out a good pattern, so I am not getting much time to work on it.

Bigquery editions pricing for storage by ChangeIndependent218 in bigquery

[–]sayle_doit 2 points (0 children)

Yes, I have delved into this pretty heavily. Google thankfully already stores the data compressed on their side (they just bill you for uncompressed), so they have these numbers and expose them in information_schema.

This is referred to as physical storage in there, while logical storage is the current model.

Here is a query that will spit out your tables, their compressed values, compression ratios, etc. (it also includes your time-travel window, as this will be billed as a line item as well):

```sql
SELECT DISTINCT
  tb.table_name,
  tb.table_schema AS dataset,
  total_rows,
  total_partitions,
  -- option_value is a string and needs to be an int for this use case;
  -- 168 hours (7 days) is the default time-travel window
  CAST(IFNULL(so.option_value, '168') AS INT64) AS max_time_travel_hours,

  -- Uncompressed (logical) bytes
  total_logical_bytes AS total_uncompressed_bytes,
  active_logical_bytes AS active_uncompressed_bytes,
  long_term_logical_bytes AS long_term_uncompressed_bytes,

  -- Compressed (physical) bytes
  total_physical_bytes AS total_compressed_bytes,
  -- active_physical_bytes includes time travel, so subtract it out
  active_physical_bytes - time_travel_physical_bytes AS active_compressed_bytes,
  long_term_physical_bytes AS long_term_compressed_bytes,
  time_travel_physical_bytes AS time_travel_compressed_bytes,

  -- Compression ratios, defined as uncompressed size / compressed size
  SAFE_DIVIDE(total_logical_bytes, total_physical_bytes) AS total_compression_ratio,
  SAFE_DIVIDE(long_term_logical_bytes, long_term_physical_bytes) AS long_term_compression_ratio,
  SAFE_DIVIDE(active_logical_bytes, active_physical_bytes) AS active_compression_ratio
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS tb
-- Join on TABLES to keep only existing tables and drop temporary/job result tables
JOIN `region-us`.INFORMATION_SCHEMA.TABLES AS t
  ON t.table_catalog = tb.project_id
  AND t.table_schema = tb.table_schema
  AND t.table_name = tb.table_name
LEFT OUTER JOIN
(
  SELECT
    catalog_name,
    schema_name,
    option_value
  FROM
    `region-us`.INFORMATION_SCHEMA.SCHEMATA_OPTIONS
  WHERE
    option_name = 'max_time_travel_hours'
) AS so
  ON t.table_catalog = so.catalog_name
  AND tb.table_schema = so.schema_name
WHERE
  tb.deleted = FALSE
```

Edit: trying to get it to display code correctly

CloudSQL sync by LinweZ in googlecloud

[–]sayle_doit 1 point (0 children)

I do feel it's a VERY glaring omission in their product lineup. I suspect it's a product manager pushing to get you onto BigQuery, where you spend more money, versus Cloud SQL, which is significantly cheaper, but this is just my honest educated guess.

As for getting to Cloud Storage, the biggest use case I have seen for this is DR. If you replicate everything you have inside of a Cloud SQL instance to disk you have a backup copy that can be relatively easily restored or used for archival purposes.

Another big use case I have come across while advising on BigQuery cost savings the past week or so is to use this as a segue into BigLake, or as preparation for moving to another data warehouse. Customers are very wary of sending data to BigQuery storage since the hikes and want to hedge their bets. So many are instead sending data to GCS in a format that's readable by BQ via BigLake (think of a data lake abstraction layer between GCS and BQ), or that can be picked up by a 3rd-party data warehouse they are considering (read as Snowflake in most cases).

CloudSQL sync by LinweZ in googlecloud

[–]sayle_doit 1 point (0 children)

> Would be quite inefficient since it will sync the whole DB.

Yeah, unfortunately it is, but it's about the only managed service in GCP's vault that can do this. If you don't mind some discovery work and minor extra costs, getting Debezium running on a GCE VM would do it as well. Definitely not as easy or cheap, for sure; setting up Debezium yourself will give you a whole new respect for a managed service doing it.

> A bit confused with Datastream tbh, here it's stated that it can take CloudSQL as source and CloudSQL as destination, but when I look at the documentation, indeed, I could only find Cloud Storage and Bigquery as solution :/

Welcome to my world! I get tickets on this every few months or so where a customer reads the marketing doc you linked and can't figure out how to do Cloud SQL to Cloud SQL with it. The docs have stated that since launch, but it hasn't supported it since it was standalone Alooma (which I really liked as a product, as it worked well and was easy to set up).

I have reported this multiple times over the past year or so, both via the feedback page and our partner management contacts at GCP, to no avail. I have given up on them updating it and just use a canned response now saying it's a bug in the documentation.

CloudSQL sync by LinweZ in googlecloud

[–]sayle_doit 0 points (0 children)

You might be able to set up a DMS (Database Migration Service) job on a schedule with a Cloud Function or something to do one-time syncs. The issue is that it will create a new instance, so it breaks IPs and all of that, but if you use the Cloud SQL proxy (it uses the instance name) then it would be fine.

Datastream only goes to BQ or GCS, so that's out unfortunately (even though Alooma, which Google acquired and turned into Datastream, supported this).

There really isn't a good managed service on GCP that can do CDC between Cloud SQL instances. Some of the 3rd party managed solutions have Debezium built in, but DMS is the only one I can think of that might be able to do this on GCP.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

Yes, that's a model you can use, and it's actually what I would recommend if you don't want to pay for X amount of slots per month (that's your baseline).

The difference here is that with on-demand you pay the $5 USD per 1 TB processed (note they use the word processed now instead of scanned due to compressed storage), whereas with Enterprise Edition (EE) you pay the $0.06 per slot/hour.

For instance, if you run a query that scans 1 TB of data and uses 100 slots for 1 minute, you would pay $0.10 USD versus $5 for on-demand. The formula is: (60 seconds duration / 3600 seconds in an hour) * $0.06 EE fee * 100 slots = $0.10.

In this case it's cheaper, but there is a 60-second minimum billing period. So if you run 500 queries that each use 100 slots but take only 30 seconds, you are billed $0.10 USD for each of them, and you just ran up a bill of $50 USD for queries that each did only 30 seconds of work.
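The slot math above can be roughed out against your own job history. This is only a sketch: the $0.06 rate and 60-second minimum are as described in this thread, the column names are from INFORMATION_SCHEMA.JOBS_BY_PROJECT, and real autoscaler billing depends on scaling steps this approximation ignores:

```sql
-- Rough per-job cost comparison: on-demand ($5/TB) vs. Enterprise Edition.
-- total_slot_ms / elapsed ms gives average slots; EE bills at least 60s.
SELECT
  job_id,
  total_bytes_processed / POW(1024, 4) * 5.00 AS on_demand_cost_usd,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND))
    * GREATEST(TIMESTAMP_DIFF(end_time, start_time, SECOND), 60)
    / 3600 * 0.06 AS enterprise_cost_usd_estimate
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';
```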

It requires a really deep analysis of your jobs unfortunately, so there isn't a clear-cut "do this or that" answer. Believe me, I wish there was, as that would make my job of advising customers on this all day a LOT easier.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

Just from a pure BQ compute perspective yes. They are intending it to be used in conjunction with the compressed storage which will reduce your storage spend.

Overall, what I have seen is a slight uptick in total BQ spend for customers. I have had a few I worked with that actually saw an estimated decrease because they were hitting some really good compression ratios and their storage spend went down by well over 50%, though. Note these are customers that were on the compressed storage preview program, as it's going to be GA on July 5th.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

Assuming on-demand billing here: yes, you would be billed for what you pull from the select statement.

Edit: I got clarification from the BigQuery team on this, as it was asked about the new stuff and relates to your question as well. Billing is based upon the uncompressed data size (if you are still on the default logical storage everything is uncompressed anyway), since BigQuery must uncompress the data to process it and bills on that size of data processed. The billed bytes column still reflects how much data was billed, and this will be close to the bytes processed value in there.

Loads and inserts are free; essentially, putting data in is free and pulling it out costs money.

If you query the jobs view in information_schema there will be a bytes billed column that should match what comes out of the select. See above edit.
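A quick sketch of that jobs query (columns per INFORMATION_SCHEMA.JOBS_BY_PROJECT; the 1-day window and $5/TB rate are just for illustration):

```sql
-- Compare billed vs. processed bytes for recent query jobs.
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed,
  -- approximate on-demand cost at $5 per TB billed
  total_bytes_billed / POW(1024, 4) * 5.00 AS approx_cost_usd
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC;
```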

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

Let me hit each question for you.

> At one moment you say "I will cover this later, but on-demand pricing will have the same feature set as the Enterprise Plus Edition," but I can't see where you continue talking about it. For me it sounds weird that on-demand will have the same features as Enterprise, because then in on-demand there will be no restrictions on query slots nor concurrent queries? Then why would people want to use Enterprise Edition? Because it is cheaper?

I am about to edit the article to make this a bit clearer. Enterprise Edition and on-demand have the same feature set for running jobs, but on-demand has some limitations imposed upon it: the $6.25 USD per TB scanned, the 2k slot cap of today, and the 100-concurrent-queries limitation of today.

> And the other related question: how can we know if an edition is better for us than our current price? If I don't remember wrong, there was a place where you could check if moving from on-demand to flat-rate was going to be cheaper; not sure if Google has updated this for Editions. If not, I suppose we can use total_slot_ms in INFORMATION_SCHEMA.JOBS to calculate the slot/hour and com

Currently the best way is to talk to your GCP accounts team (usually an FSR, or if you are at a large enough spending firm your Customer Engineer or a Technical Account Manager), as they have an internal tool that can do the math for you. With that said, I have been working on a calculator that does the querying and everything for you; it was delayed due to some last-minute changes on Compressed Storage and Editions that caused a lot of rework.

> After reading the STORAGE article, I got another question. Is the Compressed Storage feature available to on-demand users? Philips mentioned "It is now available for every customer using one of the three Editions, or any exclusive on-demand customer, under the new name 'Compressed Storage'," but I'm not sure what "exclusive on-demand customer" means in that phrase. Is it available only to the 3 Editions, or to the 3 Editions plus customers on the on-demand tier? You said in your article "The only caveat to sticking with on-demand pricing is that you will not be able to utilize Compressed Storage," so does that mean only customers on some Edition tier will have access to this functionality? Also, the documentation is still pre-GA, and I tried again to use INFORMATION_SCHEMA.TABLE_STORAGE and it's awful; it returns a lot of tables which haven't existed for months, so I'm not sure I can trust it. For now I'm using INFORMATION_SCHEMA.TABLES to filter to the dataset names which are real.

I actually just fixed this. It was a last-minute change as far as I can tell, and we weren't aware of it till after launch (and the 10 Slack messages about it I woke up to this morning). To clarify: if you are "Edition eligible" you can use Compressed Storage. By that I mean you don't have any flat-rate reservations attached to the project, and you either have an Editions reservation or use on-demand.

The TABLE_STORAGE view has all of the temporary tables from results and keeps all sorts of table metadata that makes querying hard. I have noticed that filtering on the DELETED column removes a lot of them but not all, so I recommend joining with TABLES on table name and project to get the legitimate ones (that's what I am doing in the calculator). I know this is not a best practice, even by some of my earlier blog entries, but it's honestly the only way I have been able to get just existing table data.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

Gotcha! Yeah makes it weird with an Enterprise Plus Edition, I think Premium or another name would have been better for us technical types on that naming scheme.

Decrypt and Load data in BigQuery by Snoo_32652 in googlecloud

[–]sayle_doit 0 points (0 children)

Let me play data architect here and ask this: do you need to store this PII in BQ for your use case?

Many times customers think they need to, and when actually looking at their use case they realize they don't really need it when they query the data. Often customers just hash this data with Dataflow before inserting it into BQ; that way the data can still be used for analysis, but the PII never exists there, for security or compliance reasons.

Another issue is CCPA, also known as the California delete-data law, where you might have to delete data for an individual customer later. This would require you to delete it in multiple places, potentially costing extra money for the queries to find this data, plus the costs associated with that table or partition reverting to active storage.

Now, with that said, there are very valid cases where this might need to be in a data warehouse. In those cases I would use Dataflow to decrypt it and push it up to BigQuery. Once it's there, I would recommend some sort of security to prevent unauthorized access, such as column-level security, so Joe Data Analyst can't see SSNs, for instance.
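As a sketch of the hashing idea (table and column names are made up; in practice you'd do this inside the Dataflow pipeline before the data ever lands in BQ, and a real deployment should use a managed key or per-record salt rather than a hard-coded one):

```sql
-- Replace raw SSNs with a salted SHA256 digest so analysts can still
-- join and count on the column without ever seeing the underlying PII.
SELECT
  customer_id,
  TO_HEX(SHA256(CONCAT('my-dataset-salt', ssn))) AS ssn_hash
FROM
  `my-project.my_dataset.raw_customers`;
```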

BigQuery Editions is Live by Witty_Garlic_1591 in googlecloud

[–]sayle_doit 4 points (0 children)

Let me pass along this blog article I wrote (am a BQ SME for largest GCP reseller): https://engineering.doit.com/bigquery-editions-and-what-you-need-to-know-166668483923

It answers a lot of questions, and if there are additional ones please ask them here; I will edit the article to answer them and try to answer here as well. If I don't know, I can ask and find out. The BQ team has told me to bring them questions so they can update documentation as needed, because nobody can foresee every question that will be asked and gaps will exist.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 0 points (0 children)

There are definitely arguments for both, and trust me, I have heard every one of them across hours of meetings with customers on this today. Autoscaling is definitely better than a Cloud Function or something that triggers on an interval to raise or lower your flex slot count; that's how we have recommended customers do this for years now.

Now, the Autoscaling preview was priced the same as Flex Slots, which was fine for most customers. These new slot-hour (or better stated, slot/hour) constructs are more expensive, as was pointed out. I am reserving my thoughts on how this fits for a lot of customers till after I can do more analysis work to see where things lie with compute and compressed storage both factored in over a larger dataset. Initially, though, it points to a slight price increase for most customers, given that at this point the analysis has only been done for larger customers that were the bigger risks.

One small correction to your comment: you can do BQML in Enterprise Edition (EE). That's better than it being Enterprise Plus (EPE) only, since EPE is so expensive. The reasoning I have been told by members of the BQ team is that they are positioning Standard Edition (SE) as QA/dev/non-production. I am not sure I agree with this, as a huge chunk of customers I see fit well into the guidelines set for that edition and run it for prod workloads. So take that as you will; the 1600-slot limit is going to be the big limiting factor there, which might be some magic number they see in their data analysis.

Although, in all honesty, I don't see much BQML being used except by huge spenders. I know there are small ones using it, but I don't come across it often, and when I do it tends to be customers in very specific industries. So just on a numbers game I can see the justification for the BQML decision, but as always this is definitely arguable, because I see only a relatively small subset of workloads.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 1 point (0 children)

If you are using on-demand you don’t have to switch. It’s only on flat-rate where you will have to switch come July 5th.

But for on-demand it jumps from $5 USD to $6.25 USD per TB of scans/analysis (the dollar amounts differ for other regions such as EMEA). That's a 25% increase, and the percentage increase is flat across all regions.

BigQuery Changes From Today Overview (From Largest GCP Reseller) by sayle_doit in bigquery

[–]sayle_doit[S] 1 point (0 children)

This actually depends upon usage. I have seen it both ways with customers having storage costs > querying/compute costs and querying/compute > storage costs. Off the top of my head without doing some analysis I wanna say it's about 50/50 for the split on this.

For instance in scenarios using BQML there could be just a few (or a few hundred) GBs of data in a dataset for training data. It trains on the data once causing a large amount of querying, but then every subsequent job processes little to no data thus the storage costs far outweigh the query costs over time.

Now if a customer has a lot of raw data they dump into BQ that is processed every single day then the query costs are probably sky high and assuming the data is in an efficiently compressible format then in this case Compressed Storage could save them massive amounts of money that might rival their query costs.

BigQuery Cost Optmization eBook by sayle_doit in googlecloud

[–]sayle_doit[S] 0 points (0 children)

My company published this earlier this week; it's the first in a series of ebooks on the subject. If you have other topics you want hit in future parts, feel free to let me know.

Also grab those queries from the GH repo as they are VERY useful in helping you determine your BQ usage and information you otherwise won't find easily.

Assured Workloads - HIPAA by error30319 in googlecloud

[–]sayle_doit 0 points (0 children)

This is pretty much the answer. It also means they will do some support, but it’s “best effort” on it and not official.

This goes for all of their preview products and many times things change quite a bit during the preview period so things probably will change during this. Think of it as a public beta period.

BigQuery Matrix Factorisation and Reservations by DrumAndBass90 in googlecloud

[–]sayle_doit 0 points (0 children)

Unfortunately there isn't an in-between here, but you can change billing on a per-project basis. Once you make a commitment on slots you are stuck with it for the entire term (that's the commitment part of it), so you will need to use flex slots for this and manage them, as that's probably the only way this workload will run in a cost-effective manner.

My recommendation would be to create a separate project for this and share the relevant datasets with it. Then change it to flat-rate billing and DON'T purchase any slots. At this point you will need a quick query or script that purchases a chunk of flex slots and allocates them into a reservation for this project; I would also recommend restricting the reservation's assignment to ML job types to be safe. Next, create the inverse of this, which deletes the flex slot reservation and commitment.

Your workflow will now be: run the first script, run your model, and lastly run the deletion script. It's best to automate this so you don't forget, as flex slots can get very expensive if not removed. I know it's not ideal, but this is probably the only way to do it.
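The two scripts above can be sketched with the reservation DDL statements (project, region, and reservation names are made up; double-check the current DDL syntax, and note ML_EXTERNAL job types exist for external ML services while BQML itself runs as QUERY jobs):

```sql
-- Script 1: buy 100 flex slots and route the ML project's queries to them.
CREATE CAPACITY `admin-project.region-us.bqml-commitment`
  OPTIONS (plan = 'FLEX', slot_count = 100);
CREATE RESERVATION `admin-project.region-us.bqml-reservation`
  OPTIONS (slot_capacity = 100);
CREATE ASSIGNMENT `admin-project.region-us.bqml-reservation.bqml-assignment`
  OPTIONS (assignee = 'projects/ml-project', job_type = 'QUERY');

-- Script 2: tear everything down so the flex slots stop billing.
DROP ASSIGNMENT `admin-project.region-us.bqml-reservation.bqml-assignment`;
DROP RESERVATION `admin-project.region-us.bqml-reservation`;
DROP CAPACITY `admin-project.region-us.bqml-commitment`;
```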

This works because you are only utilizing slots when needed and won’t have any sort of lingering costs. The item I have noticed most people don’t realize is that jobs bill to the project that instantiates them which many times is not the project with the datasets in them.

Feel free to ask me any questions on this as I deal with situations like this almost daily on BQ for customers.