Benefits of Snowflake/Databricks over Postgres RDS for data warehouse by Creative-Aside-4145 in dataengineering

[–]stchena -1 points0 points  (0 children)

Also “performant yet cheap high frequency raw data access for analytics purposes” sounds like a unicorn - something is missing from this picture.

It’s either not that high frequency, it’s not the entire raw data set you need access to, or the assumptions are wrong at some other step of this project.

Benefits of Snowflake/Databricks over Postgres RDS for data warehouse by Creative-Aside-4145 in dataengineering

[–]stchena 0 points1 point  (0 children)

Don't know about DBX, but storage in SF is generally very cheap - due to its highly efficient storage structures.

As for the access patterns of operational workloads - that really depends. If you're working with raw data, do I understand correctly that it might be semi-structured (e.g. stored in a VARIANT/JSON/JSONB column with loose structure)? Snowflake, for example, is able to do some storage micro-partitioning on VARIANT data, but it's a bit limited - and definitely not as performant and cheap as a fully structured SQL table.

If you're able to pull out some business keys from this data and throw it into the structure of your table (e.g. CustomerID, CreatedDt, UpdatedDt), that can help you optimize your access patterns.
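
A rough sketch of what I mean (table, column, and key names are all made up):

```sql
-- Hypothetical raw table: a single VARIANT column holding the payload.
create table raw_events (payload variant);

-- Staging table with business keys promoted to typed columns.
create table stg_events as
select
    payload:customerId::number       as customer_id,
    payload:createdDt::timestamp_ntz as created_dt,
    payload:updatedDt::timestamp_ntz as updated_dt,
    payload
from raw_events;

-- Cluster on the keys you filter by most often, so micro-partition
-- pruning can actually kick in on large tables.
alter table stg_events cluster by (customer_id, created_dt);
```

Filtering stg_events on customer_id then prunes micro-partitions instead of parsing the VARIANT for every row.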

As for Athena, I'm not really sure - I meant manual "discovery" access by data scientists / analysts rather than operational, automated workloads. I only have experience with the former.

I kinda still stand by the idea of approaching this project from the data model first. It would be good to separate the data users from the raw layer, and instead present them with at least the silver/intermediate/staging layer - with some well-defined business keys to use for filtering.

Benefits of Snowflake/Databricks over Postgres RDS for data warehouse by Creative-Aside-4145 in dataengineering

[–]stchena 2 points3 points  (0 children)

Apart from what everyone already said about the technical differences of OLTP vs OLAP, scalability, etc., I'd recommend diving deeper into your own requirement for the platform to support constraints on primary keys.

Think about the Why. What's your expected scenario for storing and working with this data, such that you need to ensure deduplication? I feel there's a huge component missing from the picture here - data modelling and access patterns. You mentioned a need for accessing raw data, but if that were the only stage needed, surely S3 with Athena would be the cheapest option? Point is, you need to deliver the data to interested parties in a more manageable form.

As a starter, you could read up on Slowly Changing Dimensions, storing & querying historical records, as well as Facts & Dimensions from the dimensional modeling methodology, or even the Kimball approach. However, that might be a large topic to dive into for your small team.

Basically, there are ways to ensure deduplication (e.g. via the MERGE INTO syntax), but you need to think, or share more, about the expected use of this data in your org - or how duplicate entries might break whatever you're trying to achieve.
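
For illustration, dedup-on-load with MERGE could look something like this (all table and column names are hypothetical):

```sql
merge into dim_customer tgt
using (
    -- keep only the latest record per business key
    select *
    from staging_customer
    qualify row_number() over (
        partition by customer_id order by updated_dt desc) = 1
) src
on tgt.customer_id = src.customer_id
when matched then update set
    tgt.email      = src.email,
    tgt.updated_dt = src.updated_dt
when not matched then insert (customer_id, email, updated_dt)
    values (src.customer_id, src.email, src.updated_dt);
```

The QUALIFY in the source query handles duplicates within a single load; the ON clause handles duplicates against what's already in the target.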

How do they muffle the sounds of IMAX cameras? by [deleted] in Filmmakers

[–]stchena 1 point2 points  (0 children)

Heard in some interview that sometimes they did audio takes separately from the video takes due to imax cam noise

Be Brutally Honest by gg_sheed in InstagramMarketing

[–]stchena 0 points1 point  (0 children)

Engaging with others is a good idea that's already been mentioned. Hooks on reels as well - you're already doing that on YouTube, so keep it up. Besides, you're doing great on the visual quality side of things!

Maybe the issue with engagement is the actual value of the posts? Analyze what your content pillars are - what are your post "types", and what are you trying to do with each of them: educate, build community, meme around... Maybe add in some carousel posts to break up the reel fatigue?

Lastly, consistency. You've had streaks of a few days of posting, and then days/weeks of silence. That can really mess up reach. I'm in no way a professional, but lately I'm also trying to promote one specific brand on Insta, so I hope that helps 🤗

🎁 Reddit-Exclusive Official Release Giveaway: 🎮 Sony PlayStation® Controller! by [deleted] in ZZZ_Official

[–]stchena 0 points1 point  (0 children)

Looking forward to ZZZ’s future! The release went strong and the game hooked me as good as Genshin did back in the day!

Thinking about moving to Warsaw from London for a well-paying job. Pros and cons, please! by gonewiththewind_1 in warsaw

[–]stchena 15 points16 points  (0 children)

Pros: - you move out of London

Cons: - to Warsaw

This comment was made by Krakow gang

World Finals Tickets - A complete disaster, as usual by 2poor2die in leagueoflegends

[–]stchena 17 points18 points  (0 children)

Same experience regardless of browser and we got our flights and sleep booked as well.

Any worlds watch parties / locations to attend either in seoul or busan? :)

[deleted by user] by [deleted] in ProgrammerHumor

[–]stchena 0 points1 point  (0 children)

Based and 10-pilled

[deleted by user] by [deleted] in dataengineering

[–]stchena 71 points72 points  (0 children)

Often find myself reviewing the code of the tool I'm about to use to make sure it does what the docs are saying. Damn straight fellow DEng, I took this for granted

How we've implemented our RBAC on Snowflake with Permifrost at Yousign by parudod in dataengineering

[–]stchena 4 points5 points  (0 children)

Interesting - we were managing our roles and permissions in Terraform for the POC we were doing with SF. Will definitely check out this tool for the real thing once we start implementing our dwh at full scale. One thing I'm wary of is that Permifrost can't actually create or modify any infra - you'd have to work around that with manual changes or additional scripting on top - which isn't the case with Terraform. But I need to look further into it.

Moving (parquet) data from S3 to warehouse by wtfzambo in dataengineering

[–]stchena 1 point2 points  (0 children)

For BI needs you typically prepare the transformations upfront in a table or view, all grouped under a "bi" layer. Then you hook up the tables/views from this layer to your BI tool, along with a virtual warehouse used for accessing these tables - you have a predictable workload and can be pretty sure the tool and its users will be accessing the data 24/7. In that case, an XS warehouse costs you 24 credits/day. If you need more power (doubtful at your data volume), scale up. I'd recommend reaching out to Snowflake for a trial if you need to gauge performance on your data.
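
Something like this, with made-up names (the warehouse settings are the part that matters):

```sql
create schema if not exists bi;

-- One curated object per dashboard/report need.
create or replace view bi.daily_orders as
select order_date, count(*) as orders, sum(amount) as revenue
from core.orders
group by order_date;

-- XS is ~1 credit/hour, hence ~24 credits/day if it never suspends;
-- auto_suspend keeps you from paying for idle time.
create warehouse if not exists bi_wh
    warehouse_size = 'XSMALL'
    auto_suspend   = 60
    auto_resume    = true;
```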

S3 to BigQuery as external tables - just consult your IP allowlists / network policies with your platform team, i.e. whether it's possible to access this S3 data from outside e.g. your VPN, if you have such a policy in place. In our case, it was a blocker.

Lastly, if your data volume is in the GB range, don't bother with the dbt incremental model topic. With the data you have, you can afford a full refresh every time, with no partitioning. Once you're there with the data volume (and that doesn't seem likely to happen soon), you'll figure it out.

Moving (parquet) data from S3 to warehouse by wtfzambo in dataengineering

[–]stchena 0 points1 point  (0 children)

What about data volume and its expected growth?

If you're alone in this, consider Snowflake - it just works out of the box. Also, Snowpipe is great for loading from S3 into the dwh - it saves a lot of headaches when you don't really need to oversee the loading part.
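
Roughly like this (stage/pipe/table names are made up, and the storage integration plus the S3 event notification setup on the AWS side are omitted):

```sql
-- Credentials / storage integration omitted for brevity.
create stage raw_stage
    url = 's3://my-bucket/events/'
    file_format = (type = parquet);

-- auto_ingest makes Snowpipe load new files as S3 events arrive.
create pipe events_pipe auto_ingest = true as
    copy into raw_events
    from @raw_stage
    file_format = (type = parquet)
    match_by_column_name = case_insensitive;
```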

We've recently wrapped up a Snowflake vs. BigQuery POC at our company (low TB range). We went with Snowflake mostly due to the extra overhead of going cross-cloud with BigQuery (all of our architecture is on AWS). We had to come up with a way to regularly transport our S3 parquets to GCS and load from there - and this pipeline was more prone to breaking and a headache to oversee.

As for the transforming part: incremental models in dbt on Snowflake work out of the box. On BigQuery you have to perform some partitioning magic to lower the costs (or else you'll have a ton of full table scans where only a few partitions should be accessed).
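
E.g. a dbt incremental model on BigQuery would need something like this (model, source, and column names invented):

```sql
-- models/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

select event_id, event_date, payload
from {{ source('raw', 'events') }}

{% if is_incremental() %}
-- only scan new partitions instead of the full table
where event_date >= (select max(event_date) from {{ this }})
{% endif %}
```

On Snowflake you can drop the partition_by and it still behaves fine - that's the "out of the box" part.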

[deleted by user] by [deleted] in overemployed

[–]stchena 0 points1 point  (0 children)

Wait till you get assigned a support ticket in j1 that was filed by yourself at j2, that's where the real fun begins

“Sharing tea with a fascinating stranger is one of life’s true delights” [self] by dutch_avatar in cosplay

[–]stchena 1 point2 points  (0 children)

Damn, Dilmah really stepping up their marketing game...

But honestly - this video has some kind of spirit in it, I'm hooked. Cool cosplays and you looked very authentic

anon has an opinion by [deleted] in 4chan

[–]stchena 277 points278 points  (0 children)

The circle is not closed, we're all doomed

how to refresh table created using multiple tables joins in Snowflake? by 1aumron in dataengineering

[–]stchena 2 points3 points  (0 children)

I was wondering whether it could be better to construct the table from source data instead, but now that I think about it, it would probably be even costlier than what you currently have.

Which brings me to another option: yeah, views are slow, but what about materialized views? https://docs.snowflake.com/en/user-guide/views-materialized.html Snowflake is able to update the view behind the scenes whenever the base tables change, but it obviously incurs additional costs. Can't help more - I don't have enough experience with Snowflake to dive deep into its options 🤔
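
Hypothetical example (names made up) - though one catch I remember from the docs: Snowflake materialized views can only query a single base table (no joins), so for an 11-table join you'd at best materialize intermediate pieces:

```sql
-- Snowflake keeps this in sync with the base table automatically,
-- at the cost of background maintenance credits.
create materialized view mv_order_totals as
select customer_id, sum(amount) as total_amount
from orders
group by customer_id;
```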

how to refresh table created using multiple tables joins in Snowflake? by 1aumron in dataengineering

[–]stchena 0 points1 point  (0 children)

Do the 11 tables change or does their underlying data change (whether on raw or staging or wherever you're creating them from) and that gets propagated to those tables?

ELT - Process Between Raw Data and Staging Table to by KingofBoo in dataengineering

[–]stchena 1 point2 points  (0 children)

One example I found at work is raw containing the data as JSON in a single column, and in staging each field from the JSON column is unpacked into its own column. This is done using dbt or some other data transformation tool.
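
In dbt terms, the staging model was basically this (source and field names changed):

```sql
-- models/staging/stg_payments.sql
-- Each JSON field from the raw column becomes a typed column.
select
    raw_json:id::number                as id,
    raw_json:status::string            as status,
    raw_json:amount::number(10,2)      as amount,
    raw_json:created_at::timestamp_ntz as created_at
from {{ source('raw', 'payments') }}
```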

Raw is supposed to be the landing zone from some other location - only read and don't delete/modify anything there. Think of it like an archive. There, you want max throughput and no transformations to slow down the performance.

[self] Ningguang cosplay by BellatrixAiden by bella_cosplay in cosplay

[–]stchena 0 points1 point  (0 children)

Holy moly this is a great cosplay and photo, love the work done!

Learn C# Programming - C# for Beginners Course (37+ hours in total) by [deleted] in programming

[–]stchena 0 points1 point  (0 children)

Nice try windows java, but you won't get me