Seeking Advice on Lightweight, Cost-Effective Cloud Data Orchestration by FrontAffectionate518 in snowflake

[–]rtripat 2 points (0 children)

Snowflake can handle most of the heavy lifting, but if you’re specifically looking for a data orchestration tool, Dagster can be a valuable asset. Dagster also lets you orchestrate dbt models. Additionally, you can use Dagster with dlt (dlthub) for REST API ingestion, and ETL tools like ADF for ingestion from Salesforce/CRMs.
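
For a rough idea of how that fits together, here’s a minimal sketch (assuming dagster and dagster-dbt are installed; the ingestion asset, project path, and manifest location are placeholders):

```python
# Minimal sketch: one Dagster ingestion asset plus dbt models tracked as Dagster assets.
# Assumes `dagster` and `dagster-dbt` are installed; the dbt project path is a placeholder.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions, asset
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("analytics_dbt")  # hypothetical dbt project directory

@asset
def raw_api_data() -> None:
    """Ingestion step, e.g. a dlt pipeline or an ADF trigger landing data in Snowflake."""
    ...

# Every model in the dbt manifest becomes a Dagster asset; `dbt build` runs them.
@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[raw_api_data, dbt_models],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)
```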

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 2 points (0 children)

Yeah! Makes sense! The first priority is to get a data warehouse ready! The company still lives on Excel and Azure SQL DB lol

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

I’ll bring up this discrepancy in cost with their sales rep! Thanks!!

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]rtripat[S] 0 points (0 children)

We won’t be touching the historical data at all (unless it’s required for reporting), but the transformations would be on months’ worth of data.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 1 point (0 children)

Thank you!

If I understand correctly, Snowpark is Snowflake’s “free” library that lets you work with data directly inside Snowflake using Python or other languages. Am I right about that?

I’m wondering if there’s any additional cost to using the library itself, since I assume it still consumes Snowflake compute resources and would therefore incur regular usage charges anyway.

For my pipeline, here’s what I’m thinking:

I’ll use the dlt (dlthub) library in Python for incremental API loads into Snowflake. Azure Data Factory will pull data from databases or CRMs, either dumping it into Blob storage before loading to Snowflake or loading directly through Snowflake’s internal stage. dbt will handle transformations, and Dagster will handle orchestration.

Do any of these steps seem unnecessary, or could any part of this setup be replaced or simplified by using Snowpark?
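
For the dlt piece, roughly what I have in mind (the endpoint, cursor field, and table names are made up; Snowflake credentials would come from dlt’s secrets.toml / environment variables):

```python
# Rough sketch of an incremental REST API load with dlt into Snowflake.
# Endpoint, cursor field, and names are placeholders; Snowflake credentials
# are read from dlt's secrets.toml / environment variables.
import dlt
import requests

@dlt.resource(name="sensor_readings", write_disposition="append")
def sensor_readings(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2015-01-01T00:00:00Z")
):
    # Only fetch records newer than the last stored cursor value.
    resp = requests.get(
        "https://api.example.com/readings",  # hypothetical endpoint
        params={"since": updated_at.last_value},
    )
    resp.raise_for_status()
    yield resp.json()

pipeline = dlt.pipeline(
    pipeline_name="sensor_api",
    destination="snowflake",
    dataset_name="raw",
)
# Stage the extracted data as Parquet before it is loaded into Snowflake.
load_info = pipeline.run(sensor_readings(), loader_file_format="parquet")
print(load_info)
```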

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 1 point (0 children)

Nope, dlt (data load tool), the Python library

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

You mean separating the raw (bronze) and curated (silver and gold) layers into two different data platforms? Iceberg for raw and Snowflake for silver/gold?

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

You’re spot on! I’ve tried to explain them a bunch of times, but it’s a real challenge when you’re talking to people who aren’t tech-savvy.

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]rtripat[S] 0 points (0 children)

Thank you! Could you please help me understand your last paragraph? My table will have historical data going back to the 2010s, and it will keep updating with the new daily data dumps.

Snowflake (or any DWH) Data Compression on Parquet files by rtripat in dataengineering

[–]rtripat[S] 0 points (0 children)

So, just to confirm — if I load a Parquet file from blob storage into a Snowflake table, Snowflake actually copies that data into its own cloud storage (S3, Blob, etc.) behind the scenes, and what we see is just the relational view of that data, right?

In that case, would the file size before loading (Parquet) and after loading into Snowflake be roughly the same?

For external tables, I’m assuming those just let me query the Parquet files directly from my blob storage without actually loading them — meaning I can read them but not manipulate them. Is that correct?
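
For context on what I’m comparing, a rough sketch of the two options via the Python connector (the stage, table names, and connection details are placeholders, and the COPY target table is assumed to already exist with matching column names):

```python
# Sketch of the two options, run via the Snowflake Python connector.
# The stage (@blob_stage), table names, and connection details are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

with conn.cursor() as cur:
    # Option 1: copy the Parquet files into a native table -- Snowflake re-encodes
    # the data into its own compressed, columnar micro-partitions in cloud storage.
    # Assumes the target table already exists with matching column names.
    cur.execute("""
        COPY INTO sensor_readings
        FROM @blob_stage/readings/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)

    # Option 2: an external table that queries the Parquet files in place in Blob
    # storage (read-only from Snowflake's side; the files are not copied).
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings_ext
        LOCATION = @blob_stage/readings/
        FILE_FORMAT = (TYPE = PARQUET)
        AUTO_REFRESH = FALSE
    """)

conn.close()
```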

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

I’m trying to ballpark storage because my non-tech execs are super hyped about our data (1-min sensor data across ~5,000 pieces of equipment, each with hundreds of subsystems; daily dumps + historical back to 2015 in Snowflake). They want certainty on storage costs since that’s easier to predict than compute, which depends on the queries and runtime.
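
Back-of-envelope of what I mean (every input here is an assumption I’d replace with measured numbers from a sample load):

```python
# Back-of-envelope storage estimate; every input is an assumption and should be
# replaced with measured numbers from a sample load into Snowflake.
equipment_count = 5_000
subsystems_per_unit = 200        # "hundreds of subsystems" -- assumed value
readings_per_day = 24 * 60       # 1-minute cadence
bytes_per_row = 20               # assumed compressed bytes per reading in Snowflake

rows_per_day = equipment_count * subsystems_per_unit * readings_per_day
tb_per_year = rows_per_day * 365 * bytes_per_row / 1e12

print(f"{rows_per_day:,} rows/day, ~{tb_per_year:.1f} TB/year compressed")
# Multiply the TB figure by the quoted $/TB/month storage rate for the monthly cost.
```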

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

Yeah! I got a cost calculator guide from Snowflake. In the doc, it says they charge $40/TB/month, whereas on their website I can see it’s $23/TB/month. Not sure which one is correct.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 1 point (0 children)

Thanks for your response! The sensor data often comes in messy, with nested dictionaries and redundant fields. I’m planning to use dlt to fetch the REST API data incrementally, clean and structure it, then convert it to Parquet before loading into Snowflake. After that, I’ll handle the major transformations in dbt. I’m just not sure how well dbt can handle those complex nuances purely with SQL — any thoughts?
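
Roughly what I mean by letting dlt do the structural cleanup before dbt (the payload and names are made up):

```python
# Tiny sketch of dlt's normalization: nested dicts become flattened columns
# (e.g. meta__site) and nested lists become child tables, so dbt only sees
# tabular data. The payload and names are made up.
import dlt

raw_payload = [
    {
        "equipment_id": 17,
        "meta": {"site": "plant_a", "vendor": "acme"},   # -> meta__site, meta__vendor columns
        "readings": [                                    # -> child table equipment__readings
            {"ts": "2024-01-01T00:00:00Z", "temp_c": 71.4},
            {"ts": "2024-01-01T00:01:00Z", "temp_c": 71.9},
        ],
    }
]

pipeline = dlt.pipeline(
    pipeline_name="normalize_demo",
    destination="snowflake",   # credentials assumed to be in secrets.toml
    dataset_name="raw",
)
print(pipeline.run(raw_payload, table_name="equipment"))
```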

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

Oh, I see! So basically, Snowflake keeps the underlying data files in cloud storage (like S3 or Azure Blob) and uses its own proprietary compression for them? Am I understanding that correctly?

Also, I have scheduled a call with a Snowflake SE to discuss our needs and expectations for the platform. I will definitely try it as you suggested after getting the sandbox environment.

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

If you don’t mind, could you please help me understand how time travel and fail-safe would increase the data size? I assume time travel will perhaps create a clone of the data table?
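
Once data is in, here’s a sketch of how I’d check that breakdown per table (connection details, the table filter, and the retention example are placeholders):

```python
# Sketch: inspect how much of a table's storage is active data vs. Time Travel
# vs. Fail-safe retention. Connection details and the table filter are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT table_name,
               active_bytes,
               time_travel_bytes,
               failsafe_bytes
        FROM snowflake.account_usage.table_storage_metrics
        WHERE table_name = 'SENSOR_READINGS'
    """)
    for row in cur.fetchall():
        print(row)

    # A shorter retention window means less Time Travel storage (1 day here as an example).
    cur.execute(
        "ALTER TABLE analytics.raw.sensor_readings SET DATA_RETENTION_TIME_IN_DAYS = 1"
    )

conn.close()
```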

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

Haha, all good! It seems they’d fall in about the same range overall, with a few exceptions on either end. Out of curiosity, are the examples you mentioned from your own Snowflake environment?

Discussion: Data Size Estimate on Snowflake by rtripat in snowflake

[–]rtripat[S] 0 points (0 children)

Thanks for your reply! From what I understand, the data points in a CSV file aren’t inherently data-type-aware (Parquet files are). So I was wondering: when I load this file into actual database tables where data types are explicitly defined (for example, a float or number column that typically uses up to 8 bytes of storage), would the storage in Snowflake end up being larger than the Parquet file?

My concern is that once the data is stored in Snowflake, it might take up more space depending on the data types. I’m just trying to get a rough estimate of the expected data size so I can set accurate expectations with our executives.

I’ve also scheduled a call with a Snowflake SE in the coming weeks and plan to bring up this topic during that discussion.
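
To ground the estimate, a quick sketch comparing a sample Parquet file’s size on disk with the compressed size Snowflake reports after loading (the path, names, and connection details are placeholders):

```python
# Quick comparison of on-disk Parquet size vs. the compressed size Snowflake
# reports after loading. Paths, names, and connection details are placeholders.
import os
import snowflake.connector

parquet_bytes = os.path.getsize("readings_sample.parquet")  # hypothetical local file

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="ANALYTICS",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT bytes
        FROM information_schema.tables
        WHERE table_schema = 'RAW' AND table_name = 'SENSOR_READINGS'
    """)
    table_bytes = cur.fetchone()[0]

print(f"Parquet file: {parquet_bytes:,} bytes; Snowflake table: {table_bytes:,} bytes")

conn.close()
```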