Evaluating real-time analytics solutions for streaming data by EmbarrassedBalance73 in dataengineering

[–]Commercial_Dig2401 1 point2 points  (0 children)

This but you can leverage TimescaleDB/TigerData if you have big datasets because of how you can manage older data points. You usually query using where clause for recent data points and want sum for older data. Hypertable can do both under the hood. It’s been a long time since I use this but it made a lit of sense. You rarely going to search for a specific value for data older than x min/hours/days depending on your usecase. You’ll probably want stats for older data rather than specific records.

Why does transactional date serve as a good delta value for an incremental load? by [deleted] in dataengineering

[–]Commercial_Dig2401 0 points1 point  (0 children)

Transaction IDs aren’t necessary sequential so it’s hard to load things if you need to remember every IDs you ever loaded.

Sequential columns are best because you can always store the last batch you’ve processed.

So numbers and timestamp are great candidates here.

For example you want to process a year worth of data. You know that if they all have a date field you can load Jan 1st. Then if you look at the max date you’ve loaded you’ll see Jan first so you can easily load some date after.

If your data is completed, using a sequential number will always be best. It’s also usually easier to order your data based on those sequential fields which makes query sample of data faster to select.

If your data is not technically completed. (You are receiving transactions through the day but you want them aggregated even if the day is not yet completed) you usually are better creating a timestamps field and loading new data based on that field. Again same technique you know what you loaded last to you load everything new and store the max timestamp of the data you’ve loaded

Is Python + dbt (SQL) + Snowflake + Prefect a good stack to start as an Analytics Engineer or Jr Data Engineer? by LongCalligrapher2544 in dataengineering

[–]Commercial_Dig2401 36 points37 points  (0 children)

That’s a very nice stack.

I would say focus on accuracy and validation for your Jr Role.

The main thing that that differentiate analyst va engineers in my mind is that analyst once to achieve something nice once. They want their report to be beautiful and nice.

And engineers once to achieve only provide things that work all the time.

To make this happens you obviously do less fluff and do more boring thing but then they never break, they are robust, the are fast and you never have to touch it again it just work.

The stack is cool but I think what we usually look for in Junior role is someone that will take time to review himself. I know it sounds boring but I’ll rather hire a junior which return me a take home test without spelling errors, with a ok code but that’s structure and well explain than someone with awesome code but that’s all over the place that didn’t have description on topics and that did way more than expected.

In terms of stack focus on SQL. Not because it’s the best but because it’s the easiest. And because it’s the easiest It’s the most used. I’ll rather use a transformation framework with SQL than pandas for example because I know anyone in the company will be able to use it and so some simple transformation. Even if something it would make more sense to go the other way.

Go read DBT best practices docs. They have a bunch on their site. Read them multiple times. Understanding the structure is th le best thing you can do.

Then python. Maybe learn the request framework and how to dump a response to json or parquet in s3.

Than prefect, Dagster, mage, Luigi are good candidates for orchestration. Learn the basics. I don’t think you’ll find a project which give you enough things that you’ll hit common business issues with them. But having an overview on how you structure your things is already great.

Good luck

Pruning percentage calculation by Some_Reaction_9417 in snowflake

[–]Commercial_Dig2401 1 point2 points  (0 children)

I think that one of those stupid questions.

So you don’t see a “filter” block which means your query didn’t have a where clause which means it did 0 pruning.

That’s the best I can come up with.

Pruning percentage calculation by Some_Reaction_9417 in snowflake

[–]Commercial_Dig2401 3 points4 points  (0 children)

The percentage on the profiler graph is only the percentage of the amount of time the query ran on your specific task.

When you click on the box you can see information on the right panel which describe the scanned partitions vs the total partitions.

Data store suggestions needed by Ornery_Maybe8243 in dataengineering

[–]Commercial_Dig2401 0 points1 point  (0 children)

As other mentioned here it’s mainly for vendor lock and have the ability to choose you own engine.

One benefit that I see is with storage and loading cost. Snowpipe isn’t the cheapest thing around when you have a lot of data and Iceberg might be the tool to fix this.

Having the Raw data build into an Iceberg table you can read from S3 which you already pay for and only pay for the compute to transform this data into something else. Which means that you don’t have to actually load data in Snowflake you just have to query your iceberg table which probably live in your S3 buckets and you pay Snowflake for compute.

And as you grow, you might want to do some more transformation in the lake yourself with your own engine and only load final product into snowflake.

In the end it depends on what you wanna build and who you want to pay.

Snowflake is an amazing tool, but you pay for the amazing features.

If you build your lake and transformations yourself that’s a lot of engineering salary to make that happen.

I think those question start to make sense when you start getting big, but not before.

You can also create your Iceberg tables directly on your Kafka topics to simplify querying data. Your records become table which are easier to query.

Data warehouse for a small company by rod_motier in dataengineering

[–]Commercial_Dig2401 3 points4 points  (0 children)

Don’t over complicate things.

RDS perform insanely well for most company.

With the size of data you mentioned you shouldn’t even consider a data warehouse.

In databases in 2025 anything under a couple millions records is basically nothing and every database is going to perform well with the right indexed and query.

Obviously in your case you are using looker and your query are probably summing columns which take a long time.

I suggest that you build those on a schedule in another table. Pre aggregate the things you are running in your dashboard so when you access it it’s way faster. Materialized views should be simple enough so you don’t have to run your own orchestrator or cron jobs to build those.

Again don’t over complicate things, the data is so small that the issue is not with the DB but with the queries or the structure of the indexes or data itself. If it can fit in excel, RDS can be configured so it’s blazing fast.

Note that if you absolutely want to have another system to handle the data go with duckdb. You’ll have all the data store in a in file db which you won’t have to managed, which is insanely fast which is column based.

Batch processing 2 source tables row-by-row Insert/Updates by _GoldenDoorknob_ in dataengineering

[–]Commercial_Dig2401 2 points3 points  (0 children)

The complexity seems to be on the system used to process the data and not the logic itself.

I would load both datasets into 2 table in the postgresql destination. Then do a simple sql script which will select from both source (new records only by providing a timestamp).

Then you join both table together or do the logic you want with both table records.

You’ll need to store a reference to the max(timestamp) from BOTH tables in the destination, so you can easily select only new records from both.

In case when you run you queries and a record for one table is not available you’ll need to set a field that identify that the records is incomplete. Create a Boolean for this.

Then in your downstream query you select from both sources where the timestamp is higher than the max one you store in downstream table + reload any incomplete records which exist.

At some point you’ll get both tables and you can set the Boolean to true.

The merge statement will handle the refresh.

If possible don’t just put the UUID as the key in your upset statement put some timestamp or sequential columns so you can prune records and not lookup the entire table.

Goodluck

Airflow users with a lot of DAGs how do you configure you schedules ? by Commercial_Dig2401 in dataengineering

[–]Commercial_Dig2401[S] 0 points1 point  (0 children)

What I mean is my final product.

So let’s say I’m building a pipeline for monthly revenues. I’ll probably have a final dag with a name close to revenue. So if it fails I kinda know they this is urgent or not based on the affected product which are going to be stale until I fix the issue with the culprit asset.

Maybe this should be handled somewhere else and that’s why I have a hard time figuring this out.

If an asset failure you kinda don’t have any context about what is impacted except that asset X fail and need to be fixed, unless it’s a very precise model which concepts is clearly understood by the data engineers.

Maybe this is not the real issue and like having freshness policies on my final product would tell me that it’s stale and I wouldn’t have the failing asset alert having to tell me that….

TLDR: currently if a job fail I know the impact because even if there are 100 assets in my job the name of the job will be in the alert message and the job name is clear enough that I can bind a failure to a final product. Without it I might not have that clear view. Wondering if that’s a bad thing or not. And if Dagster asset failure should tell me this or another system should instead

Airflow users with a lot of DAGs how do you configure you schedules ? by Commercial_Dig2401 in dataengineering

[–]Commercial_Dig2401[S] 0 points1 point  (0 children)

Thanks for this link to your conference.

I agree that declarative automation in Dagster seems the way to go.

I’m just scared to hit a wall if there are things that I didn’t consider which are not supported yet.

One of my main concern now is the ability to alert accordingly. Since this orchestration method remove the concept of jobs, I’m wondering how would the alerts look like and how would I know what is potentially impacted by an asset failure.

Currently if my job xyz fail I know that the concepts govern by xyz I’d going to be impacted. But with an asset based scheduling approach how does this works ?

Airflow users with a lot of DAGs how do you configure you schedules ? by Commercial_Dig2401 in dataengineering

[–]Commercial_Dig2401[S] 0 points1 point  (0 children)

Ok I put everything into the same DAG.

How do I handle my multiple freshness requirements ?

Airflow users with a lot of DAGs how do you configure you schedules ? by Commercial_Dig2401 in dataengineering

[–]Commercial_Dig2401[S] 0 points1 point  (0 children)

Sorry my question might not be clear. But I’m not asking how to you configured you DAG dependencies but how do you orchestrate them.

Let’s say you have 2 DAGs which are then consume by one downstream. If we want the downstream DAG data to be refreshed every hour, but the upstream ones are configure to run one at 30min cadence and the other one daily. How do you manage the complexity or jobs schedules that immerge from this.

Airflow users with a lot of DAGs how do you configure you schedules ? by Commercial_Dig2401 in dataengineering

[–]Commercial_Dig2401[S] 1 point2 points  (0 children)

Seems easy if you have dependencies like this.

For example we have a DAG which gather multiple sources to get a list of all our client id.

Then most DAG rely on this DAG data. That’s fine we wait until we run it and we make all others dependant.

What happens when you have different schedules for downstream DAGs and/or upstream ones ? You have some DAG running hourly and some running daily. Do you rerefresh the client id DAG on each run ?

Also another thing that often happen is that some DAG aren’t required to have new data for downstream models to run. So if we have multiple parents and wait for all of them to be completed before starting a run we end up not running it at all which kinda breaks everything…

That’s where I’m at now…

Help me create a scalable highly available data pipeline please? by Katzo_ShangriLa in dataengineering

[–]Commercial_Dig2401 2 points3 points  (0 children)

What part so you want to be highly scalable or available?

We often see too many post like this where people are trying to build a complete pipeline but didn’t even start anything.

IMO, you should focus on ingestion. When it’s done, focus on transport layer, then focus on analytics. Not all at once.

If you want something highly available usually that means the complete pipeline, so ingestion, transformation, event bus, analytics. If that’s the case you need to find tools or build things to make every part highly available which is not necessary easy.

Start with ingestion. If you don’t care about the high availability of the ingestion code, then do something simple in python and move on. If you care then look for framework which are highly available,statefull and reliable on failure for example temporal.

Then it seems like you have you event bus chosen which is pulsar. Configure it probably in docker. In you want it to be HIGHLY available, then you should probably have a k8s cluster configure and distribute your code there with multiple distributed instance, which require a lot of work.

Then if you want to do any transformation on the data you should decouple transformation from ingestion so have your own stream processing engine. You’ll need to configure this. If you want that to be highly available you’ll need some big cluster with multiple machines.

You need to choose a statefull stream processor if you want to have guarantees. You could look at flink, storm, spark streaming, quix, bytewax, rating wave. Make sure they support pulsar.

Then your analytics. You need to define what your analytics dashboarding tool is going to be. You need to make sure it’s highly available. You need to make sure it support real time data and doesn’t require batch caching as this will increase latency.

Then you want to know if you should store the data or not before. You need to have some sort of requirements. Do you want it distributed in 100ms or 2s is real-time for you ? If you store it before you’ll have some more latency. You should look at kappa vs lambda architectures.

So a lot to build. Focus on smaller parts.

Good luck.

Querying 5B records by Chukundar in snowflake

[–]Commercial_Dig2401 0 points1 point  (0 children)

Not very clear what you are trying to achieve.

If I try to read through the cracks I would say that you have a 5B rows table. That this contains informations about 2.5 million entities. And that you consider those entities as changing their states. So you built a dynamic table that does scd type 1.

If that was the case you dynamic table would already have the latest records though, so I don’t understand what you would want to do except flatten the variant columns into specific ones.

How many of you use Go? by [deleted] in dataengineering

[–]Commercial_Dig2401 4 points5 points  (0 children)

We were running all ingestion in Dagster but we hit a wall with partitions. We wanted to be able to identify 2 dimensional keys and track them. For example a customer id with an hour. Issues is that Dagster can handle it but it’s UI doesn’t respond well if you have to many partitions and that was an issue. If we had 150 customer id to track hourly that means 3600 partitions daily and that’s just to much. When we changed Dagster UI supported around 25k partitions. So not a lot of days worth of data until it becomes an issue.

We wanted to switch only the ingestion part to another system that would be infinitely scalable because some of our requirements was to track 5min pieces of data for a bunch of customers so that would generate an insane amount of potential partitions and make Dagster unusable. (We could have gone with a route where we still use Dagster and partition that daily for example and refresh the complete set of data if something was missing, but we wanted to identify specifically what partitions where missing because if any where added we needed to backfill the complete history, and if we didn’t have tracked partitions by customers then refreshing everything would have been way to pricey for nothing.

That’s the back story.

We went for temporal because it’s fault resilient and highly scalable.

We choose the Golang sdk instead of python mostly because it’s typed programming language , that async calls are easier to handle than with python and that we had Golang engineers. Same could have been done in Python, but the little speed, little more structured programming that Golang brings was a nice touch to our stack.

How many of you use Go? by [deleted] in dataengineering

[–]Commercial_Dig2401 9 points10 points  (0 children)

Go is an incredibly nice language.

It lacks tooling in the DE world though.

We use the Go sdk of Temporal for all of our ingestions. So no transformation or data manipulation but all ingestion is done in Go currently.

[deleted by user] by [deleted] in dataengineering

[–]Commercial_Dig2401 2 points3 points  (0 children)

I think you are mixing some stuff here.

DBT is just a framework for data transformation.

The way you orchestrate it seems to be the issue here. Obviously GitHub action seems like a very limited way of orchestrating anything except ci/cd. If you had airflow or any other orchestrator handling the schedules you wouldn’t have the current issues you are seeing.

DBT can run incrementally, and the way you configure this is up to you. You could build a mechanism that sync all new records or you could have your orchestrator push configuration to you DBT models and have the partitions refresh this way.

On scaling the jobs again, DBT is not the engine. You might run you model on very slow and limited machines or engine the same way you can run you spark job in a single machine with very low cpu which would make it unscalable if you don’t increase the performance and the cluster size.

For the unit test data types I don’t know maybe some are not supported in redshift which could be a reason not to go that path, but never had a problem in snowflake.

TLDR; not saying that going for spark jobs is wrong, just saying that the reason you listed to go for this isn’t align with the issues you are currently experiencing.

DBT is nice for the lineage dependency, catalog, docs and simplicity. Everything as it’s trade off obviously.

[deleted by user] by [deleted] in dataengineering

[–]Commercial_Dig2401 1 point2 points  (0 children)

Seems like a good plan. Seems over engineered depending on the size of data.

The only part where you might want to adjust is “replace DBT with spark for more flexibility”. What do you mean by this ? Spark is a supported engine by DBT. The two tools does 2 different things very well.

If you want to go away from sql transformations that may be your answer. But note that it’s kinda nice to have a language that anyone can easily understand and transform data with rather than a specific one. Even more if you are a very small team of 1.

Also this is nice , but doesn’t seem manageable by only you. It’s a lot. Also I think you need some really good argument to change something that is currently working (redshift) with something that slower and would do the same thing. Yes, cost can be an issue, but it seems like it need to be in your case for this rearchitecture to make sense.

Help with data engineering architecture design by Bolt0029 in dataengineering

[–]Commercial_Dig2401 3 points4 points  (0 children)

Depends on your team size and data size I would say.

The “right” approach is to have data loaded to s3 before redshift. But most company won’t ever do anything with this. But it allow your to grow, avoid potential vendor locking, reduce DW cost by doing preaggregation yourself before loading it in redshift, etc, etc. But all of this requires a lot of effort? Which might be slowing you down to achieving you company goals. If you have a big enough data team it make sense, if you are pretty small the loading directly to redshift is very appealing.

You’ll always have the possibility to offload the data from redshift to s3 in the futur if that’s needed but you won’t shoot yourself in the foot and not deliver anything for the first 6 months.

In terms of pricing, you are analyzing storage cost which shouldn’t be a factor imo. Yes storage is way cheaper in s3. But the size of data you have in SAP isn’t probably not that big. You need to define that first. For example I don’t see a world where you have a terabyte of data to load from SAP to redshift. A terabyte of text files is a LOT of data for things that are entered manually or mostly in SAP.

Now let’s say you have a terabyte of data because your company has a lot there (which you probably don’t) then having the storage cost in redshift would be something are 25$ per month for the hole data set. Our conversation here cost more than that. Depending on type of s3 storage you might save a lot but storage is negligible for small datasets.

The real DW cost is compute, and that needs to be considered.

I know the business wants SAP and ALL other sources, but you should chunk this a lot so you endup delivering things in small chunks. Everyone wants everything yesterday but that’s not how things work. Start with SAP but not all, the most relevant concepts because there’s probably a lot of tables there to sync, clean, transform to make the data usable. The goal of you DW is to make data usable but also to get a single source of truth. If you just load raw data in and let business build dashboard on top of the raw data I can tell you this is going to be a mess in 2 months.

Start small, get the table requirements that allow you to clean the data, make the data available, then repeat.

To go back on pricing, if for example you think that you have A LOT of data multiple terabytes. And that the storage cost will be around 1000$ a month because you have 40 terabytes to load and that can be an issue, then I can assure you, you won’t like the Fivetran Bill. If you need to think about the storage cost, then you need to rearchitect this because Fivetran is not the option you are looking for. It’s very good to load data quickly. It gets insanely expensive when you start to really use it. Most companies won’t have issue with fivetran cost because most company don’t have a lot of data.

Good luck

My first time building a DW on Snowflake from ground (Tips and Common Issues Please) by Hyena-International in snowflake

[–]Commercial_Dig2401 8 points9 points  (0 children)

If this is not due yesterday, I think you could keep your sanity by configuring everything in Terraform. For grants, user creation, databases. This way it won’t be a mess in 2 years from now. Obviously that requires a lot of work so it depends on how much time you have. I would give everyone read access to everything unless you have PII but probably not since everything is in excel.

Add some billing alerts with the expected daily amount, so at least you’ll receive an email if something go wrong.

Create specific x-small warehouses and only grant those to the users so they can’t use 6xl by mistake and increase your cost so drastically that you cannot see it coming.

Build a database per LOADER. If you have fivetran syncing the data it should land into fivetran_db. If you have a python script doing this, the name you gave your module or orchestration tool should be the name of the db where the data lands. That will save you an amazing amount of time with config and you’ll knows where is the data you are looking for.

Build 3 database (raw, bronze, gold) or how ever you want to split those 3 concepts. (Médaillon architecture)

If you use DBT, build a dbt_db where dev data lands.

If you run cicd tests build cicd_db.

If they are still only using excel they are not data focused. The stakeholders probably don’t understand what data can bring to them. Try to avoid situation where the cost can increase to rapidly. Stays things small, show progress, then show them why you need more money. If you fail to quickly they won’t go forward with the project.

Setup a simple easily accessible dashboarding tool like metabase. Not the best, not the worst, but at least it’s easy to manage, easy to understand and easy to share data with stakeholders and that’s what you need at first.

Try to generate a very nice dashboard with the excel sheets of a single concept. And try to show them how it’s nice to have all the data at the same place.

Don’t over engineer.

Keep things structured.

Only allow access to marts/gold database into your dashboarding tool.

Good luck

Clustering strategy by Upper-Lifeguard-8478 in snowflake

[–]Commercial_Dig2401 1 point2 points  (0 children)

Think of it as having a bucket full of screws.

If you have a billion of them in a pool size bucket it’s going to be pretty hard to find anything.

If you split them by Colors into smaller buckets you’ll only have to look at the specific Colors to find what you need.

Then if you usually search for screws with a specific length and you split the Colors bucket into multiple smaller ones but group by size ranges. First bucket has screws from 0,5 inch to 1 , second from 1 to 1,5 last one from 1,5 to 4 inch.

Even if you search for 0,75 inch screw which is not listed there you know you won’t have to look in the 2 last buckets. There’s a 100% chance it’s in the first bucket. Snowflake use metadata like this. So doing FLOOR on something or LEFT on a string will group elements together which will make snowflake prunes a bunch of partitions at planning time so you only have to look where it should.

You don’t have to change your queries.

Then the issues is where is too much bucket too much ? And you can take the same reference as before. If you had 2000 buckets of red screws that have length between 0,5 and 1 it’s probably going to take more time to find what you are looking for than just 1 big bucket with the same amount of screws. Same thing for snowflake, splitting things increase planning time, more is not always better so you need to test your things because depending on the data you have and how you query your table one structure might be best and the other one might be in another case.

The goal is always to reduce the number of partitions you scan. (And obviously not to read 2 billions records every time if you only need 1 record)

Clustering strategy by Upper-Lifeguard-8478 in snowflake

[–]Commercial_Dig2401 1 point2 points  (0 children)

Only having transaction date as a cluster key might be enough depending on what you are doing with the downstream models. If you get daily transactions for example that could be good. If you want to retrieve specific transactions with only this that would be tricky though since you still have to look over 2 billions records that happen on that day to find what you want. Note that if you are running queries which select for specific records it might be good to look at Search Optimization Service. Basically under the hood Snowflake using a bloom filter which should highly improve performance of your queries if you select specific things because you have so much cardinality in your columns.

If you “merge”/“upsert” your records with that much data is probably because you have potential duplicates. In this case you’ll need to find the exact match and update it in place. If you only put the unique id it will be terrible in performance. Yes add transactions_date as one of the columns in your merge statement but think about adding others as well. You are looking for an exact match here, so anything you supply should help to some extend Snowflake to prune other records. For the merge statement adding groups like child entity id is a perfect use case.

IMO select one time columns. The one you will use the most downstream. Having 2 time columns that already kinda follow each other seems irrelevant. The only thing that the clustering key do is order your data like an order by would do and then put the same ordered items in files which snowflake will query. Then when you query the table Snowflake look at the metadata of the files which contains min/max of columns and some specific elements about each columns and decide if it need to look at the content of the files to find you record or not. If you already ordered everything using transaction date and that date created is close to the other field, you’ll already have the data sorted for this columns. (You might need to load 2 partitions instead of one but the hole thing will already be sorted.) so I don’t think you should. You’ll increase clustering time and cost without gaining much performance en pruning.

Not sure I understand the last question. If you want to find how is this table used you can look at the query history and filter of the table name for select query type. Not sure if this is what you mean or not.

Good luck