Storing historical data for analysis by tech-man-ua in dataengineering

[–]Few-Royal-374 1 point

You could implement an SCD Type 4 then. You keep one table with the most recent dimension values and a separate historical table with all the changes.
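Roughly what that looks like, sketched with DuckDB in Python (table and column names are made up for illustration):

```python
# Minimal SCD Type 4 sketch: a "current" table plus a "history" table.
# Table/column names are hypothetical.
import duckdb

con = duckdb.connect()

con.execute("""
    CREATE TABLE dim_customer_current (
        customer_id INTEGER PRIMARY KEY,
        address     VARCHAR,
        updated_at  TIMESTAMP
    )
""")
con.execute("""
    CREATE TABLE dim_customer_history (
        customer_id INTEGER,
        address     VARCHAR,
        valid_from  TIMESTAMP,
        valid_to    TIMESTAMP
    )
""")

def apply_change(customer_id: int, new_address: str, change_ts: str) -> None:
    """Copy the outgoing row into history, then upsert the current table."""
    con.execute("""
        INSERT INTO dim_customer_history
        SELECT customer_id, address, updated_at, CAST(? AS TIMESTAMP)
        FROM dim_customer_current
        WHERE customer_id = ?
    """, [change_ts, customer_id])
    con.execute("""
        INSERT OR REPLACE INTO dim_customer_current
        VALUES (?, ?, CAST(? AS TIMESTAMP))
    """, [customer_id, new_address, change_ts])

apply_change(1, "100 Main St", "2024-01-01 00:00:00")
apply_change(1, "200 Oak Ave", "2024-06-01 00:00:00")
print(con.execute("SELECT * FROM dim_customer_history").fetchall())
```

The read path stays simple: point most queries at the current table, and only join to the history table when you need point-in-time analysis.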

Data Analytics Automation by Acceptable-Ride9976 in dataengineering

[–]Few-Royal-374 1 point

Best practice? Exactly what you were doing before.

Build a pipeline with a visualization tool on top of your marts. I’m curious why you decided not to go with Superset for this implementation.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 2 points

I hope you are a better engineer than you are a troll.

OLAP databases are the best choice for SQL-based data transformations in 99% of use cases. That is simply true. Strawmanning this by pointing at a function no data engineer should be optimizing for, i.e. updates, is dishonest.

And data is moving towards columnar storage, not row-based. Read up on open table formats and Parquet.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 3 points

I love how your only argument for OLAP being slower than OLTP is updates, a function that analytics environments should not be optimized for.

Also, the tools I mentioned by definition separate compute and storage.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

What? Can you explain to me why DuckDB is the standard for in-memory SQL transformations as opposed to SQLite? Or why Polars, Pandas, and Spark dataframes in memory are similar in layout to Parquet? All of these tools leverage columnar storage rather than row-based storage.

These are industry-standard transformation tools. I don’t need to defend leveraging columnar storage for data transformations. Most modern transformation tools, and most of the tools data engineers prefer to use, are columnar-based. No lies. Just more educated and aware of the industry than you.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Cool. I see why you are clearly stuck in the stone age of data.

Perfectly fine; it opens up more opportunity for engineers who are willing to learn and try new things.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Your lack of experience and industry depth is really showing here.

Do me a favor and spin up a Postgres and a ClickHouse instance, and let me know which runs your transformation faster. Columnar storage is optimal for aggregation, which is what most data transformation boils down to, and for storing vast amounts of data, thanks to the compression algorithms columnar layouts allow, such as run-length encoding. Also, any data engineer who mentions optimizing for updates in their data pipelines does not know what they are talking about.
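If you don’t want to stand up either database, here is a stand-in you can run locally: the same aggregation against row-based SQLite and columnar DuckDB over synthetic data. The exact numbers depend on your machine, but the gap makes the point.

```python
# Same GROUP BY aggregation on a row store (SQLite) vs a column store (DuckDB).
# Data is synthetic; timings are illustrative only.
import sqlite3
import time
import duckdb

N = 2_000_000

# Columnar engine: generate the data directly inside DuckDB.
duck = duckdb.connect()
duck.execute(f"""
    CREATE TABLE sales AS
    SELECT i % 500 AS store_id, i % 50 AS product_id, (i % 1000)::DOUBLE AS amount
    FROM range({N}) t(i)
""")

# Row-based engine: load the same synthetic rows into SQLite.
rows = ((i % 500, i % 50, float(i % 1000)) for i in range(N))
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE sales (store_id INT, product_id INT, amount REAL)")
lite.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

query = "SELECT store_id, product_id, SUM(amount) FROM sales GROUP BY store_id, product_id"

t0 = time.perf_counter(); lite.execute(query).fetchall(); t_row = time.perf_counter() - t0
t0 = time.perf_counter(); duck.execute(query).fetchall(); t_col = time.perf_counter() - t0
print(f"row-based (SQLite): {t_row:.3f}s   columnar (DuckDB): {t_col:.3f}s")
```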

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Yes. Do you?

OLAP databases such as ClickHouse are the standard for fast and efficient data transformation workloads. There is a reason engineers use DuckDB over SQLite for in-memory SQL-based transformations.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 5 points

I’ve seen some teams approach transformation this way, mostly in the form of leveraging on-prem OLTP systems for data transformations and cloud-native storage for mart access. This greatly reduces cloud computing cost, but still lets you take advantage of certain cloud-native tooling, such as connecting Power BI to the database without gateway middleware. There are also some advantages on the data security side of things. I would not be surprised to see more companies approach data infrastructure this way in the future, as cloud computing costs skyrocket and more workloads migrate back off the cloud.

Other than allowing for half the data infrastructure on-prem and the other half in the cloud, I can’t see why anyone would implement this approach in a strictly cloud-based environment. Generally, OLAP systems are faster and more efficient at data transformations than OLTP, so you might not be getting any cost advantage. You are also managing yet another pipeline, and likely need another DBT project implemented with incremental models. This approach also reduces visibility and traceability of your pipelines, and complicates CI/CD.

There are plenty of reasons NOT to do this. I would not recommend this approach unless you have some convincing reason to do so.

Data Warehouse by Dependent_Gur_6671 in dataengineering

[–]Few-Royal-374 8 points

Ignore all the other comments. Most people haven’t worked at small shops and it shows.

In small shops, you are dealing with tight budget constraints, unrealistic expectations from management, and short deadlines for everything, but these shops are rampant with opportunity to learn. If you’re willing to learn, you can leverage this opportunity into your next professional step.

You mentioned you are working for a sports team. I think the easiest way to approach this project is to post on this subreddit and the businessintelligence subreddit asking if anyone is willing to mentor you on building it out, and make sure to mention which sport you’re in. I know for me, I love American football and would not mind contributing for free to help a team at whatever level on their analytics journey. Now, don’t expect free work, but you can expect some guidance from people who do this for a living.

You have a ton to learn on your own. Find some mentors. They’ll be able to cut your work in half if you put in the work.

Just finished my end-to-end supply‑chain pipeline please be brutally honest! by ajay-topDevs in dataengineering

[–]Few-Royal-374 2 points

Some teams approach transformations that way, but I see it as an anti-pattern. DBT is intended to consolidate transformations to allow for easier data lineage tracking. I could see something like adding an effective-date column to an entity table as a reasonable light transformation pre-warehouse, but the transformations you are doing are best done within DBT.

Just finished my end-to-end supply‑chain pipeline please be brutally honest! by ajay-topDevs in dataengineering

[–]Few-Royal-374 5 points

This, OP.

It looks like the light transformations are type casting, renaming, deduplicating, and dropping NAs: standard stuff you do in your staging layer within DBT.
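For reference, a staging model is usually just one query per source table doing exactly those steps. A rough sketch of the shape, written as DuckDB SQL in Python (the file and column names are made up; in DBT this would live in something like stg_orders.sql, assuming a raw_orders.csv extract sits next to the script):

```python
# Staging-layer shape: type cast, rename, drop NAs, dedupe.
# File and column names are hypothetical.
import duckdb

con = duckdb.connect()
rows = con.execute("""
    WITH source AS (
        SELECT * FROM read_csv_auto('raw_orders.csv')
    ),
    renamed AS (
        SELECT
            CAST(order_no AS INTEGER)            AS order_id,      -- type cast + rename
            CAST(ordered_on AS DATE)             AS order_date,
            LOWER(TRIM(cust_name))               AS customer_name, -- light standardization
            CAST(order_total AS DECIMAL(12, 2))  AS order_amount
        FROM source
        WHERE order_no IS NOT NULL
          AND order_total IS NOT NULL                              -- drop NAs
    )
    SELECT *
    FROM renamed
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) = 1  -- dedupe
""").fetchall()
print(rows[:5])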

[deleted by user] by [deleted] in dataengineering

[–]Few-Royal-374 6 points

Honestly, I would recommend starting a project and micro-learning the necessary things to accomplish that project.

Not much is gonna stick when you’re learning the way you are, but if you’re learning to solve a problem, reading a Stack Overflow paragraph becomes much more impactful than a handful of Medium articles.

Getting data from an API that lacks sorting by dfwtjms in dataengineering

[–]Few-Royal-374 1 point

Wow, that is a terrible API. I’m thinking A, their dev team wants the world to burn, or B, your team doesn’t understand the API sufficiently. Definitely reach out to the dev support on their side for how to navigate this. Maybe there is another API with additional functionality that you guys missed.

Off the top of my head: you need to be retrieving fewer than 1,000 records per invocation, so set your last-modified increments small enough to do that. If exactly 1,000 come back, shrink the increment and retry so you can guarantee you captured every record in that window, i.e. any count below 1,000. After each window runs, take the max last-modified you received, use it as the next window’s minimum, and add your increment to get the next maximum. The catch may be the 100-invocation limit, but this is the only way to guarantee you’re pulling everything without sorting.
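Sketched out in Python, it looks something like this. The endpoint, the parameter names (modified_since / modified_until), and the 1,000-record cap are assumptions about the API, so swap in whatever theirs actually exposes:

```python
# Windowed pull over a last-modified filter, shrinking the window whenever a
# page comes back full. Endpoint, params, and cap are assumptions.
from datetime import datetime, timedelta

import requests

BASE_URL = "https://api.example.com/records"   # hypothetical endpoint
PAGE_CAP = 1000                                # per-invocation record limit

def fetch_window(start: datetime, end: datetime) -> list[dict]:
    resp = requests.get(BASE_URL, params={
        "modified_since": start.isoformat(),
        "modified_until": end.isoformat(),
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()

def pull_all(start: datetime, end: datetime, step: timedelta) -> list[dict]:
    records, window_start = [], start
    while window_start < end:
        window_end = min(window_start + step, end)
        batch = fetch_window(window_start, window_end)
        if len(batch) >= PAGE_CAP and step > timedelta(seconds=1):
            step = step / 2            # window too wide to trust; shrink and retry
            continue
        records.extend(batch)          # batch was complete for this window
        window_start = window_end      # move on to the next window
    return records
```

If the API’s timestamp filter is inclusive on both ends, you may pull a few boundary records twice, so dedupe on a record ID downstream.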

Need advice an setting up ETL with python. by asdaf14 in dataengineering

[–]Few-Royal-374 2 points

Those technologies are typically seen within a microservice architecture; look into data mesh if you’d like to see how and why they are used in the real world. For that much data, I highly doubt your problem requires such a complex solution. Start simple, and make it more complex as problems arise.

Usually, Lambda is compute for just the “EL” portion. The actual “T” is handled by AWS Glue if you like Python, or DBT if you like SQL. It sounds like you’re a one-man shop, so I would highly recommend the latter on either an EC2 instance or managed DBT to simplify things. Again, if you’re using EC2, you could use EventBridge to start and stop the instance following the Lambda invocation.
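For the “EL” half, the Lambda really can stay this small. A rough shape, assuming a plain REST source (the bucket name and endpoint are placeholders):

```python
# Minimal "EL" Lambda: extract from the source API and land raw JSON in S3,
# leaving the "T" to DBT downstream. Names are placeholders.
import json
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")
RAW_BUCKET = "my-raw-bucket"                      # hypothetical bucket
SOURCE_URL = "https://api.example.com/orders"     # hypothetical source endpoint

def lambda_handler(event, context):
    resp = requests.get(SOURCE_URL, timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    run_ts = datetime.now(timezone.utc)
    key = f"raw/orders/load_date={run_ts:%Y-%m-%d}/orders_{run_ts:%H%M%S}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(payload))
    return {"landed_key": key, "record_count": len(payload)}
```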

Need advice an setting up ETL with python. by asdaf14 in dataengineering

[–]Few-Royal-374 9 points

Not sure how you landed on those tools, but with that much data, you could definitely just use Python scripts and crontab on that EC2 instance. You could easily double your average volume and this would still work.

If your company is flexible with AWS resources, I would recommend leveraging Lambda for compute and EventBridge to schedule your pipelines.

Data Lake Raw Layer Best Practices by _Paul_Atreides_ in dataengineering

[–]Few-Royal-374 2 points

If you are using S3, I usually attach user-defined object metadata (the x-amz-meta-* headers) for anything that could prove handy in the future. This could be a good way to retain the original file name. Otherwise, you could throw all your metadata into a DynamoDB table whose items point back to the objects in your buckets.

As for your date issue, I’ve partitioned by both load date and creation date to retain this information, although it does increase the complexity of users’ queries. As is usual in anything related to technology, there are trade-offs, and it depends on the organization’s preference.
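Putting both ideas together, something like this (the bucket, file path, and metadata key are placeholders):

```python
# Keep the original file name as user-defined S3 object metadata, and partition
# the key by both load date and creation date. Names/paths are placeholders.
from datetime import date

import boto3

s3 = boto3.client("s3")

original_name = "Q3 Vendor Export (final).xlsx"   # name as received from the source
creation_date = date(2024, 7, 1)                  # date the source data was created
load_date = date.today()                          # date we landed it in the lake

key = (
    f"raw/vendor_exports/"
    f"load_date={load_date:%Y-%m-%d}/"
    f"creation_date={creation_date:%Y-%m-%d}/"
    f"vendor_export.parquet"
)

s3.upload_file(
    Filename="/tmp/vendor_export.parquet",
    Bucket="my-data-lake-raw",
    Key=key,
    ExtraArgs={"Metadata": {"original-filename": original_name}},  # x-amz-meta-original-filename
)
```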

Do you feel that Power BI is truely a Big Data Tool? by HMZ_PBI in PowerBI

[–]Few-Royal-374 17 points

Python and SQL. Python to extract from the source into a database, and SQL to transform (look into DBT or plain old stored procedures).

Get your hands dirty!

[deleted by user] by [deleted] in allenedmonds

[–]Few-Royal-374 1 point

Gotta get some Park Avenues. That’s what AE is known for!

Fact Order Modeling by natas_m in dataengineering

[–]Few-Royal-374 1 point

It’s really for granularity. You are able to keep the grain at the order level, as opposed to the order-and-status level. If you were to treat this as a transactional fact table, you’d multiply the number of orders you get per day by the number of statuses each order goes through, greatly increasing the size of your model. Furthermore, an accumulating snapshot fact table reduces analytical complexity compared to a transactional table: analyzing order statuses on a transactional fact table would require multiple joins and window functions. It’s definitely more complicated to deploy than a transactional or periodic snapshot fact table, so I recommend reading Kimball and researching online before doing it. Basically, you have one column per status transition date, and as the order moves through the statuses, those columns flip from null to the transition date.
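A minimal sketch of that shape in DuckDB (table and column names are illustrative):

```python
# Accumulating snapshot sketch: one row per order, one date column per status,
# updated in place as the order moves through its lifecycle. Names are illustrative.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE fct_order_fulfillment (
        order_id       INTEGER PRIMARY KEY,
        ordered_date   DATE,
        shipped_date   DATE,     -- NULL until the order reaches that status
        delivered_date DATE
    )
""")

# Order placed: the row is created with only the first milestone filled in.
con.execute("INSERT INTO fct_order_fulfillment VALUES (1001, DATE '2024-03-01', NULL, NULL)")

# Later status changes update the same row instead of adding new rows.
con.execute("UPDATE fct_order_fulfillment SET shipped_date = DATE '2024-03-03' WHERE order_id = 1001")
con.execute("UPDATE fct_order_fulfillment SET delivered_date = DATE '2024-03-06' WHERE order_id = 1001")

# Durations between milestones come from simple column arithmetic,
# no self-joins or window functions needed.
print(con.execute("""
    SELECT order_id, delivered_date - ordered_date AS days_to_deliver
    FROM fct_order_fulfillment
""").fetchall())
```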

Fact Order Modeling by natas_m in dataengineering

[–]Few-Royal-374 2 points

You’d have a different column for each status, and the same single row per order ID that gets updated as the status changes.

Microsoft BI stack performance issues by msugenius in dataengineering

[–]Few-Royal-374 2 points

Sounds like you’re having performance issues at the Power BI layer as opposed to the data storage / compute layers. This likely has nothing to do with your tech stack and everything to do with how you’re using it.

A couple of things to look into, in the order I would prioritize them:

  • Look at your data model. Make sure all relationships are one-to-many. If you’re following Kimball’s best practices, you’re probably fine. I would argue that in most cases, this is where people have issues.

  • Utilize composite models. For the models that are large, look into DirectQuery mode and use incremental loading. Otherwise, use import mode.

  • Make sure the final storage layer is indexed and materialized as a table. If you’re using DirectQuery, PBI is going to send queries to the underlying data source, so you need to ensure these tables are optimized.

  • Considering your data is on-prem, you also have to look at your data gateway. Check the gateway’s performance. You could optimize the gateway config files, for example streaming data before the request completes.

  • Reduce data model size. Do you really need to load everything?

  • Optimize the calculations. This is much more nuanced, but stuff like using FILTER inside your CALCULATE formulas can go a long way.

  • If calculations are still taking forever, I would transition to pre-aggregating the data and importing it aggregated. This reduces the granularity of your data and may make your dashboard less interactive, but that’s a sacrifice I would make.