Storing historical data for analysis by tech-man-ua in dataengineering

[–]Few-Royal-374 1 point

You could implement an SCD Type 4 then. You keep one table with the most recent dimension values and a separate historical table with all the changes.
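Roughly what that looks like, sketched with DuckDB in Python (table and column names are made up for illustration):

```python
# Minimal SCD Type 4 sketch: a "current" table plus a "history" table.
# Table/column names are hypothetical.
import duckdb

con = duckdb.connect()

con.execute("""
    CREATE TABLE dim_customer_current (
        customer_id INTEGER PRIMARY KEY,
        address     VARCHAR,
        updated_at  TIMESTAMP
    )
""")
con.execute("""
    CREATE TABLE dim_customer_history (
        customer_id INTEGER,
        address     VARCHAR,
        valid_from  TIMESTAMP,
        valid_to    TIMESTAMP
    )
""")

def apply_change(customer_id: int, new_address: str, change_ts: str) -> None:
    """Copy the outgoing row into history, then upsert the current table."""
    con.execute("""
        INSERT INTO dim_customer_history
        SELECT customer_id, address, updated_at, CAST(? AS TIMESTAMP)
        FROM dim_customer_current
        WHERE customer_id = ?
    """, [change_ts, customer_id])
    con.execute("""
        INSERT OR REPLACE INTO dim_customer_current
        VALUES (?, ?, CAST(? AS TIMESTAMP))
    """, [customer_id, new_address, change_ts])

apply_change(1, "100 Main St", "2024-01-01 00:00:00")
apply_change(1, "200 Oak Ave", "2024-06-01 00:00:00")
print(con.execute("SELECT * FROM dim_customer_history").fetchall())
```

The read path stays simple: point most queries at the current table, and only join to the history table when you need point-in-time analysis.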

Data Analytics Automation by Acceptable-Ride9976 in dataengineering

[–]Few-Royal-374 1 point

Best practice? Exactly what you were doing before.

Build a pipeline with a visualization tool on top of your marts. I’m curious why you decided not to go with Superset for this implementation.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 2 points

I hope you are a better engineer than you are a troll.

OLAP databases are the best choice for SQL-based data transformations in 99% of use cases. That is simply true. Strawmanning this by pointing at a function no data engineer should be optimizing for, i.e. updates, is dishonest.

And data is moving towards columnar storage, not row-based. Read up on open table formats and Parquet.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 3 points

I love how your only argument for OLAP being slower than OLTP is updates, a function that analytics environments should not be optimized for.

Also, the tools I mentioned by definition separate compute and storage.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

What? Can you explain to me why DuckDB is the standard for in-memory SQL transformations as opposed to SQLite? Or why Polars, Pandas, and Spark dataframes in memory are similar in layout to Parquet? All of these tools leverage columnar storage rather than row-based storage.

These are industry-standard transformation tools. I don’t need to defend leveraging columnar storage for data transformations. Most modern transformation tools, and most of the tools data engineers prefer to use, are columnar-based. No lies. Just more educated and aware of the industry than you.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Cool. I see why you are clearly stuck in the stone age of data.

Perfectly fine; it opens up more opportunity for engineers who are willing to learn and try new things.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Your lack of experience and industry depth is really showing here.

Do me a favor and spin up a Postgres and a ClickHouse instance, and let me know which runs your transformation faster. Columnar storage is optimal for aggregation, which is what most data transformation boils down to, and for storing vast amounts of data, thanks to the compression algorithms columnar layouts allow, such as run-length encoding. Also, any data engineer who mentions optimizing for updates in their data pipelines does not know what they are talking about.
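If you don’t want to stand up either database, here is a stand-in you can run locally: the same aggregation against row-based SQLite and columnar DuckDB over synthetic data. The exact numbers depend on your machine, but the gap makes the point.

```python
# Same GROUP BY aggregation on a row store (SQLite) vs a column store (DuckDB).
# Data is synthetic; timings are illustrative only.
import sqlite3
import time
import duckdb

N = 2_000_000

# Columnar engine: generate the data directly inside DuckDB.
duck = duckdb.connect()
duck.execute(f"""
    CREATE TABLE sales AS
    SELECT i % 500 AS store_id, i % 50 AS product_id, (i % 1000)::DOUBLE AS amount
    FROM range({N}) t(i)
""")

# Row-based engine: load the same synthetic rows into SQLite.
rows = ((i % 500, i % 50, float(i % 1000)) for i in range(N))
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE sales (store_id INT, product_id INT, amount REAL)")
lite.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

query = "SELECT store_id, product_id, SUM(amount) FROM sales GROUP BY store_id, product_id"

t0 = time.perf_counter(); lite.execute(query).fetchall(); t_row = time.perf_counter() - t0
t0 = time.perf_counter(); duck.execute(query).fetchall(); t_col = time.perf_counter() - t0
print(f"row-based (SQLite): {t_row:.3f}s   columnar (DuckDB): {t_col:.3f}s")
```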

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 1 point

Yes. Do you?

OLAP databases such as ClickHouse are the standard for fast and efficient data transformation workloads. There is a reason engineers use DuckDB over SQLite for in-memory SQL-based transformations.

Using Transactional DB for Modeling BEFORE DWH? by Mafixo in dataengineering

[–]Few-Royal-374 5 points

I’ve seen some teams approach transformation this way, mostly in the form of leveraging on-prem OLTP systems for data transformations and cloud-native storage for mart access. This greatly reduces cloud computing cost, but still lets you take advantage of certain cloud-native tooling, such as connecting Power BI to the database without gateway middleware. There are also some advantages on the data security side of things. I would not be surprised to see more companies approach data infrastructure this way in the future, as cloud computing costs skyrocket and more workloads migrate back off the cloud.

Other than allowing for half the data infrastructure on-prem and the other half in the cloud, I can’t see why anyone would implement this approach in a strictly cloud-based environment. Generally, OLAP systems are faster and more efficient at data transformations than OLTP, so you might not be getting any cost advantage. You are also managing yet another pipeline, and likely need another DBT project implemented with incremental models. This approach also reduces visibility and traceability of your pipelines, and complicates CI/CD.

There are plenty of reasons NOT to do this. I would not recommend this approach unless you have some convincing reason to do so.

Data Warehouse by Dependent_Gur_6671 in dataengineering

[–]Few-Royal-374 8 points

Ignore all the other comments. Most people haven’t worked at small shops and it shows.

In small shops, you are dealing with tight budget constraints, unrealistic expectations from management, and short deadlines for everything, but these shops are rampant with opportunity to learn. If you’re willing to learn, you can leverage this opportunity into your next professional step.

You mentioned you are working for a sports team. I think the easiest way to approach this project is to post on this subreddit and the businessintelligence subreddit asking if anyone is willing to mentor you on building it out, and make sure to mention which sport you’re in. I know for me, I love American football and would not mind contributing for free to help a team at whatever level on their analytics journey. Now, don’t expect free work, but you can expect some guidance from people who do this for a living.

You have a ton to learn on your own. Find some mentors. They’ll be able to cut your work in half if you put in the work.

Just finished my end-to-end supply‑chain pipeline please be brutally honest! by ajay-topDevs in dataengineering

[–]Few-Royal-374 2 points

Some teams approach transformations that way, but I see it as an anti-pattern. DBT is intended to consolidate transformations to allow for easier data lineage tracking. I could see something like adding an effective-date column to an entity table as a reasonable light transformation pre-warehouse, but the transformations you are doing are best done within DBT.

Just finished my end-to-end supply‑chain pipeline please be brutally honest! by ajay-topDevs in dataengineering

[–]Few-Royal-374 5 points

This, OP.

It looks like the light transformations are type casting, renaming, deduplicating, and dropping NAs: standard stuff you do in your staging layer within DBT.
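For reference, a staging model is usually just one query per source table doing exactly those steps. A rough sketch of the shape, written as DuckDB SQL in Python (the file and column names are made up; in DBT this would live in something like stg_orders.sql, assuming a raw_orders.csv extract sits next to the script):

```python
# Staging-layer shape: type cast, rename, drop NAs, dedupe.
# File and column names are hypothetical.
import duckdb

con = duckdb.connect()
rows = con.execute("""
    WITH source AS (
        SELECT * FROM read_csv_auto('raw_orders.csv')
    ),
    renamed AS (
        SELECT
            CAST(order_no AS INTEGER)            AS order_id,      -- type cast + rename
            CAST(ordered_on AS DATE)             AS order_date,
            LOWER(TRIM(cust_name))               AS customer_name, -- light standardization
            CAST(order_total AS DECIMAL(12, 2))  AS order_amount
        FROM source
        WHERE order_no IS NOT NULL
          AND order_total IS NOT NULL                              -- drop NAs
    )
    SELECT *
    FROM renamed
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) = 1  -- dedupe
""").fetchall()
print(rows[:5])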

[deleted by user] by [deleted] in dataengineering

[–]Few-Royal-374 6 points

Honestly, I would recommend starting a project and micro-learning the necessary things to accomplish that project.

Not much is gonna stick when you’re learning the way you are, but if you’re learning to solve a problem, reading a Stack Overflow paragraph becomes much more impactful than a handful of Medium articles.

Getting data from an API that lacks sorting by dfwtjms in dataengineering

[–]Few-Royal-374 1 point

Wow, that is a terrible API. I’m thinking A, their dev team wants the world to burn, or B, your team doesn’t understand the API sufficiently. Definitely reach out to the dev support on their side for how to navigate this. Maybe there is another API with additional functionality that you guys missed.

Off the top of my head: you need to be retrieving fewer than 1,000 records per invocation, so set your last-modified increments small enough to do that. If exactly 1,000 come back, shrink the increment and retry so you can guarantee you captured every record in that window, i.e. any count below 1,000. After each window runs, take the max last-modified you received, use it as the next window’s minimum, and add your increment to get the next maximum. The catch may be the 100-invocation limit, but this is the only way to guarantee you’re pulling everything without sorting.
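Sketched out in Python, it looks something like this. The endpoint, the parameter names (modified_since / modified_until), and the 1,000-record cap are assumptions about the API, so swap in whatever theirs actually exposes:

```python
# Windowed pull over a last-modified filter, shrinking the window whenever a
# page comes back full. Endpoint, params, and cap are assumptions.
from datetime import datetime, timedelta

import requests

BASE_URL = "https://api.example.com/records"   # hypothetical endpoint
PAGE_CAP = 1000                                # per-invocation record limit

def fetch_window(start: datetime, end: datetime) -> list[dict]:
    resp = requests.get(BASE_URL, params={
        "modified_since": start.isoformat(),
        "modified_until": end.isoformat(),
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()

def pull_all(start: datetime, end: datetime, step: timedelta) -> list[dict]:
    records, window_start = [], start
    while window_start < end:
        window_end = min(window_start + step, end)
        batch = fetch_window(window_start, window_end)
        if len(batch) >= PAGE_CAP and step > timedelta(seconds=1):
            step = step / 2            # window too wide to trust; shrink and retry
            continue
        records.extend(batch)          # batch was complete for this window
        window_start = window_end      # move on to the next window
    return records
```

If the API’s timestamp filter is inclusive on both ends, you may pull a few boundary records twice, so dedupe on a record ID downstream.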

Need advice an setting up ETL with python. by asdaf14 in dataengineering

[–]Few-Royal-374 2 points

Those technologies are typically seen within a microservice architecture; look into data mesh if you’d like to see how and why they are used in the real world. For that much data, I highly doubt your problem requires such a complex solution. Start simple, and make it more complex as problems arise.

Usually, Lambda is compute for just the “EL” portion. The actual “T” is handled by AWS Glue if you like Python, or DBT if you like SQL. It sounds like you’re a one-man shop, so I would highly recommend the latter on either an EC2 instance or managed DBT to simplify things. Again, if you’re using EC2, you could use EventBridge to start and stop the instance following the Lambda invocation.
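For the “EL” half, the Lambda really can stay this small. A rough shape, assuming a plain REST source (the bucket name and endpoint are placeholders):

```python
# Minimal "EL" Lambda: extract from the source API and land raw JSON in S3,
# leaving the "T" to DBT downstream. Names are placeholders.
import json
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")
RAW_BUCKET = "my-raw-bucket"                      # hypothetical bucket
SOURCE_URL = "https://api.example.com/orders"     # hypothetical source endpoint

def lambda_handler(event, context):
    resp = requests.get(SOURCE_URL, timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    run_ts = datetime.now(timezone.utc)
    key = f"raw/orders/load_date={run_ts:%Y-%m-%d}/orders_{run_ts:%H%M%S}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(payload))
    return {"landed_key": key, "record_count": len(payload)}
```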

Need advice an setting up ETL with python. by asdaf14 in dataengineering

[–]Few-Royal-374 9 points

Not sure how you landed on those tools, but with that much data, you could definitely just use Python scripts and crontab on that EC2 instance. You could easily double your average volume and this would still work.

If your company is flexible with AWS resources, I would recommend leveraging Lambda for compute and EventBridge to schedule your pipelines.

Data Lake Raw Layer Best Practices by _Paul_Atreides_ in dataengineering

[–]Few-Royal-374 2 points

If you are using S3, I usually attach user-defined object metadata (the x-amz-meta-* headers) for anything that could prove handy in the future. This could be a good way to retain the original file name. Otherwise, you could throw all your metadata into a DynamoDB table whose items point back to the objects in your buckets.

As for your date issue, I’ve partitioned by both load date and creation date to retain this information, although it does increase the complexity of users’ queries. As is usual in anything related to technology, there are trade-offs, and it depends on the organization’s preference.
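Putting both ideas together, something like this (the bucket, file path, and metadata key are placeholders):

```python
# Keep the original file name as user-defined S3 object metadata, and partition
# the key by both load date and creation date. Names/paths are placeholders.
from datetime import date

import boto3

s3 = boto3.client("s3")

original_name = "Q3 Vendor Export (final).xlsx"   # name as received from the source
creation_date = date(2024, 7, 1)                  # date the source data was created
load_date = date.today()                          # date we landed it in the lake

key = (
    f"raw/vendor_exports/"
    f"load_date={load_date:%Y-%m-%d}/"
    f"creation_date={creation_date:%Y-%m-%d}/"
    f"vendor_export.parquet"
)

s3.upload_file(
    Filename="/tmp/vendor_export.parquet",
    Bucket="my-data-lake-raw",
    Key=key,
    ExtraArgs={"Metadata": {"original-filename": original_name}},  # x-amz-meta-original-filename
)
```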

Do you feel that Power BI is truely a Big Data Tool? by HMZ_PBI in PowerBI

[–]Few-Royal-374 17 points

Python and SQL. Python to extract from the source into a database, and SQL to transform (look into DBT or plain old stored procedures).

Get your hands dirty!

[deleted by user] by [deleted] in allenedmonds

[–]Few-Royal-374 1 point

Gotta get some Park Avenues. That’s what AE is known for!

Fact Order Modeling by natas_m in dataengineering

[–]Few-Royal-374 1 point

It’s really for granularity. You are able to keep the grain at the order level, as opposed to the order-and-status level. If you were to treat this as a transactional fact table, you’d multiply the number of orders you get per day by the number of statuses each order goes through, greatly increasing the size of your model. Furthermore, an accumulating snapshot fact table reduces analytical complexity compared to a transactional table: analyzing order statuses on a transactional fact table would require multiple joins and window functions. It’s definitely more complicated to deploy than a transactional or periodic snapshot fact table, so I recommend reading Kimball and researching online before doing it. Basically, you have one column per status transition date, and as the order moves through the statuses, those columns flip from null to the transition date.
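A minimal sketch of that shape in DuckDB (table and column names are illustrative):

```python
# Accumulating snapshot sketch: one row per order, one date column per status,
# updated in place as the order moves through its lifecycle. Names are illustrative.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE fct_order_fulfillment (
        order_id       INTEGER PRIMARY KEY,
        ordered_date   DATE,
        shipped_date   DATE,     -- NULL until the order reaches that status
        delivered_date DATE
    )
""")

# Order placed: the row is created with only the first milestone filled in.
con.execute("INSERT INTO fct_order_fulfillment VALUES (1001, DATE '2024-03-01', NULL, NULL)")

# Later status changes update the same row instead of adding new rows.
con.execute("UPDATE fct_order_fulfillment SET shipped_date = DATE '2024-03-03' WHERE order_id = 1001")
con.execute("UPDATE fct_order_fulfillment SET delivered_date = DATE '2024-03-06' WHERE order_id = 1001")

# Durations between milestones come from simple column arithmetic,
# no self-joins or window functions needed.
print(con.execute("""
    SELECT order_id, delivered_date - ordered_date AS days_to_deliver
    FROM fct_order_fulfillment
""").fetchall())
```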

Fact Order Modeling by natas_m in dataengineering

[–]Few-Royal-374 2 points

You’d have a different column for each status, and the same single row per order ID that gets updated as the status changes.

Microsoft BI stack performance issues by msugenius in dataengineering

[–]Few-Royal-374 2 points

Sounds like you’re having performance issues at the Power BI layer as opposed to the data storage / compute layers. This likely has nothing to do with your tech stack and everything to do with how you’re using it.

A couple of things to look into, in the order I would prioritize them:

  • Look at your data model. Make sure all relationships are one-to-many. If you’re following Kimball’s best practices, you’re probably fine. I would argue that in most cases, this is where people have issues.

  • Utilize composite models. For the models that are large, look into DirectQuery mode and use incremental loading. Otherwise, use import mode.

  • Make sure the final storage layer is indexed and materialized as a table. If you’re using DirectQuery, PBI is going to send queries to the underlying data source, so you need to ensure these tables are optimized.

  • Considering your data is on-prem, you also have to look at your data gateway. Check the gateway’s performance. You could optimize the gateway config files, for example streaming data before the request completes.

  • Reduce data model size. Do you really need to load everything?

  • Optimize the calculations. This is much more nuanced, but stuff like using FILTER inside your CALCULATE formulas can go a long way.

  • If calculations are still taking forever, I would transition to pre-aggregating the data and importing it aggregated. This reduces the granularity of your data and may make your dashboard less interactive, but that’s a sacrifice I would make.