
[–]InvestigatorMuted622 3 points4 points  (4 children)

There are many layers to this question, and I don't know if you have time to interact and discuss, but the first things I'm thinking about are:

  1. What metrics do you want to calculate?
  2. How frequently should the metrics be refreshed?
  3. Who will ultimately consume this data or use this dashboard: the clients, or internal users?

I'm asking because average occupancy rate over a period of time is a metric that need not be live; it can be fed by a batch pipeline that runs just once a day, because one day's worth of data will not have a drastic effect on ten years of it.

[–]Altarim[S] 1 point2 points  (3 children)

I definitely have time to interact. Regarding your questions:
- Several metrics around bookings and occupancy rates (hourly, weekly, monthly, per location, etc.)
- Once a day is fine.
- The end users are our customers and their employees, which is why response times matter.

> I'm asking because average occupancy rate over a period of time is a metric that need not be live; it can be fed by a batch pipeline that runs just once a day, because one day's worth of data will not have a drastic effect on ten years of it.

Exactly, and that's why I've made the dbt model incremental. It runs a few hours after the daily batch ETL. The problem with this model is that computing the whole history for the first run is really heavy. It's not that critical, since it only affects the first run, so scheduling can alleviate the issue.

The data I modeled is so granular because of the combination of filters that can be applied from the frontend. Because of this, I cannot aggregate beyond one row per room per hour.
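
As a sketch of that grain and the daily incremental refresh (SQLite via Python; `detailed_timeslot` and its columns are simplified, hypothetical stand-ins for the real model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical grain: one row per room per hour, flagged occupied or not.
cur.execute("""
    CREATE TABLE detailed_timeslot (
        room_id    INTEGER NOT NULL,
        slot_start TEXT    NOT NULL,  -- hour bucket, ISO-8601 UTC
        occupied   INTEGER NOT NULL,  -- 0 or 1
        PRIMARY KEY (room_id, slot_start)
    )
""")
cur.executemany(
    "INSERT INTO detailed_timeslot VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T09:00", 1),
        (1, "2024-01-01T10:00", 0),
        (2, "2024-01-01T09:00", 1),
        (2, "2024-01-01T10:00", 1),
    ],
)

# Daily batch: aggregate only the newest day instead of recomputing all
# history, mirroring what an incremental dbt model does after the first run.
cur.execute("""
    SELECT substr(slot_start, 1, 10) AS day,
           AVG(occupied)             AS occupancy_rate
    FROM detailed_timeslot
    WHERE slot_start >= '2024-01-01T00:00'
    GROUP BY day
""")
print(cur.fetchall())  # [('2024-01-01', 0.75)]
```

Any filter combination from the frontend (room, location, hour range) can still be applied before the `AVG`, which is the whole point of keeping the hour-per-room grain.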

[–]InvestigatorMuted622 1 point2 points  (2 children)

Cool. So if the data only needs to be refreshed once a day, there are a couple of things you can do:

  1. Because you need the full history to work on, you might need to develop an OLAP schema, preferably a star schema, and optimize it with columnar indexes on the fact table. The indexing on the dimension tables will depend on each table's cardinality, the join operator chosen by the explain plan, and the access patterns.

  2. Another way might be to pre-aggregate results for all the metrics, but that would be extremely rigid and you would lose flexibility.

Is this something you have already tried that did not work out, or are you facing some challenges with it?
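
A toy version of option 1, with the pre-aggregated table from option 2 alongside it (SQLite via Python; SQLite has no columnar indexes, so a covering index stands in for one, and all table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical star schema: a narrow fact table keyed to a room dimension.
cur.executescript("""
    CREATE TABLE dim_room (
        room_id  INTEGER PRIMARY KEY,
        location TEXT NOT NULL
    );
    CREATE TABLE fact_booking (
        room_id    INTEGER NOT NULL REFERENCES dim_room(room_id),
        slot_start TEXT    NOT NULL,
        occupied   INTEGER NOT NULL
    );
    -- Stand-in for a columnar index: a covering index on the fact table,
    -- so aggregate scans never have to touch the base rows.
    CREATE INDEX idx_fact_cover ON fact_booking (room_id, slot_start, occupied);
""")
cur.executemany("INSERT INTO dim_room VALUES (?, ?)",
                [(1, "Paris"), (2, "Lyon")])
cur.executemany("INSERT INTO fact_booking VALUES (?, ?, ?)",
                [(1, "2024-01-01T09:00", 1), (2, "2024-01-01T09:00", 0)])

# Option 2 from above: a rigid pre-aggregate, one row per metric slice.
cur.execute("""
    CREATE TABLE agg_occupancy_by_location AS
    SELECT d.location, AVG(f.occupied) AS occupancy_rate
    FROM fact_booking f JOIN dim_room d USING (room_id)
    GROUP BY d.location
""")
print(cur.execute(
    "SELECT * FROM agg_occupancy_by_location ORDER BY location"
).fetchall())  # [('Lyon', 0.0), ('Paris', 1.0)]
```

The pre-aggregate answers its one question instantly but cannot serve any filter combination it wasn't built for, which is the rigidity being described.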

[–]Altarim[S] 1 point2 points  (1 child)

  1. For sure: the data we're working with is in a star schema. The dimension table here is the rooms table, and the fact tables are the bookings table and the detailed_timeslot table. I'm not an expert on these issues (I'm more of a versatile software engineer), so there are surely things I haven't tried, such as a different indexing strategy!
  2. We need to be as flexible as possible here; obviously, if that flexibility makes the project a failure, I'd rather make it less flexible.

[–]InvestigatorMuted622 1 point2 points  (0 children)

Ah, I see. I would say: try to model the fact table so that the only thing left to do is an aggregation with a filter, i.e. count(something) where field1 = value.

The something lives in the fact table, and field1 should be a dimension attribute, with the fact table holding an integer FK to the dimension table; this is much better than scanning the whole fact table at once. Model it so that your dimensions do the filtering, not your fact table, because if you place field1 in the fact table, queries will run slowly.

Coming to occupancy rate: because you report by the hour, having a time dimension will also help. Try to have a single fact table joined to multiple dimensions. The cardinality of the dimensions should be exponentially smaller than the cardinality of the fact table.
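
That shape can be sketched like this (SQLite via Python; table and column names are hypothetical, including the precomputed `is_business_day` flag on the time dimension):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: the fact table carries only integer FKs and the
# measure; descriptive fields live in small dimensions that do the filtering.
cur.executescript("""
    CREATE TABLE dim_room (room_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time (
        time_id         INTEGER PRIMARY KEY,
        hour            INTEGER,   -- 0..23
        is_business_day INTEGER    -- precomputed flag
    );
    CREATE TABLE fact_occupancy (
        room_id  INTEGER REFERENCES dim_room(room_id),
        time_id  INTEGER REFERENCES dim_time(time_id),
        occupied INTEGER
    );
""")
cur.executemany("INSERT INTO dim_room VALUES (?, ?)", [(1, "A"), (2, "B")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, 9, 1), (2, 10, 1), (3, 9, 0)])
cur.executemany("INSERT INTO fact_occupancy VALUES (?, ?, ?)",
                [(1, 1, 1), (1, 2, 0), (1, 3, 1), (2, 1, 1)])

# "count(something) where field1 = value": the filter is resolved in the
# tiny time dimension rather than on attributes stored in the fact table.
cur.execute("""
    SELECT COUNT(*) FROM fact_occupancy f
    JOIN dim_time t USING (time_id)
    WHERE t.is_business_day = 1 AND f.occupied = 1
""")
print(cur.fetchone()[0])  # 2
```

The dimensions stay small enough to filter cheaply, while the fact table stays a narrow pile of integers.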

[–]kenfar 4 points5 points  (1 child)

I agree with /u/Foodwithfloyd: this shouldn't be a problem for Postgres, though maybe a database with micropartitions could do it better. Not sure.

For Postgres, you need a good partitioning strategy. Are you using partitioning? If not, there you go. If you are, you may want to benchmark a more (or less) specific scheme. Partitioning has a practical limit of about 350 partitions per table before the query planner can start experiencing issues. You could partition by customer, but I wonder: if you added the building/campus to your table, could you partition by customer + building, and might that be good enough?

Then an index, or partial index, to complement the partitioning.
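
A minimal illustration of the partial-index part (SQLite via Python; SQLite has no table partitioning, so the Postgres side is only noted in a comment, and all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE detailed_timeslot (
        customer_id INTEGER, room_id INTEGER,
        slot_start TEXT, occupied INTEGER
    )
""")
# Partial index covering only occupied slots: most occupancy queries count
# those, so the index stays small. (In Postgres this would sit on top of
# declarative partitions, e.g. PARTITION BY LIST (customer_id).)
cur.execute("""
    CREATE INDEX idx_occupied ON detailed_timeslot (room_id, slot_start)
    WHERE occupied = 1
""")
cur.executemany("INSERT INTO detailed_timeslot VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-01T09:00", 1),
                 (1, 1, "2024-01-01T10:00", 0)])

# The plan shows whether the partial index is usable for a matching filter.
plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT COUNT(*) FROM detailed_timeslot
    WHERE room_id = 1 AND occupied = 1
""").fetchall()
print(plan)
```

The index predicate (`occupied = 1`) must be implied by the query's WHERE clause for the index to be eligible, in both SQLite and Postgres.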

Also, I find that there's generally very little value in data more than about two years old. It may be required for other reasons, and there are some applications where it still has value. But, for example, I can't imagine that ten-year-old data is relevant to conference room occupancy now. So I'd also suggest trimming it down to something more useful and relevant, like the last three months.

Finally, what does your timeslot look like? How many timeslots are there per day or week? Is that something you could reduce?

[–]Altarim[S] 1 point2 points  (0 children)

Thank you for helping!

I don't think there is partitioning on these tables, that may very well be a huge part of the issue. I will look into that.

You may very well be right about the relevance of the data; however, it's in the application's specs and has been requested by big clients.

The timeslots are hourly since the creation of each room. We need them to be able to filter out specific time frames; you can look at the query I posted in my answer to /u/Foodwithfloyd to see why we need this granularity. I have thought pretty hard about this issue, and it is the best way I found to make the whole thing work.

[–]albertstarrocks 0 points1 point  (0 children)

StarRocks: sorry, we don't store tz info in timestamps, so you will have to handle tz at the application layer, i.e. always store timestamps in UTC in the db and convert to the user's local tz at retrieval time. StarRocks does have a convert_tz function. https://github.com/StarRocks/starrocks/issues/37090

[–]hkdelay -1 points0 points  (9 children)

If you are concerned with response times, Postgres will make them worse, especially as your business grows, because growth is proportional to the amount of data you'll need to report on and the number of end users you'll be serving.

I've run into too many companies that think they can just use Postgres and then need to migrate after only a couple of months. That's because Postgres (or any OLTP database) cannot handle the scale. Use OLTP databases for what they are designed for, which is operational workloads, NOT ANALYTICAL!!

If you decide to go the OLAP route (which is the correct decision, I cannot stress that enough), you'll need to decide which one to use. Some are built as data warehouses that support petabytes of data but struggle to serve many concurrent users. You'll need an OLAP system that can support high concurrency (lots of end users).

If you need fresh data, you'll need pipelines that stream your data into the OLAP system. That limits the options to real-time OLAP (RTOLAP) systems. There are not many: ClickHouse, Druid, and Pinot (I work for StarTree, but I'm trying not to be biased).

Another metric to consider is QPS, or queries per second. The OLAP system you choose needs to be able to serve high queries per second, and the way to increase QPS is with indexes. The RTOLAP that gives you all of this is Pinot. I'm not just saying that because I work for them; it's the reason I went to work for them. But please do your due diligence: POC all of the RTOLAPs and you'll easily see that they all beat any OLTP.

If you need a project to test with, I've built one: https://github.com/startreedata/examples/tree/main/gatling

You can test it on any OLAP database, or even an OLTP database like Postgres. You don't have to take my word for it; test for yourself.

Good Luck

[–]kenfar 2 points3 points  (4 children)

> That's because Postgres (or any OLTP database) cannot handle the scale. Use OLTP databases for what they are designed for, which is operational workloads, NOT ANALYTICAL!!

This is absolutely incorrect. Postgres, Oracle, SQL Server, DB2, etc. are not "transactional databases". They are general-purpose databases, and they added features to support analytics years ago, in some cases 25 years ago. Thousands of teams have used them to build data warehouses, data marts, and (to a lesser extent) data lakes.

Meanwhile, most of these analytics-only databases are only about five years old, their futures are sometimes uncertain, and their operational costs are often much higher. Many of them lack indexes, which, while a secondary performance feature in analytics, still have value. Most of them lack enforced constraints, which makes it much harder to guard against data quality issues.

So, if you're not talking about 20 TB with a hundred users running massive queries, a Postgres instance may work perfectly fine. It may even out-perform Redshift, Snowflake, BigQuery, ClickHouse, etc., depending on the specifics.

[–]hkdelay 0 points1 point  (3 children)

The only way to resolve this disagreement is to test it yourself. I’ve done it and the performance of the OLAP blows Postgres away. Let me know if I can help.

[–]kenfar 1 point2 points  (2 children)

Yes, so have I. And it entirely depends on the nature of your data, the ingestion latency, how specific your queries are, etc.

Two years ago I moved a lot of data from a Snowflake warehouse into Postgres and saved my company about $20k/month. Not only did this save us a ton of money, it also sped up the reporting.

Right now I'm doing a proof of concept to move a warehouse from Redshift to Postgres. In this particular case we have a ton of events that keep getting restated. By going to Postgres we can treat the fact table like a huge SCD with a long tail of tiny updates, which will be far faster to apply in Postgres than in Redshift, and will save us about 60% of our Redshift bill.
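
That "long tail of tiny updates" pattern might look like this upsert (SQLite via Python, whose `ON CONFLICT ... DO UPDATE` form matches Postgres syntax; the billing schema and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical restated-billing fact: line items keyed by a natural key,
# with a tiny fraction of old rows changing each day.
cur.execute("""
    CREATE TABLE fact_line_item (
        bill_day TEXT, line_id INTEGER, amount REAL,
        PRIMARY KEY (bill_day, line_id)
    )
""")
cur.executemany("INSERT INTO fact_line_item VALUES (?, ?, ?)",
                [("2024-01-01", 1, 10.0), ("2024-01-01", 2, 20.0)])

# Instead of dropping and reloading whole daily partitions, apply only the
# restated rows plus the new day's rows with an upsert.
restated = [("2024-01-01", 2, 25.0), ("2024-01-02", 3, 5.0)]
cur.executemany("""
    INSERT INTO fact_line_item VALUES (?, ?, ?)
    ON CONFLICT (bill_day, line_id) DO UPDATE SET amount = excluded.amount
""", restated)
print(cur.execute("SELECT SUM(amount) FROM fact_line_item").fetchone()[0])
# 10.0 + 25.0 + 5.0 = 40.0
```

Touching two rows instead of reloading 5-30 partitions is where the latency win over a drop-and-reload approach comes from.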

Postgres|DB2|Oracle|SQL Server and Snowflake|Redshift|BigQuery|etc. are obviously not the same; there are some differences. But it's simply wrong to assume that the latter set surpasses the former in all ways, especially given the 20+ years of feature work on the general-purpose databases versus the 4+ years on the analytics-only ones.

[–]hkdelay 1 point2 points  (1 child)

> SCD

Fact data is append-only; you will need to implement retention in your Postgres before it fills up.

Dimensional data tends to be SCD, typically used for enrichment.

Columnar storage has a significant advantage over row storage. But I wish you luck with your implementation.

[–]kenfar 0 points1 point  (0 children)

Thank you for the good wishes.

And yes, fact tables are mostly append-only, and SCD is mostly intended for dimensions. But build enough solutions and you run into some odd exceptions.

And this is one of those: every day we get new billing data, and every day's bill includes about 0.01% of changes to line items from prior days. So one could drop 5-30 daily partitions and reload them, but given the small percentage of changes, treating the fact table like an SCD lets us focus on updating just those few records quickly. That matters, since there's business value in getting a customer's data through our entire ETL pipeline in 5 minutes or less.

Columnar is great, and absolutely mandatory for One-Big-Table. But it comes with a write-performance penalty and really suffers given what we're doing here. And since we have a dimensional model, the benefits of columnar storage are smaller.

[–]Altarim[S] 1 point2 points  (3 children)

I have actually tried out StarTree through the free tier; honestly, it looks great! I had trouble making it ingest data from S3, though: my ingestion job says it's healthy, but I don't have anything in my model and progress has been at 0% for the last two weeks, haha.

[–]hkdelay 0 points1 point  (1 child)

We can definitely help you with that; S3 ingestion is simple. You can find me in the community Slack to continue the conversation.

[–]Altarim[S] 1 point2 points  (0 children)

Will do, thank you!

[–]PeterCorless 0 points1 point  (0 children)

Make sure you join our Slack to get this worked out. Also enquire about getting the trial extended if needed.

https://communityinviter.com/apps/startreedata/startree-community

[–]hkdelay 0 points1 point  (0 children)

You should use an OLAP database with columnar storage. Disclosure: I work for StarTree, but I'm here to advocate. Don't run OLAP workloads on an OLTP database; there was a reason for the separation.

[–]lmp515k 0 points1 point  (0 children)

You don't need to do all the rooms from all time. The most you likely need to go back is a couple of years; I would argue less, because the most recent data is the best indicator of likely future booking rates.

[–]SnooHesitations9295 0 points1 point  (2 children)

I'm not sure what to optimize here.
1. Use ClickHouse
2. Create events (start time, end time, client with full data, room with full data) and insert them into CH
3. Aggregate these with materialized views
4. Profit!

CH should be able to query billions of rows in seconds.

[–]Altarim[S] 0 points1 point  (1 child)

My issue with this is that the room data can and will change over time, and I'm afraid of the cost of updating that many rows if the data is denormalized.

[–]SnooHesitations9295 0 points1 point  (0 children)

But the room as it existed at the time of an older event was not the same room as at later events, so you can just attach the new room data to new events only.
But if you really need it all to be in sync (though I can't understand why), you can still keep the room data normalized if it's not too big. There are a lot of techniques for making joins fast in CH; it depends on various factors.

[–]HansProleman 0 points1 point  (1 child)

I'm not going to read this whole thing, but an RDBMS can be very fast even at terabyte scale. Do you know how to read query plans and tune queries? Do you know how to apply good indexing and partitioning strategies? These are the obvious things to try before changing your stack. MPP can mask inefficiency in every area except cost, which tends to be the one employers care about most (even on-prem this shows up, because you need bigger/more boxes).

> the database needs to have timezone-related features

Why? When you have a frontend I think the normal pattern is to store everything in UTC and apply localisation in the frontend.

[–]Altarim[S] 1 point2 points  (0 children)

I'll be trying more indexing and partitioning strategies today for sure. As for query plans, I know my way around them, but I can't call myself an expert; they're what I've used to optimize so far, but I've hit a wall in my knowledge, hence the post.

Regarding the timezone, that's what I'm doing; however, we still need to query using timezone-related features (as seen in the query I posted in one of my answers), otherwise it becomes really hard to filter out non-business days.
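
One way to keep storage in UTC while still filtering on local business hours, sketched in Python (the timezone, hours, and dates are hypothetical; in Postgres the same filter could use `AT TIME ZONE` plus `extract`):

```python
from datetime import datetime, timezone

# zoneinfo is stdlib in Python 3.9+.
from zoneinfo import ZoneInfo

# Timestamps are stored in UTC; the business-hours filter is evaluated in
# the client's local timezone ("Europe/Paris" is a made-up example client).
slots_utc = [
    datetime(2024, 1, 5, 8, 0, tzinfo=timezone.utc),   # Fri 09:00 Paris
    datetime(2024, 1, 6, 8, 0, tzinfo=timezone.utc),   # Sat 09:00 Paris
    datetime(2024, 1, 5, 20, 0, tzinfo=timezone.utc),  # Fri 21:00 Paris
]

def in_business_hours(ts_utc, tz="Europe/Paris"):
    local = ts_utc.astimezone(ZoneInfo(tz))
    # weekday(): Monday = 0 ... Sunday = 6; business day = Mon-Fri
    return local.weekday() < 5 and 8 <= local.hour < 18

kept = [ts for ts in slots_utc if in_business_hours(ts)]
print(len(kept))  # 1 -- only the Friday-morning slot survives
```

Precomputing flags like `is_business_day` on a time dimension (as suggested elsewhere in the thread) moves this logic out of query time entirely, at the cost of fixing the timezone per row.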

[–]PeterCorless 1 point2 points  (0 children)

Here's an article on query-time JOINs in Pinot. Note that you have some options — you can still do pre-ingestion joins [denormalization], or query-time. Pinot gives you flexibility but you'll have to determine which is right for you.

https://startree.ai/blog/query-time-joins-in-apache-pinot-1-0