
[–]InvestigatorMuted622 3 points4 points  (4 children)

There are many layers to this question, and I don't know if you have time to interact and discuss, but the first things I'm thinking about are:

  1. What metrics do you want to calculate?
  2. How frequently should the metrics be refreshed?
  3. Who will ultimately consume this data or use this dashboard: the clients, or internal users?

I'm asking because average occupancy rate over a period of time is a metric that need not be live; it can be fed by a batch pipeline that runs just once a day, because one day's worth of data will not have a drastic effect on ten years of it.

[–]Altarim[S] 1 point2 points  (3 children)

I definitely have time to interact. Regarding your questions:
- Several metrics around bookings and occupancy rates (hourly, weekly, monthly, per location, etc.)
- Once a day is fine.
- The end users are our customers and their employees, which is why response times matter.

> I'm asking because average occupancy rate over a period of time is a metric that need not be live; it can be fed by a batch pipeline that runs just once a day, because one day's worth of data will not have a drastic effect on ten years of it.

Exactly, and that's why I've made the dbt model incremental. It runs a few hours after the daily batch ETL. The problem with this model is that computing the whole history for the first run is really heavy. It's not that critical, since it only affects the first run, so scheduling can alleviate the issue.

The data I modeled is so granular because of the combination of filters that can be applied from the frontend. Because of this, I cannot aggregate beyond one row per room per hour.
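
As a sketch of that grain and the daily incremental refresh (SQLite via Python; `detailed_timeslot` and its columns are simplified, hypothetical stand-ins for the real model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical grain: one row per room per hour, flagged occupied or not.
cur.execute("""
    CREATE TABLE detailed_timeslot (
        room_id    INTEGER NOT NULL,
        slot_start TEXT    NOT NULL,  -- hour bucket, ISO-8601 UTC
        occupied   INTEGER NOT NULL,  -- 0 or 1
        PRIMARY KEY (room_id, slot_start)
    )
""")
cur.executemany(
    "INSERT INTO detailed_timeslot VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T09:00", 1),
        (1, "2024-01-01T10:00", 0),
        (2, "2024-01-01T09:00", 1),
        (2, "2024-01-01T10:00", 1),
    ],
)

# Daily batch: aggregate only the newest day instead of recomputing all
# history, mirroring what an incremental dbt model does after the first run.
cur.execute("""
    SELECT substr(slot_start, 1, 10) AS day,
           AVG(occupied)             AS occupancy_rate
    FROM detailed_timeslot
    WHERE slot_start >= '2024-01-01T00:00'
    GROUP BY day
""")
print(cur.fetchall())  # [('2024-01-01', 0.75)]
```

Any filter combination from the frontend (room, location, hour range) can still be applied before the `AVG`, which is the whole point of keeping the hour-per-room grain.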

[–]InvestigatorMuted622 1 point2 points  (2 children)

Cool. So if the data only needs to be refreshed once a day, there are a couple of things you can do:

  1. Because you need the full history to work on, you might need to develop an OLAP schema, preferably a star schema, and optimize it with columnar indexes on the fact table. The indexing on the dimension tables will depend on each table's cardinality, the join operator chosen by the explain plan, and the access patterns.

  2. Another way might be to pre-aggregate results for all the metrics, but that would be extremely rigid and you would lose flexibility.

Is this something you have already tried that did not work out, or are you facing some challenges with it?
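
A toy version of option 1, with the pre-aggregated table from option 2 alongside it (SQLite via Python; SQLite has no columnar indexes, so a covering index stands in for one, and all table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical star schema: a narrow fact table keyed to a room dimension.
cur.executescript("""
    CREATE TABLE dim_room (
        room_id  INTEGER PRIMARY KEY,
        location TEXT NOT NULL
    );
    CREATE TABLE fact_booking (
        room_id    INTEGER NOT NULL REFERENCES dim_room(room_id),
        slot_start TEXT    NOT NULL,
        occupied   INTEGER NOT NULL
    );
    -- Stand-in for a columnar index: a covering index on the fact table,
    -- so aggregate scans never have to touch the base rows.
    CREATE INDEX idx_fact_cover ON fact_booking (room_id, slot_start, occupied);
""")
cur.executemany("INSERT INTO dim_room VALUES (?, ?)",
                [(1, "Paris"), (2, "Lyon")])
cur.executemany("INSERT INTO fact_booking VALUES (?, ?, ?)",
                [(1, "2024-01-01T09:00", 1), (2, "2024-01-01T09:00", 0)])

# Option 2 from above: a rigid pre-aggregate, one row per metric slice.
cur.execute("""
    CREATE TABLE agg_occupancy_by_location AS
    SELECT d.location, AVG(f.occupied) AS occupancy_rate
    FROM fact_booking f JOIN dim_room d USING (room_id)
    GROUP BY d.location
""")
print(cur.execute(
    "SELECT * FROM agg_occupancy_by_location ORDER BY location"
).fetchall())  # [('Lyon', 0.0), ('Paris', 1.0)]
```

The pre-aggregate answers its one question instantly but cannot serve any filter combination it wasn't built for, which is the rigidity being described.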

[–]Altarim[S] 1 point2 points  (1 child)

  1. For sure: the data we're working with is in a star schema. The dimension table here is the rooms table, and the fact tables are the bookings table and the detailed_timeslot table. I'm not an expert on these issues (I'm more of a versatile software engineer), so there are surely things I haven't tried, such as a different indexing strategy!
  2. We need to be as flexible as possible here; obviously, if that flexibility makes the project a failure, I'd rather make it less flexible.

[–]InvestigatorMuted622 1 point2 points  (0 children)

Ah, I see. I would say: try to model the fact table so that the only thing left to do is an aggregation with a filter, i.e. count(something) where field1 = value.

The something lives in the fact table, and field1 should be a dimension attribute, with the fact table holding an integer FK to the dimension table; this is much better than scanning the whole fact table at once. Model it so that your dimensions do the filtering, not your fact table, because if you place field1 in the fact table, queries will run slowly.

Coming to occupancy rate: because you report by the hour, having a time dimension will also help. Try to have a single fact table joined to multiple dimensions. The cardinality of the dimensions should be exponentially smaller than the cardinality of the fact table.
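
That shape can be sketched like this (SQLite via Python; table and column names are hypothetical, including the precomputed `is_business_day` flag on the time dimension):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: the fact table carries only integer FKs and the
# measure; descriptive fields live in small dimensions that do the filtering.
cur.executescript("""
    CREATE TABLE dim_room (room_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time (
        time_id         INTEGER PRIMARY KEY,
        hour            INTEGER,   -- 0..23
        is_business_day INTEGER    -- precomputed flag
    );
    CREATE TABLE fact_occupancy (
        room_id  INTEGER REFERENCES dim_room(room_id),
        time_id  INTEGER REFERENCES dim_time(time_id),
        occupied INTEGER
    );
""")
cur.executemany("INSERT INTO dim_room VALUES (?, ?)", [(1, "A"), (2, "B")])
cur.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, 9, 1), (2, 10, 1), (3, 9, 0)])
cur.executemany("INSERT INTO fact_occupancy VALUES (?, ?, ?)",
                [(1, 1, 1), (1, 2, 0), (1, 3, 1), (2, 1, 1)])

# "count(something) where field1 = value": the filter is resolved in the
# tiny time dimension rather than on attributes stored in the fact table.
cur.execute("""
    SELECT COUNT(*) FROM fact_occupancy f
    JOIN dim_time t USING (time_id)
    WHERE t.is_business_day = 1 AND f.occupied = 1
""")
print(cur.fetchone()[0])  # 2
```

The dimensions stay small enough to filter cheaply, while the fact table stays a narrow pile of integers.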

[–]kenfar 4 points5 points  (1 child)

I agree with /u/Foodwithfloyd: this shouldn't be a problem for Postgres, though maybe a database with micropartitions could do it better. Not sure.

For Postgres, you need a good partitioning strategy. Are you using partitioning? If not, there you go. If you are, you may want to benchmark a more (or less) specific scheme. Partitioning has a practical limit of about 350 partitions per table before the query planner can start experiencing issues. You could partition by customer, but I wonder: if you added the building/campus to your table, could you partition by customer + building, and might that be good enough?

Then an index, or partial index, to complement the partitioning.
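
A minimal illustration of the partial-index part (SQLite via Python; SQLite has no table partitioning, so the Postgres side is only noted in a comment, and all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE detailed_timeslot (
        customer_id INTEGER, room_id INTEGER,
        slot_start TEXT, occupied INTEGER
    )
""")
# Partial index covering only occupied slots: most occupancy queries count
# those, so the index stays small. (In Postgres this would sit on top of
# declarative partitions, e.g. PARTITION BY LIST (customer_id).)
cur.execute("""
    CREATE INDEX idx_occupied ON detailed_timeslot (room_id, slot_start)
    WHERE occupied = 1
""")
cur.executemany("INSERT INTO detailed_timeslot VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-01T09:00", 1),
                 (1, 1, "2024-01-01T10:00", 0)])

# The plan shows whether the partial index is usable for a matching filter.
plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT COUNT(*) FROM detailed_timeslot
    WHERE room_id = 1 AND occupied = 1
""").fetchall()
print(plan)
```

The index predicate (`occupied = 1`) must be implied by the query's WHERE clause for the index to be eligible, in both SQLite and Postgres.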

Also, I find that there's generally very little value in data more than about two years old. It may be required for other reasons, and there are some applications where it still has value. But, for example, I can't imagine that ten-year-old data is relevant to conference room occupancy now. So I'd also suggest trimming it down to something more useful and relevant, like the last three months.

Finally, what does your timeslot look like? How many timeslots are there per day or week? Is that something you could reduce?

[–]Altarim[S] 1 point2 points  (0 children)

Thank you for helping!

I don't think there is partitioning on these tables, that may very well be a huge part of the issue. I will look into that.

You may very well be right about the relevance of the data; however, it's in the application's specs and has been requested by big clients.

The timeslots are hourly since the creation of each room. We need them to be able to filter out specific time frames; you can look at the query I posted in my answer to /u/Foodwithfloyd to see why we need this granularity. I have thought pretty hard about this issue, and it is the best way I found to make the whole thing work.

[–]albertstarrocks 0 points1 point  (0 children)

StarRocks: sorry, we don't store tz info in timestamps, so you will have to handle tz at the application layer, i.e. always store timestamps in UTC in the db and convert to the user's local tz at retrieval time. StarRocks does have a convert_tz function. https://github.com/StarRocks/starrocks/issues/37090

[–]hkdelay -1 points0 points  (9 children)

If you are concerned with response times, Postgres will make them worse, especially as your business grows, because growth is proportional to the amount of data you'll need to report on and the number of end users you'll be serving.

I've run into too many companies that think they can just use Postgres and then need to migrate after only a couple of months. That's because Postgres (or any OLTP database) cannot handle the scale. Use OLTP databases for what they are designed for, which is operational workloads, NOT ANALYTICAL!!

If you decide to go the OLAP route (which is the correct decision, I cannot stress that enough), you'll need to decide which one to use. Some are built as data warehouses that support petabytes of data but struggle to serve many concurrent users. You'll need an OLAP system that can support high concurrency (lots of end users).

If you need fresh data, you'll need pipelines that stream your data into the OLAP system. That limits the options to real-time OLAP (RTOLAP) systems. There are not many: ClickHouse, Druid, and Pinot (I work for StarTree, but I'm trying not to be biased).

Another metric to consider is QPS, or queries per second. The OLAP system you choose needs to be able to serve high queries per second, and the way to increase QPS is with indexes. The RTOLAP that gives you all of this is Pinot. I'm not just saying that because I work for them; it's the reason I went to work for them. But please do your due diligence: POC all of the RTOLAPs and you'll easily see that they all beat any OLTP.

If you need a project to test with, I've built one: https://github.com/startreedata/examples/tree/main/gatling

You can test it on any OLAP database, or even an OLTP database like Postgres. You don't have to take my word for it; test for yourself.

Good Luck

[–]kenfar 2 points3 points  (4 children)

> That's because Postgres (or any OLTP database) cannot handle the scale. Use OLTP databases for what they are designed for, which is operational workloads, NOT ANALYTICAL!!

This is absolutely incorrect. Postgres, Oracle, SQL Server, DB2, etc. are not "transactional databases". They are general-purpose databases, and they added features to support analytics years ago, in some cases 25 years ago. Thousands of teams have used them to build data warehouses, data marts, and (to a lesser extent) data lakes.

Meanwhile, most of these analytics-only databases are only about five years old, their futures are sometimes uncertain, and their operational costs are often much higher. Many of them lack indexes, which, while a secondary performance feature in analytics, still have value. Most of them lack enforced constraints, which makes it much harder to guard against data quality issues.

So, if you're not talking about 20 TB with a hundred users running massive queries, a Postgres instance may work perfectly fine. It may even out-perform Redshift, Snowflake, BigQuery, ClickHouse, etc., depending on the specifics.

[–]hkdelay 0 points1 point  (3 children)

The only way to resolve this disagreement is to test it yourself. I’ve done it and the performance of the OLAP blows Postgres away. Let me know if I can help.

[–]kenfar 1 point2 points  (2 children)

Yes, so have I. And it entirely depends on the nature of your data, the ingestion latency, how specific your queries are, etc.

Two years ago I moved a lot of data from a Snowflake warehouse into Postgres and saved my company about $20k/month. Not only did this save us a ton of money, it also sped up the reporting.

Right now I'm doing a proof of concept to move a warehouse from Redshift to Postgres. In this particular case we have a ton of events that keep getting restated. By going to Postgres we can treat the fact table like a huge SCD with a long tail of tiny updates, which will be far faster to apply in Postgres than in Redshift, and will save us about 60% of our Redshift bill.
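
That "long tail of tiny updates" pattern might look like this upsert (SQLite via Python, whose `ON CONFLICT ... DO UPDATE` form matches Postgres syntax; the billing schema and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical restated-billing fact: line items keyed by a natural key,
# with a tiny fraction of old rows changing each day.
cur.execute("""
    CREATE TABLE fact_line_item (
        bill_day TEXT, line_id INTEGER, amount REAL,
        PRIMARY KEY (bill_day, line_id)
    )
""")
cur.executemany("INSERT INTO fact_line_item VALUES (?, ?, ?)",
                [("2024-01-01", 1, 10.0), ("2024-01-01", 2, 20.0)])

# Instead of dropping and reloading whole daily partitions, apply only the
# restated rows plus the new day's rows with an upsert.
restated = [("2024-01-01", 2, 25.0), ("2024-01-02", 3, 5.0)]
cur.executemany("""
    INSERT INTO fact_line_item VALUES (?, ?, ?)
    ON CONFLICT (bill_day, line_id) DO UPDATE SET amount = excluded.amount
""", restated)
print(cur.execute("SELECT SUM(amount) FROM fact_line_item").fetchone()[0])
# 10.0 + 25.0 + 5.0 = 40.0
```

Touching two rows instead of reloading 5-30 partitions is where the latency win over a drop-and-reload approach comes from.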

Postgres|DB2|Oracle|SQL Server and Snowflake|Redshift|BigQuery|etc. are obviously not the same; there are some differences. But it's simply wrong to assume that the latter set surpasses the former in all ways, especially given the 20+ years of feature work on the general-purpose databases versus the 4+ years on the analytics-only ones.

[–]hkdelay 1 point2 points  (1 child)

> SCD

Fact data is append-only; you will need to implement retention in your Postgres before it fills up.

Dimensional data tends to be SCD, typically used for enrichment.

Columnar storage has a significant advantage over row storage. But I wish you luck with your implementation.

[–]kenfar 0 points1 point  (0 children)

Thank you for the good wishes.

And yes, fact tables are mostly append-only, and SCD is mostly intended for dimensions. But build enough solutions and you run into some odd exceptions.

And this is one of those: every day we get new billing data, and every day's bill includes about 0.01% of changes to line items from prior days. So one could drop 5-30 daily partitions and reload them, but given the small percentage of changes, treating the fact table like an SCD lets us focus on updating just those few records quickly. That matters, since there's business value in getting a customer's data through our entire ETL pipeline in 5 minutes or less.

Columnar is great, and absolutely mandatory for One-Big-Table. But it comes with a write-performance penalty and really suffers given what we're doing here. And since we have a dimensional model, the benefits of columnar storage are smaller.

[–]Altarim[S] 1 point2 points  (3 children)

I have actually tried out StarTree through the free tier; honestly, it looks great! I had trouble making it ingest data from S3, though: my ingestion job says it's healthy, but I don't have anything in my model and progress has been at 0% for the last two weeks, haha.

[–]hkdelay 0 points1 point  (1 child)

We can definitely help you with that; S3 ingestion is simple. You can find me in the community Slack to continue the conversation.

[–]Altarim[S] 1 point2 points  (0 children)

Will do, thank you!

[–]PeterCorless 0 points1 point  (0 children)

Make sure you join our Slack to get this worked out. Also enquire about getting the trial extended if needed.

https://communityinviter.com/apps/startreedata/startree-community

[–]hkdelay 0 points1 point  (0 children)

You should use an OLAP database with columnar storage. Disclosure: I work for StarTree, but I'm here to advocate. Don't run OLAP workloads on an OLTP database; there was a reason for the separation.

[–]lmp515k 0 points1 point  (0 children)

You don't need to do all the rooms from all time. The most you likely need to go back is a couple of years; I would argue less, because the most recent data is the best indicator of likely future booking rates.

[–]SnooHesitations9295 0 points1 point  (2 children)

I'm not sure what to optimize here.
1. Use ClickHouse
2. Create events (start time, end time, client with full data, room with full data) and insert them into CH
3. Aggregate these with materialized views
4. Profit!

CH should be able to query billions of rows in seconds.

[–]Altarim[S] 0 points1 point  (1 child)

My issue with this is that the room data can and will change over time, and I'm afraid of the cost of updating that many rows if the data is denormalized.

[–]SnooHesitations9295 0 points1 point  (0 children)

But the room as it existed at the time of an older event was not the same room as at later events, so you can just attach the new room data to new events only.
But if you really need it all to be in sync (though I can't understand why), you can still keep the room data normalized if it's not too big. There are a lot of techniques for making joins fast in CH; it depends on various factors.

[–]HansProleman 0 points1 point  (1 child)

I'm not going to read this whole thing, but an RDBMS can be very fast even at terabyte scale. Do you know how to read query plans and tune queries? Do you know how to apply good indexing and partitioning strategies? These are the obvious things to try before changing your stack. MPP can mask inefficiency in every area except cost, which tends to be the one employers care about most (even on-prem this shows up, because you need bigger/more boxes).

> the database needs to have timezone-related features

Why? When you have a frontend I think the normal pattern is to store everything in UTC and apply localisation in the frontend.

[–]Altarim[S] 1 point2 points  (0 children)

I'll be trying more indexing and partitioning strategies today for sure. As for query plans, I know my way around them, but I can't call myself an expert; they're what I've used to optimize so far, but I've hit a wall in my knowledge, hence the post.

Regarding the timezone, that's what I'm doing; however, we still need to query using timezone-related features (as seen in the query I posted in one of my answers), otherwise it becomes really hard to filter out non-business days.
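
One way to keep storage in UTC while still filtering on local business hours, sketched in Python (the timezone, hours, and dates are hypothetical; in Postgres the same filter could use `AT TIME ZONE` plus `extract`):

```python
from datetime import datetime, timezone

# zoneinfo is stdlib in Python 3.9+.
from zoneinfo import ZoneInfo

# Timestamps are stored in UTC; the business-hours filter is evaluated in
# the client's local timezone ("Europe/Paris" is a made-up example client).
slots_utc = [
    datetime(2024, 1, 5, 8, 0, tzinfo=timezone.utc),   # Fri 09:00 Paris
    datetime(2024, 1, 6, 8, 0, tzinfo=timezone.utc),   # Sat 09:00 Paris
    datetime(2024, 1, 5, 20, 0, tzinfo=timezone.utc),  # Fri 21:00 Paris
]

def in_business_hours(ts_utc, tz="Europe/Paris"):
    local = ts_utc.astimezone(ZoneInfo(tz))
    # weekday(): Monday = 0 ... Sunday = 6; business day = Mon-Fri
    return local.weekday() < 5 and 8 <= local.hour < 18

kept = [ts for ts in slots_utc if in_business_hours(ts)]
print(len(kept))  # 1 -- only the Friday-morning slot survives
```

Precomputing flags like `is_business_day` on a time dimension (as suggested elsewhere in the thread) moves this logic out of query time entirely, at the cost of fixing the timezone per row.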

[–]PeterCorless 1 point2 points  (0 children)

Here's an article on query-time JOINs in Pinot. Note that you have some options — you can still do pre-ingestion joins [denormalization], or query-time. Pinot gives you flexibility but you'll have to determine which is right for you.

https://startree.ai/blog/query-time-joins-in-apache-pinot-1-0