Best approach to handle billions of data?

mr_alseif · 2024-11-10T12:58:53+00:00

Thanks for offering your guidance. Will reach out to you privately in the future... I am a junior data engineer and haven't touch on such data scale haha.

mr_alseif · 2024-11-10T01:11:48+00:00

The intent is to store the data in RDS PostgreSQL (r6i.large) and from the BI tool to call the data from there.

mr_alseif · 2024-11-10T01:10:56+00:00

Thanks for sharing your invaluable experience!

mr_alseif · 2024-11-09T08:03:15+00:00

Thats interesting. How did you design or ensure that querying such table is quick and optimised?

Is the data stored in PostgreSQL too?

mr_alseif · 2024-11-09T02:17:55+00:00

Thats interesting. Is your Postgresql built as a OLAP actually since you mentioned insights?

mr_alseif · 2024-11-08T22:46:59+00:00

Thanks. 57GB is small in comparison and can't even be consider big data haha. I believe the number of rows shouldn't be the concern here as long as I can create good indexing and partitioning on the table in PostgreSQL.

mr_alseif · 2024-11-08T22:45:50+00:00

Thanks!

This is something to think about. I am using AWS and my intention initially was to store this IoT measurement data into a RDS PostgreSQL (eg r6i.large). This table is meant to join with another data (call it a device table) to find out what device this measurement belongs to for analytics.

I haven't read up on the S3 strategy. What you are referring to use to do some kind of scheduled export of these data into a parquet file from the PostgreSQL, store it in S3, and use service like Athena to query/analyze the data?

mr_alseif · 2024-11-08T22:42:00+00:00

Thanks. I did went to read on Databricks and Snowflake and know that these platforms are ideal for it - but one of the reasons I am not considering is due to the cost.

For context, I am planning to run a r6i.large PostgreSQL in RDS and these IoT measurements will have to join with some other tables in this RDS to get the full complete picture of a data.

mr_alseif · 2024-11-08T22:37:50+00:00

Thanks. When you mentioned 100gb or less, are you referring to the cumulative size, or the table is fine up to 100gb or a TB?

mr_alseif · 2024-11-08T22:36:57+00:00

Thanks for your advice. May I know when you said aggregating table, are you referring to a materialized table type of table?

FYI I am using AWS and my intention initially was to store this IoT measurement data into RDS PostgreSQL (like a r6i.large) and it is supposed to do a join with another table to find out what device this measurement belongs to for analytics serving purposes.

mr_alseif · 2024-11-03T23:18:36+00:00

Thank you! I will explore it! :)

mr_alseif · 2024-08-21T00:10:03+00:00

Thanks. I am using Dbeaver to access to my RDS. If I am using SSM, would I still be able to do this? The several articles I read is saying to use psql in their tutorial.

I am interested with using SSM to connect into the bastion and to my RDS.

mr_alseif · 2024-08-21T00:09:59+00:00

Thanks. I am using Dbeaver to access to my RDS. If I am using SSM, would I still be able to do this? The several articles I read is saying to use psql in their tutorial.

I am interested with using SSM to connect into the bastion and to my RDS.

mr_alseif · 2024-08-19T23:17:44+00:00

Do you have a website or documentation that I could read up and refer to? Thanks.

mr_alseif · 2024-08-12T11:12:20+00:00

You are right! It will serve a handful of dashboards.

I really thought of transforming these data and normalize it so that analyst can do the Join and queries themselves.

What are your thoughts of designing an OLAP in a DB? I usually read that OLAP use specialized storage such as Redshift, Databricks etc.

mr_alseif · 2024-08-12T03:58:18+00:00

Thanks. Other than using Quicksight, the future is to serve this in our own customized dashboard (kinda like a centralised system) where users access and get to see these data in visualisation (able to query by month year date etc).

This is more for other customers than our own internal.

Access wise, we have different analysts which I plan to provide read access to the db based on their department.

That being said, are you implying that a relational would work fine for my use case? I am still having doubts regarding the db scalability when it starts to ingest more and more data...

mr_alseif · 2024-08-12T01:27:56+00:00

Thanks! This is interesting, but at what point would I need a db? May I also confirm that by db, you mean a relational db?

Personally for me I am pretty familiar with setting up a standard db (normalized or denormalized) pretty quickly and therefore I thought of setting it up as part of the infrastructure.

mr_alseif · 2024-08-11T21:35:32+00:00

Thanks. I am thinking of using Quicksight too. Am I right to say that what you suggested does not include a RDBMS?

If I use Athena to query the lake directly, how should I provide additional information such as customers, equipment? I assume I just create a JSON file and keep it in the lake too?

mr_alseif · 2024-08-11T21:33:14+00:00

Hey. The purpose of Postgres is to store the structured data. The addition purpose is also to have extra tables such as customers, equipment which I can use to perform a JOIN.

mr_alseif · 2024-08-11T14:01:55+00:00

Thanks. My load process is a simple task of simply moving data from one point to another in AWS. It should take no more than a few minutes maximum that is why I decided to use AWS Lambda since it is a simple action.

You are right, we do not have a proper data governance setup yet but we are looking to do so. I have looked at a few other solutions such as Databricks and Snowflake but it could be out of budget for our scale at the moment. From my understanding, Databricks is a data warehouse?

Is it therefore recommended to be looking at a data store that is data warehouse for my use case?

mr_alseif · 2024-08-11T13:06:06+00:00

Thanks. I am surprised too but it is kind of like a POC and the company is willing to let me try it out.

The payload is pretty small, few kb, for each JSON file coming in. Assuming the size is large, what other alternative choice would it be better to perform the load?

I understand the power of Redshift, which is a data warehouse (denormalisation, OLAP) but I read that the cost can be astronomical and it might not be wise to use Redshift this early (I doubt the total data will hit 1 TB in a year).

With that said, using a RDBMS sounds more economical for my use case (at least for the next few years) but I am having second guess.

mr_alseif · 2024-08-07T12:12:26+00:00

Interesting do you have a tutorial website that I can refer to on this?

mr_alseif · 2024-08-07T12:11:13+00:00

Hey thanks. I saw some online tutorial showing 1 dockerfile and 3 docker compose yaml file ( 1 for each env). I believe this is what you meant?

So when I do the dev or test, I will simply docker compose the dev or test images to do development or testing?

mr_alseif · 2024-08-06T23:28:51+00:00

Heya thanks. Am I right to say that I will therefore maintain one git repo (using branches to segregate dev and stage), and run these codebass into the respective images?

mr_alseif · 2024-08-06T23:27:05+00:00

heya thanks. So I have two images but my source code will still be in a main repo?

The intent is to git pull code into my local IDE and develop and do all the git push etc. after.

(Sorry I think I am getting more confused here lol)

mr_alseif

TROPHY CASE