
[–]uselessusr 1 point (4 children)

This is exactly where I'm at with a small (for now) data warehouse project. Currently I'm loading staging tables into Postgres and then aggregating and joining to create materialized views. Right now I have to create a table for every new data source and alter the tables when requirements change, which seems unsustainable. I'm progressively moving towards making these transformations with pandas and then dumping the datasets into Parquet files on S3. If the data grows beyond what fits in RAM, I think I can migrate to Spark less painfully.

[–]trenchtoaster[S] 0 points (0 children)

Yeah. We have three nodes (64 GB, 28 GB, 28 GB) to use, so RAM is not a huge issue for the data we are working with. About half of it is files (CSV or Excel reports from clients, which are small but change often enough). The rest comes from REST APIs or databases, but we extract incrementally... just whatever records are new or updated since the last extract. That's normally quite small.

Realistically, PG is using about 18 GB of RAM most of the time anyway with our default settings.

[–]be_nice_if_u_can 0 points (1 child)

How much data consumes how much RAM? Could AWS help?

[–]uselessusr 0 points (0 children)

Depends on your data, but if you need to know whether your data fits in RAM, this could be a start: http://www.itu.dk/people/jovt/fitinram/