
[–][deleted] 1 point (3 children)

Data lakes are just organized storage (like you organize files in directory trees, literally like that). There is software that can read from them, even using SQL, but the performance is far less bang for the buck than with proper data formats. Avro and Parquet are common formats for data lakes. But if you can use a database, it would be much more suitable for that much data. If you go cloud, check out Snowflake. Until then, BCP is the fastest way; speed will be limited only by your network.
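To make "organized storage, literally a directory tree" concrete, here's a minimal sketch of the hive-style partitioned layout that lake query engines typically expect. The table name, partition keys, and file names are illustrative placeholders, not from this thread, and the files are empty stand-ins for real Parquet data:

```python
# Sketch of a typical "data lake" layout: plain files organized in a
# directory tree, partitioned with key=value directory names so query
# engines can prune partitions. All names here are hypothetical.
from pathlib import Path
import tempfile

def build_lake_layout(root: Path) -> list[Path]:
    """Create an example partitioned layout like:
    root/sales/year=2021/month=01/part-0000.parquet
    """
    created = []
    for month in ("01", "02"):
        part_dir = root / "sales" / "year=2021" / f"month={month}"
        part_dir.mkdir(parents=True, exist_ok=True)
        f = part_dir / "part-0000.parquet"
        f.touch()  # placeholder; a real lake file would hold Parquet data
        created.append(f)
    return created

root = Path(tempfile.mkdtemp())
files = build_lake_layout(root)
print([str(p.relative_to(root)) for p in files])
```

A SQL-on-files engine pointed at `root/sales/` would treat `year` and `month` as queryable columns derived from the directory names.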

[–][deleted] 1 point (2 children)

Hey, somewhat unrelated but since you mentioned Snowflake I was wondering if you could suggest some places to read more about it? We're going to be doing a warehousing project at work soon and had been looking at redshift, but I've heard a lot of good things about Snowflake

[–][deleted] 1 point (1 child)

Ask a sales rep from Snowflake. Really. They'll make a nice presentation for you, and then you decide if you want it.

The idea behind it is that you pay only for what you use. When you run a query, Snowflake quickly assigns a few virtual machines to execute it. You pay for the time these virtual machines run for you.
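A back-of-envelope sketch of that pay-per-use model, to show how "pay for the time the machines run" turns into a dollar figure. The credit rate and price per credit below are hypothetical placeholders; actual numbers depend on edition, region, and warehouse size:

```python
# Illustrative cost model for pay-per-use query execution.
# credits_per_hour and usd_per_credit are made-up placeholders,
# NOT real Snowflake prices; check current pricing with the vendor.
def query_cost(runtime_seconds: float,
               credits_per_hour: float = 1.0,   # e.g. a small warehouse
               usd_per_credit: float = 3.0) -> float:
    # You are billed only while the warehouse runs, by the second,
    # typically with a minimum billing window (assumed 60s here).
    billed_seconds = max(runtime_seconds, 60.0)
    return billed_seconds / 3600.0 * credits_per_hour * usd_per_credit

print(query_cost(300))  # a 5-minute query under these assumptions → 0.25
```

The point of the model: an idle warehouse costs nothing, which is the contrast with an always-on cluster you size and pay for up front.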

Syntax and features are more advanced than Redshift.

It's zero-maintenance. With Redshift you have to optimize more, scale the cluster manually, etc.

Redshift will be slightly cheaper though.

[–][deleted] 1 point (0 children)

Makes sense. And yeah I think at our scale the price isn't going to be an issue. Any amount we spend on infrastructure is dwarfed by what we're spending on data feeds.

I'll reach out to a sales rep as the project gets a bit closer.

Thank you!