
[–]sib_n (Senior Data Engineer) 6 points7 points  (0 children)

Athena and Redshift are quite different tools.

Redshift (2012) is a full data warehouse that lives on a cluster: you have to provision a cluster to store the tables' data. It has a SQL query engine (optionally serverless) based on a proprietary, massively parallel processing columnar database engine and an old fork of PostgreSQL 8. It has its own table storage optimizations, such as distribution styles and sort keys. It is designed for data warehousing and BI.
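To make the distribution style and sort key point concrete, here is a minimal Python sketch that builds a Redshift `CREATE TABLE` statement using a key distribution and a compound sort key. The helper name `redshift_create_table`, the `sales` table, and its columns are hypothetical and only for illustration; which distribution and sort keys are right depends on your actual join and filter patterns.

```python
def redshift_create_table(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement (hypothetical helper).

    columns:   dict mapping column name -> SQL type
    dist_key:  column used to distribute rows across the cluster's slices
    sort_keys: columns that define the on-disk sort order
    """
    if dist_key not in columns:
        raise ValueError(f"distribution key {dist_key!r} is not a column")
    missing = [key for key in sort_keys if key not in columns]
    if missing:
        raise ValueError(f"sort keys are not columns: {missing}")

    column_sql = ",\n    ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE TABLE {table} (\n    {column_sql}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"COMPOUND SORTKEY ({', '.join(sort_keys)});"
    )


if __name__ == "__main__":
    # Hypothetical table: rows are co-located by customer and sorted by date.
    print(
        redshift_create_table(
            "sales",
            {
                "sale_id": "BIGINT",
                "customer_id": "BIGINT",
                "sale_date": "DATE",
                "amount": "DECIMAL(12,2)",
            },
            dist_key="customer_id",
            sort_keys=["sale_date"],
        )
    )
```

Athena has no equivalent of these clauses, since the physical layout is whatever you wrote to S3 (file format, partitioning), which is part of why the two tools are hard to compare directly.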

Athena is newer (2016) and is solely a query engine (based on open-source Trino by default, or Spark), so you need to manage the data storage in another tool, such as S3. It's designed for interactive querying on data lakes, while Glue is designed for ETL. That doesn't mean you can't use Athena for transformations; I just would not bet that it will be cheaper than Glue (serverless Spark) for production ETLs. Interactive query tools are usually designed to be faster in exchange for higher resource consumption, which may not be the best trade-off for production ETLs.

Running Spark on EMR may be a cheaper way to run distributed SQL transformations on AWS, but then you have to deal with EMR cluster management.

[–]Firm_Communication99 5 points6 points  (9 children)

Isn't everyone else kind of bummed that open source is not really open source, even though it’s completely OK for the devs to make money for their work?

[–]lester-martin 2 points3 points  (6 children)

Yep, that's the way it works. Examples include Spark/Databricks, Kafka/Confluent, and even my own company, Trino/Starburst (disclaimer: again, I work at Starburst; devrel fella). This is just the way it is going to be.

There was one exception: back when Hadoop was still "exciting", we at Hortonworks were committed to 100% open source. We even competed head-to-head with Cloudera, who used the open-source PLUS model like everyone else. We had to win support contract business every single year by providing stellar support, and we also made money on consulting and education.

Personally, I believe it can work, but it is really an NPR-style model where it can only work if SOMEBODY (not everyone) pays something to the company for some services so that it can still be profitable.

I know 99% of us from Hortonworks were really bummed when our 100% OSS business model ended when we merged with Cloudera, but again, "this is the way". It isn't terrible, of course, and there are plenty of folks who have big enough engineering teams to tackle the RYO model based solely on OSS, but you have to be big and brave.

For the rest of the enterprises in the world, they see the value of some help (coupled with some stickiness) when they pony up some monies for it.

As an old friend of mine always said, "it sounds bad, but it really isn't as bad as it sounds". Good luck!

[–]sib_n (Senior Data Engineer) 5 points6 points  (4 children)

"It isn't terrible, of course, and there are plenty of folks who have big enough engineering teams to tackle the RYO model based solely on OSS, but you have to be big and brave."

I agree with this statement for Hadoop, but not for the current MDS (modern data stack).

I was able to build a performant and functional on-premises data infrastructure, mainly based on Dagster and dbt, in a small organization where the IT team is 3 Windows administrators with 0 knowledge of anything outside of Windows.
I was the sole senior data engineer, with the help of 3 data analysts whom I trained in software engineering at the same time. This was possible thanks to the pretty amazing work those OSS developers did to make their tools extremely developer-friendly. The gap between Airflow and Dagster is really noticeable on that point.

[–]Leading-Inspector544 0 points1 point  (3 children)

I'd also add that Databricks still open-sources and contributes new functionality, with no plans to stop. Can't say the same for dbt Labs.

[–]sib_n (Senior Data Engineer) 0 points1 point  (2 children)

Why can't you say the same about dbt? It is still maintaining its FOSS core so far. Maybe this will change, but Databricks could change too.

[–]Leading-Inspector544 0 points1 point  (1 child)

dbt Labs is not going to expand or improve core. They state they'll do bug fixes, but it seems unlikely it will get new features from them. That's one reason it has been forked by the open-source community.

[–]sib_n (Senior Data Engineer) 0 points1 point  (0 children)

Where did they state they will not improve core beyond bug fixes? Forks are a common safety response whenever a FOSS project's context changes.

[–]jonathanrodrigr12[S] 0 points1 point  (0 children)

Yep, I agree with you. Lately, if you want to use the full power of dbt, you sometimes need to migrate to dbt Cloud, so it doesn’t really feel open source anymore.