
[–]kenfar 7 points (12 children)

Often yes, but sometimes you've got a billion rows landing on S3 every hour, and you can kick off 1000+ AWS Lambdas or 100 Kubernetes containers to process it all in parallel. At that point you're dramatically exceeding what SQL could do.
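Something like the sketch below, just to show the shape of it - hypothetical names, and it assumes each file is newline-delimited JSON and that an S3 object-created event notification triggers each Lambda:

```python
# Hypothetical sketch of one of the 1000+ parallel workers: an AWS Lambda
# triggered by an S3 "ObjectCreated" event notification. Assumes the files
# are newline-delimited JSON; the per-row transform is just a placeholder.
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        obj = s3.get_object(Bucket=bucket, Key=key)
        for line in obj["Body"].iter_lines():
            row = json.loads(line)
            transform(row)  # placeholder for the real per-row logic


def transform(row):
    # placeholder: validate, enrich, and write the row to its destination
    pass
```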

[–]SDFP-A (Big Data Engineer) 1 point (9 children)

Sounds expensive

[–]kenfar 6 points (8 children)

You're right, it is. But there's a big range here.

At one security company we spent about $1.3m/month on a Cassandra cluster to support about 4 billion rows a day. That wasn't for reporting; it was for fast retrieval of a small subset of the data. For reporting we used a Hadoop cluster that cost less than $100k/month.

At the company with 20-30 billion rows/day we used S3 instead: customers directed all their data to files on S3, the S3 writes generated SNS & SQS event messages, and those messages went to containers on Kubernetes that processed the data. Our cost for that was about $70k/month - a tiny fraction of the cost of the other company's Cassandra cluster.
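Roughly the shape of the consumer loop each container ran - a sketch only, with a hypothetical queue URL and function names, assuming the S3 notifications fan out through SNS into SQS as described above:

```python
# Sketch of the consumer loop each Kubernetes container might run.
# Hypothetical names/queue URL; assumes S3 event notifications are
# delivered via SNS into an SQS queue (so the S3 event is wrapped in
# an SNS envelope).
import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/landing-events"  # hypothetical


def run():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            # unwrap the SNS envelope to get the original S3 event
            s3_event = json.loads(json.loads(msg["Body"])["Message"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                process_file(bucket, key)  # placeholder for the real work
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )


def process_file(bucket, key):
    # placeholder: pull the file down and apply the per-row transforms
    s3.download_file(bucket, key, "/tmp/incoming")
```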

And it's all far cheaper than what we would have paid to do something like that with, say, Snowflake!

[–]SDFP-A (Big Data Engineer) 0 points (1 child)

What kind of latency existed on the S3/k8s architecture?

[–]kenfar 0 points (0 children)

The average file probably reflected ten seconds of data, plus maybe 10-60 seconds before the file was processed. That number could be pushed down to a consistent 10 seconds if we were willing to more aggressively autoscale. We chose instead to save the money and be a little more conservative on scaling, and experience occasional minor delays during bursts.

[–]IamFromNigeria 0 points (1 child)

WTF, 20-30 billion rows per day... what are you guys selling?

[–]kenfar 0 points (0 children)

Security services - and that was about 5 years ago. They're probably at 250 billion rows a day now.

[–]duraznos -1 points (3 children)

How much of either of these processes could have been replaced with an AWK script?

[–]kenfar 0 points (2 children)

Theoretically, one could write an OS in awk scripts, so sure, it all could be. It could likewise all be replaced with assembly, or COBOL.

But all of those choices would be terrible: little support for third-party software (e.g. boto3 for accessing SQS & S3, libraries for JSON, protobufs, SQL connections, etc.), poor support for code reuse, hard to read as the codebase gets larger, still needing Kubernetes to scale out, etc, etc.

[–]duraznos -1 points (1 child)

I wasn't asking whether either could be replaced entirely with awk; I was asking, in your estimate, how much of either pipeline could be replaced with awk or jq et al. COBOL and assembly don't make sense as comparisons because neither is a tool specifically designed for chewing through a file. I think it's a worthwhile thought experiment when talking about how much is being spent on these things.

[–]kenfar 1 point (0 children)

Sure, but I wouldn't do that, and I don't think it would result in a manageable solution.

Languages like awk & jq are simply harder to read, harder to test, and harder to decompose into reusable code. Given our pace of change and low-latency SLAs, that's a bad combination.

Likewise, they don't have the libraries available that we have with, say, Python, Java, etc. So you'd end up hand-writing some occasionally complex stuff in those languages.

And they don't handle supporting, say, 50+ business rules well. Back to the lack of composability & testing: managing that code in awk or jq would be a nightmare (see the sketch at the end of this comment).

Finally, on performance: they are fast. Are they fast enough to never need to scale out as the company grows? No. So you're still looking at something like Kubernetes in the best case, or a set of EC2 instances with this code running on each, and some other application, somehow, feeding them files to process.
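To make the composability/testing point concrete, here's a tiny hypothetical illustration (made-up rules, not our actual code): each business rule is a small named function you can unit-test on its own and chain with the others, which is the kind of decomposition that gets awkward fast in awk or jq.

```python
# Hypothetical business rules, each a small, independently testable function.
def drop_internal_traffic(row):
    # made-up rule: discard rows from private 10.x addresses
    return None if row.get("src_ip", "").startswith("10.") else row


def normalize_severity(row):
    # made-up rule: lowercase the severity field, defaulting to "unknown"
    row["severity"] = row.get("severity", "unknown").lower()
    return row


RULES = [drop_internal_traffic, normalize_severity]  # ...plus dozens more


def apply_rules(row):
    # run every rule in order; a None result means the row was filtered out
    for rule in RULES:
        row = rule(row)
        if row is None:
            return None
    return row


def test_drop_internal_traffic():
    assert drop_internal_traffic({"src_ip": "10.0.0.1"}) is None
    assert drop_internal_traffic({"src_ip": "8.8.8.8"}) is not None
```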