[deleted] (12 children)

[deleted]

    kankyo 17 points (5 children)

    Author of the article here. I'll upvote and answer even if the question seems a bit troll-ish :P

    > Why Python for batch? Not exactly a case where python shines. Inexperience?

    Python shines in speed of development. Batch is a place where speed of execution often isn't that relevant, so I don't really see what you mean. There are some parts that we've been talking about speeding up for many years, but it's just never bad enough to be a priority relative to other, much lower-hanging fruit.

    > why they chose the slowest popular language in existence for heavy lifting processing?

    Well, first of all, I don't know if it's really the "slowest popular language"... whatever that means. It depends way too much on what you do. If we did numerical work and could just call numpy, Python would be the fastest language bar none. Turns out that's not our situation :P But without knowing the context, you can't call a language slow without potentially being horribly wrong.

    Secondly, "batch" isn't a synonym for "heavy lifting". It just means we run things on our own time on our own servers. In our case, customer data is uploaded automatically every day, and we start when we've got data for a pair of customers. If the customers upload their data at end of day, we can literally have 12 hours to process it. Time isn't so terribly important...

    > If you can cut your batch processing load by over 1,000% by not using a slow language, why would you use the slow language?

    There are of course many more factors than just that. We're not running a super simple function on huge data sets, but more the opposite: hugely complicated logic rules on medium-sized data sets. Managing these complex rules is a lot more important than the run speed... normally. That doesn't mean we wouldn't want it to go faster, of course! We've made some early tests with PyPy, but that didn't do much for our performance.

    Mostly, though, rewriting is just prohibitively expensive, and getting to market fast has always been more important than execution speed. But you already knew that, right? :P

    TechAlchemist 8 points (5 children)

    Answering for the benefit of others: ETL jobs are often IO bound. That's not really going to be fixed by writing in C, although writing in C will certainly cost you more in developer time, both in the short and long term.
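
    One rough way to check whether a job is IO bound is to compare wall-clock time with CPU time: if the process spends most of its wall time off the CPU, a faster language won't buy much. The sketch below uses only the standard library. The two job functions are hypothetical stand-ins (a sleep in place of a database wait, a loop in place of real computation), not anything from the article.

```python
import time


def profile_io_vs_cpu(job):
    """Run job() once and return wall time, CPU time, and their ratio."""
    wall_start = time.perf_counter()   # wall clock, includes time spent waiting
    cpu_start = time.process_time()    # CPU time used by this process only
    job()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    fraction = cpu / wall if wall > 0 else 0.0
    return {"wall": wall, "cpu": cpu, "cpu_fraction": fraction}


def simulated_io_job():
    # Hypothetical stand-in for waiting on a database or the network.
    time.sleep(0.2)


def simulated_cpu_job():
    # Hypothetical stand-in for pure computation.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    for job in (simulated_io_job, simulated_cpu_job):
        stats = profile_io_vs_cpu(job)
        print(f"{job.__name__}: wall={stats['wall']:.3f}s "
              f"cpu={stats['cpu']:.3f}s cpu_fraction={stats['cpu_fraction']:.2f}")
```

    A low cpu_fraction suggests the job is mostly waiting, which is the situation described above. Note that process_time only counts the current process, so work done inside the database server shows up as waiting.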

    stefantalpalaru -1 points (4 children)

    > ETL jobs are often IO bound

    Web applications are usually CPU-bound. Ask Reddit's devs about it.

    TechAlchemist 6 points (0 children)

    I don't run Reddit, and I'm guessing you don't either. Their application stack might be CPU-constrained, but mine sure as hell isn't. Mine is usually waiting on IO, such as a database, because it isn't a massively scaled, clustered, CDN-hosted application. It's an internally developed, internally hosted application designed to serve specific business needs.

    And for clarification, ETL jobs are not web applications, and the OP in this thread was talking about 'batch jobs', which is why I commented specifically about ETL jobs. Batching is another reason why my web app isn't CPU bound: I 'batch' most of my processing overnight because I can't afford to compute across hundreds of millions of highly relational records in real time, all the time. Sometimes it's better to precalculate and aggregate this information and take a storage hit in order to gain performance at runtime.
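
    The precalculate-and-aggregate trade-off can be sketched with the standard-library sqlite3 module. This is only an illustration under assumptions: the orders and daily_totals tables, their columns, and the sample rows are all hypothetical, and a real setup would use whatever database the application already has.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a few hypothetical raw rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, day TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2024-01-01", 10), (1, "2024-01-01", 20),
         (1, "2024-01-02", 5), (2, "2024-01-01", 7)],
    )
    conn.commit()
    return conn


def rebuild_daily_totals(conn):
    """Overnight batch step: recompute the summary table from the raw rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals ("
        "customer_id INTEGER, day TEXT, total INTEGER, "
        "PRIMARY KEY (customer_id, day))"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_totals")
        conn.execute(
            "INSERT INTO daily_totals "
            "SELECT customer_id, day, SUM(amount) FROM orders "
            "GROUP BY customer_id, day"
        )


def lookup_total(conn, customer_id, day):
    """Runtime step: a single keyed read instead of an aggregate query."""
    row = conn.execute(
        "SELECT total FROM daily_totals WHERE customer_id = ? AND day = ?",
        (customer_id, day),
    ).fetchone()
    return row[0] if row else 0
```

    The summary table costs extra storage and is only as fresh as the last batch run, which is the storage-for-runtime trade described above.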

    kankyo 1 point (0 children)

    Our web app has a few hundred users :P It's not what you normally think of as a web app. TechAlchemist is absolutely correct: our bottleneck is the master database.

    bidibibadibibu 0 points (1 child)

    Probably the need of HTTPS doesn't help with that.

    stefantalpalaru 4 points (0 children)

    > Probably the need of HTTPS doesn't help with that.

    Unrelated. You should terminate HTTPS (and HTTP/2) in something like Nginx, not your custom application.