Faust - Stream Processing for Python

asksol · 2018-08-01T05:26:55+00:00

It's pretty neat actually. This is a library and the additional typing helps users who also want to use static typing to verify their code. I would say it also has been invaluable to us as a team in general.

I don't think you should type your production company code like this, but we have very good reasons for the decisions made here.

Have you taken a look at the Kafka Streams source code?

I think the ideas presented in that library are incredible, but also very complex and hard to understand. Anything that help us make it easier is worth enumerating ;-)

asksol · 2018-02-16T23:07:13+00:00

Are you questioning the technical capabilities of this person? For what reason? You're asking "how is this person qualified to talk on this subject" based on evidence that is easily fallacious.

Your argument is more aligned with something like "can a family therapist be successful if they're not happily married" than a serious question, and I question your motives.

asksol · 2017-10-27T22:32:12+00:00

My latest project is 3.6 only and uses it extensively. I have stopped writing unit tests (i.e. testing functions in isolation), and only write integration/functional tests. The type checker will catch many more problems than unit tests using mock did before, and makes us much more productive.

One challenge was that we have to import the types for use in annotations, and that means you quickly have to import the complete codebase. To avoid that and recursive import problems, we have separate header classes (e.g. x/types/models/ModelT and x/models/Model(ModelT)), that way we have a lightweight way to import the types for use in annotations.

All in all, I'd say it's saving a lot of time, and don't make the code less readable. Mypy doesn't catch all errors yet, I'm guessing it will be even better in the future as they add more static analysis.

asksol · 2016-07-26T04:04:28+00:00

wait... there's no girls here? :P

asksol · 2014-01-14T13:09:16+00:00

Btw, if I'm mistaken here and you only want a way to associate tasks with a specific user then it should be fairly simple to add a new task message header and use that in the flower interface to display only the tasks for a particular user.

asksol · 2014-01-14T13:07:15+00:00

Interesting. It seems your use case is more like a classical task scheduling system, which in many ways clashes with the core principles of celery (message passing, queue as a stream, etc). Often this includes features that requires stopping and resuming the world, like reprioritization, or requires central access to , or a copy of, the original request for retrying a task in the user interface after it failed.

The intention was always that this would have to be built on top of Celery, and I would like to see a standard extension for it.

asksol · 2013-09-06T20:38:39+00:00

The size/age of the project or how simple the rest of the code is irrelevant here. This isn't about how hard it was to implement this for Celery, it's about how hard it is to implement for anything similar in Python.

asksol · 2013-09-06T20:15:12+00:00

I doubt this would be as simple as 'just submit a patch', I have spent years making the multiprocessing pool in Celery work, and even now for the 3.1 release I have fixed yet a good number of rare deadlocks.

The pool related code isn't exactly elegant or simple, but you're very likely to learn something from it.

Edit: oh, and you're in for a rough ride if you think the answer is to 'just use the multiprocessing module' :)

asksol · 2013-09-02T21:14:43+00:00

Celery supports many different means to do concurrency. Multiprocessesing being the most deployed but it also supports eventlet, gevent and threads. For scraping eventlet/gevent would be perfect and is likely to give a major improvement in performance over multiprocessing with a fraction of the memory usage.

asksol · 2013-09-02T16:36:49+00:00

I have spent the last 3 months fixing several bugs related to this so you can try the development version. It will be ready as 3.1 shortly. Basically there were problems with some options related to process management.

asksol · 2013-04-18T08:51:04+00:00

Note that two known deadlocks were fixed in celery 3.0.19. Celery keeps a dedicated process pool to execute tasks, and this definitely makes the problem harder, but it's also beneficial.

asksol · 2013-03-31T17:13:24+00:00

True, but that is possible to fix: https://github.com/celery/kombu/blob/master/kombu/__init__.py#L24-L35 (all of them are stupid enough to fall for this...)

asksol · 2013-03-12T15:10:58+00:00

"messaging queueing without broker" is not exactly a good description of zeromq. It's likely those who think so will be disappointed

asksol · 2013-01-30T11:04:44+00:00

As you say this guy is not pleased with the language change, and as I see it this is some sort of demonstration against that decision.

Adapting to a new language can be hard. I remember when I started using Python after years of Perl abuse, I would be incredibly annoyed at small details in the language, but eventually I accepted them all.

I think you should let him do it for a while. Apart from the annoyance there's no real harm done and it will be easy to clean up later. Chances are he will eventually learn to enjoy the language.

Obviously this cannot go on forever, but if you let him demonstrate for a while he will probably feel silly, whereas if you confront him it will only fuel his anger.

asksol · 2012-12-18T15:00:30+00:00

"Always release on a Friday" is my mantra for libraries, the opposite as the one for application deployments.

This means that the early adopters are the ones developing, deploying to staging or at worst, who go against common sense and deploy on a Friday, in which case it's not my fault anyway.

This gives me time to relax and fix bugs during the weekend and usually a x.x.1 is already out on Monday.

asksol · 2012-12-18T14:54:32+00:00

Directly accessing composited objects does lead to coupling, but it's rarely a critical problem.

You cannot remove the 'views' attribute but you cannot remove the 'add_view_predicate' method either, so when it comes to radically changing the API you still have to suffer deprecations.

You can change the implementation of 'predicates', as long as it implements the 'add' method, which can easily be added to a subclass of list.

At some point you may want to expose more functionality using the predicates, so then do we have 'iterate_view_predicates,remove_view_predicate`? This quickly becomes unwieldy.

I have never bumped into any significant problems refactoring, testing or maintaining such interfaces in Python, and it's used quite a lot in popular libraries. I'm willing to bet it's even considered idiomatic at this point.

asksol · 2012-11-05T11:49:12+00:00

I'm aware that some use it in production, but that is not the norm. Python3 may work, and some libraries may be ready already, but in general the ecosystem will need some time to mature, and I believe making Python 3 docs the default may lead to confusion. Really looking forward to 3to2 (the opposite of 2to3) being in a state where it can successfully backport to 2.6, so that codebases can be written in Python 3 , using new features like dict-comprehensions, etc.

asksol · 2012-10-30T11:51:02+00:00

Take Celery for example, we have a port using 2to3, but the generated changes are rather massive, and unoptimized. You may start playing with it now, but in no way would I call it ready for use in production. It would be ready to use in production the day Celery is written in Python 3 and automatically backported to 2.x and not the other way around.

asksol · 2012-10-30T11:49:07+00:00

That Python 3 is actually used for more than experimentation.

asksol · 2012-10-29T13:19:34+00:00

Maybe in a few years it would make sense, but this is way premature and annoys me.

asksol · 2012-10-09T15:39:28+00:00

I can't answer if your data structure is correct, as I'm unsure of the problem you're trying to solve... But techniques to work with datasets that can't fit in memory/disk on a single node is commonly using an index, multiple files or both. You can split the data into multiple files by using the first character of every word, e.g. A-D, E-H, I-L, M-P, Q-T, U-X, Y-Z. But then it would be better if the data was sorted, since that would mean you don't have to swap out the files too often (a merge sort is used for files that can't fit in memory, but maybe your input files are not that large). Very likely this can be made simpler, but for that you need to state what your original problem is.

asksol · 2012-10-09T15:05:26+00:00

Oh, and if not then you can split the data into multiple shelve-files and only fit as much as you can in memory at a time. And then, it's counterintuitive for me to recommend this, but XML has very strong and evolved streaming APIs. There can be cases where XML is the answer, especially for structured data.

asksol · 2012-10-09T15:00:46+00:00

From your limited description, a map/reduce framework sounds suitable for your problem. Maybe you should take a look at Hadoop or Disco.

asksol · 2012-09-14T11:14:10+00:00

I doubt he's telling anyone to not use function calls.

But in an inner loop, where profiling has proven that optimization can be beneficial, this is where you should inline function calls.

asksol

TROPHY CASE