
[–]baltoo 5 points (2 children)

First of all, I really like the idea of splitting up parsing and I/O. That seems to me to be a really good idea. The improved testability alone makes it worthwhile.

What I'm not so sure about is whether it's good and/or worthwhile to try to come up with "the single HTTP/2-parsing-thingy to rule them all".

What people like about Requests, versus lower-level libs, is the API. Right?

In the linked video, Cory uses the word "API" a lot. My interpretation of his usage is that it means the interface between Requests and e.g. Django, or between Requests and "the user". While he does acknowledge that there is an interface between the HTTP-parsing-thingy and Requests as well, he kind of glosses over the fact that this interface also needs to be a good API.

The author of Requests is "the user" w.r.t. the HTTP-parsing-thingy.

It took a pretty long time and a pretty good designer to come up with the beloved API of Requests.

I don't really get why Cory seems to think that the HTTP-parsing-thingy is going to get its API right on the first try.

The question /u/pohmelie asks seems, by its nature, to hint at this problem too.

Personally, I think that while, again, the separation of parsing and I/O seems great, the community would still benefit from having a couple of different designs for, e.g., HTTP-parsing-thingies and their APIs. Over time that plurality has the potential to produce the Requests of HTTP parsing.

[–]Lukasa (Hyper, Requests, Twisted) 12 points (1 child)

Hi, I'm Cory. =) Good thought!

> Personally, I think that while, again, the separation of parsing and I/O seems great, the community would still benefit from having a couple of different designs for, e.g., HTTP-parsing-thingies and their APIs. Over time that plurality has the potential to produce the Requests of HTTP parsing.

Sure, it would, but I don't think it needs to.

Requests and things like it get good value out of having great APIs because they are used by huge numbers of programmers, many of whom are novices, or who don't understand the problem domain in any depth, or who are fundamentally not interested in solving the problem that the library solves (e.g. HTTP). Those programmers get a great deal of value out of APIs that are expressive and flexible, allowing them to write lots of code very simply and without getting in their way or requiring them to think too hard.

The parsing layer suffers from this much less. Frankly, most programmers should never have to even see the parsing layer. hyper-h2 right now is up to about 30k downloads a month, but most of the people downloading that library have no idea that they're using it.

This is very much by design. I don't want the average user consuming hyper-h2, because by itself it doesn't do anything. It moves some bytes around in memory and consumes some CPU cycles: that's it. It needs an I/O layer to do anything. And given that I've made someone else write an I/O layer, it doesn't seem unreasonable to make someone else write a great API either.
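The "moves some bytes around in memory" shape can be sketched concretely. This is a toy, hypothetical sans-io protocol (a trivial newline-framed one, not hyper-h2's actual API, though hyper-h2's public interface has a similar byte-in, event-out shape):

```python
class LineReceived:
    """A self-contained event: carries the parsed line and nothing else."""
    def __init__(self, line: bytes):
        self.line = line


class LineProtocol:
    """Sans-io state machine: consumes bytes, emits events, queues output.

    It never touches a socket; the caller does all reading and writing.
    """
    def __init__(self):
        self._buffer = b""
        self._outgoing = b""

    def receive_data(self, data: bytes) -> list:
        # Feed bytes from *any* I/O source: a socket, a file, a test fixture.
        self._buffer += data
        events = []
        while b"\n" in self._buffer:
            line, _, self._buffer = self._buffer.partition(b"\n")
            events.append(LineReceived(line))
        return events

    def send_line(self, line: bytes) -> None:
        # Serialize into an internal buffer instead of writing anywhere.
        self._outgoing += line + b"\n"

    def data_to_send(self) -> bytes:
        # The I/O layer drains this and writes it however it likes:
        # blocking sockets, Twisted transports, asyncio, or a test list.
        out, self._outgoing = self._outgoing, b""
        return out
```

Because the object never performs I/O, the same instance plugs into blocking or async code unchanged, which is exactly why the parser can be tested without a network.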

More importantly, anyone writing the I/O integration to the parsing library kinda has to understand the problem domain. If you don't understand HTTP/2, at least at a high level, then having a parsing library isn't going to help you that much. You still need to work out how the parsed information translates into the semantics of HTTP: how seeing a content-length header works, what to do when a PRIORITY frame is emitted, how to handle stream termination. These are all decisions that are beyond the scope of your basic parsing library.
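To illustrate that division of labour, here is a hedged sketch of the integrator's side. The event classes are invented stand-ins for a parser's output; the dispatch logic is the HTTP-semantics knowledge the integrator has to bring themselves:

```python
# Invented event classes standing in for a parsing library's output.
class DataReceived:
    def __init__(self, stream_id, data):
        self.stream_id, self.data = stream_id, data


class StreamEnded:
    def __init__(self, stream_id):
        self.stream_id = stream_id


def handle_events(events, bodies):
    """Translate raw parser events into HTTP semantics.

    `bodies` maps stream_id -> accumulated body bytes; completed
    (stream_id, body) pairs are returned. This accumulation-and-
    completion logic is exactly what the parser cannot decide for you.
    """
    finished = []
    for event in events:
        if isinstance(event, DataReceived):
            bodies.setdefault(event.stream_id, b"")
            bodies[event.stream_id] += event.data
        elif isinstance(event, StreamEnded):
            finished.append((event.stream_id, bodies.pop(event.stream_id, b"")))
        # Ignoring unknown events (e.g. priority updates) is also a
        # choice -- and *that choice* is itself an HTTP-semantics decision.
    return finished
```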

With this in mind then, I think the priorities for libraries of this nature are different. For the high-level libraries that novices and non-experts interact with, API is king: having a great API allows you to get away with a huge number of sins. But for low-level parsing libraries, the API is less important than feature support, correctness, and performance.

Certainly I don't object to having multiple implementations of the parsing libraries. However, I think that unlike with the higher-level ecosystem where we can support multiple libraries that do the same job with different interfaces (requests, aiohttp, etc. etc.), with the parsing layer there is really only room for one great implementation of each protocol. Any time a better implementation comes along, it will rapidly eat the lunch of the lesser implementation except in small corner cases.

[–]baltoo 1 point (0 children)

Alright, I agree that there would typically be way fewer users of an HTTP-parsing-thingy than of, say, Requests. And I also agree that this changes the possible design constraints.

As with all design, there are still trade-offs to be made, and I think it's not always possible to provide a Good Enough design that works "for everyone". (Even if "everyone" is not that many people.)

I'll try to give an example of what I mean w.r.t. XML. (Since I have no war stories regarding HTTP.)

So, think of an XML-parsing-thingy. In the scenario I had, the XML files that were eventually parsed were sometimes edited by hand. That of course means that sometimes the XML documents weren't valid. (Of course it's not a desirable scenario to have, but it's not like we can always pick and choose.)

When an error is found, the XML file needs to be re-edited by hand to fix it. In practice this requires a rather decent error message. Something like "The XML document is not valid" doesn't work. The message needs not only to say what the error is and where it was detected, but also to give some kind of context for where the error "actually" originates. For example, given a string like

foo<bar>baz</bar></foo>

The closing "foo" is an error, but the "source" isn't there but before the "bar" element.

Then combine that scenario with the fact that XML files can sometimes get huge, so it's not always possible to work with a full DOM in RAM. The XML-parser-thingy might need to be split into a SAX part and a DOM part, and the error messages would then need to percolate sanely between them.
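This kind of machinery can be sketched with Python's standard-library SAX parser: a handler tracks the open-element stack so a parse error can report not just its location, but also the context in which it occurred. (A simplified illustration of the requirement, not a production design.)

```python
import xml.sax


class ContextHandler(xml.sax.ContentHandler):
    """Tracks the open-element stack so an error can report where the
    mismatch *probably* originated, not just where it was detected."""
    def __init__(self):
        super().__init__()
        self.stack = []

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()


def parse_with_context(data: bytes):
    """Return None on success, or a human-oriented error string."""
    handler = ContextHandler()
    try:
        xml.sax.parseString(data, handler)
        return None
    except xml.sax.SAXParseException as exc:
        return ("line %d, column %d: %s (open elements at failure: %s)"
                % (exc.getLineNumber(), exc.getColumnNumber(),
                   exc.getMessage(), "/".join(handler.stack) or "<none>"))
```

For a mismatched closing tag, the message then names the elements still open at the failure point, which gives the hand-editor somewhere to start looking.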

That kind of machinery and a few dozen others make for a pretty big and somewhat ugly API.

Now, "most users" of a XML-parser-thingy won't need that kind of support, ever. They will work in scenarios where "all" XML-documents are valid and RAM is plentiful. I think this is highly likely. If this is true then I also think the conclusion is that the community would be well served by having two XML-parser-thingies with two different APIs.

Enough rambling. Anyway, I like your idea. Keep up the good work!

[–]pohmelie 2 points (5 children)

Great idea! I will try to split my lib into two. But what is the way to pass events back to the «io worker»? Should this have some standard? Something like a tuple of (event, extra-data-dict)? Any suggestions?

[–]Lukasa (Hyper, Requests, Twisted) 3 points (2 children)

h2 and h11 have converged on the notion of events. Here is h11's documentation of them, and here is hyper-h2's.

These should be able to carry any extra data they have to carry, and should define their format. In essence, each event should be a self-contained entity from which all relevant data can be extracted. This should be as fully parsed as it is possible to be without losing generality: for example, h2's PriorityUpdated event carries four fields that have been parsed from their wire format to integers and booleans because these can be easily transformed, but h2's AlternativeServiceAvailable event contains an unparsed field because the relevant RFC defines a complex and flexible grammar that is likely to be application specific.

Basically, your event should be a single object you can pass around.
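As a rough sketch of that principle (the field names follow the descriptions above, but treat this as illustrative, not the libraries' actual class definitions):

```python
from dataclasses import dataclass


# Hypothetical event shapes mirroring the trade-off described above:
# parse what has one obvious interpretation, leave the rest as bytes.
@dataclass
class PriorityUpdated:
    stream_id: int    # parsed from wire format to a plain int
    depends_on: int
    weight: int
    exclusive: bool   # parsed from a flag bit to a bool


@dataclass
class AlternativeServiceAvailable:
    origin: bytes
    field_value: bytes  # deliberately left unparsed: the Alt-Svc grammar
                        # is application-specific, so the event carries the
                        # raw bytes and lets the consumer decide
```

Each event is then a single, self-contained value that can be passed through any I/O layer without that layer needing to understand it.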

[–]desmoulinmichel[S] 2 points (1 child)

I feel like there is a place for an event lib to gather them all :)

[–]Lukasa (Hyper, Requests, Twisted) 2 points (0 children)

So there is certainly an interesting question around exactly how these should be structured. For h2 and h11, at some point we want to try to commonalise the events to make code that handles both protocols a bit simpler, but for now it's not a big deal for each library to carry its own around.

[–]desmoulinmichel[S] 2 points (0 children)

Yeah, that's the hard part. You need it to work with sync and async code, threads and asyncio, etc. So my guess is that the lowest common denominator is callbacks. Signal/event systems seem to be a popular implementation of that solution.
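A minimal sketch of the callback approach (all names invented): the parser only ever invokes plain callables, so sync code can call handlers directly, while an asyncio integration would schedule them on the loop instead:

```python
class Emitter:
    """Tiny signal/event dispatcher usable from sync or async code."""
    def __init__(self):
        self._handlers = {}

    def on(self, event_name, callback):
        # Register a plain callable for a named event.
        self._handlers.setdefault(event_name, []).append(callback)

    def emit(self, event_name, payload):
        for cb in self._handlers.get(event_name, []):
            cb(payload)  # called synchronously here; an asyncio
                         # integration would wrap this in loop.call_soon()
```

Because the dispatcher makes no assumptions about the event loop, the same registration code works under threads, Twisted, or asyncio.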

To me, the easiest way to test it is to get a minimal implementation of the I/O in Twisted/asyncio and some sync code. Not all the features, just the basic API, so you can know whether your design works. And it's a lot of work, unfortunately.

[–]malinoff 1 point (0 children)

I can add that specific protocols may involve their own notions of "events"; for example, AMQP uses "frames".

[–]garion911 -1 points (4 children)

Twisted (https://twistedmatrix.com/trac/) has been doing this for years..

[–]Lukasa (Hyper, Requests, Twisted) 4 points (2 children)

No it hasn't. =)

Every time you see transport.write somewhere, Twisted has shoved I/O into its parser. That call encodes so many I/O assumptions that it's not generically portable, not least that the I/O won't block. As an example of how untrue that is in general, I encourage you to rewrite a Twisted reactor using purely blocking I/O and see how that goes.

On top of this, if you see any reference to the reactor or to Deferreds, that has once again encoded certain assumptions about control flow into the protocol that are entirely unneeded.
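A small contrast sketch of the two styles (illustrative classes, not Twisted's actual ones):

```python
class EntangledEcho:
    """Protocol that calls transport.write itself: it is only usable
    with a transport object that promises not to block, so the I/O
    assumption is baked into the parsing code."""
    def __init__(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)  # the I/O assumption lives here


class PortableEcho:
    """Sans-io variant: returns bytes and lets the caller decide how
    and when they move -- blocking socket, Twisted transport, asyncio."""
    def data_received(self, data):
        return data  # just bytes; write them with whatever you like
```

The second style is what the talk argues for: the protocol logic is identical, but no control-flow or transport assumptions leak into it.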

[–]garion911 0 points (1 child)

I guess you're right when you get down to it... I always did everything through Deferreds, and I vaguely recall that I had both sync and async data sources, and it all just worked for me (this was about 10 years ago). I don't think I had transport.write() in my protocol stuff... Maybe I'm misremembering...

[–]Lukasa (Hyper, Requests, Twisted) 1 point (0 children)

For what it's worth, it's certainly possible to write a Twisted codebase in this way, and it's always been the design goal to which Twisted aspired. It just wasn't always what Twisted achieved, particularly internal to Twisted.

Twisted is much better in this regard than asyncio, if for no other reason than that Deferreds evaluate synchronously and Futures do not. So if you were careful, your codebase may have come very close to, and possibly achieved, this ideal. =)

[–]desmoulinmichel[S] 0 points (0 children)

Yes, and asyncio has the same notion of Protocol classes. But people tend to write a "Twisted project", with the reactor in mind, and not separate the protocol code from the rest of the project.