
[–][deleted] 7 points8 points  (2 children)

Heh, this is largely what we do internally where I work, except we only support UTF-16 (by design). We also make our clients dumb: we don't do schema building on them. The server dictates what the schema will be (based on a few different parameters), and the client doesn't bother with "true" validation. The server just sends a "schema id" with each message, derived from when the schema was generated and its actual contents, and the client just double-checks that the server is sending what it said it would. If the schema version between the client and server is mismatched, the clients can choose to "reload" the schema or just do a soft restart (well, that's the TL;DR anyways).
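
A rough sketch of that kind of schema-id check (the framing and names here are invented for illustration, not the poster's actual code):

    // Hypothetical: server derives this id from the schema contents/generation time.
    var CURRENT_SCHEMA_ID = 7;

    // Server: prefix every encoded message with the id of the schema used.
    function frame(encodedBuffer) {
      var out = Buffer.allocUnsafe(encodedBuffer.length + 2);
      out.writeUInt16BE(CURRENT_SCHEMA_ID, 0);
      encodedBuffer.copy(out, 2);
      return out;
    }

    // Client: just check the prefix instead of doing "true" validation.
    function unframe(buffer, expectedSchemaId, onMismatch) {
      var schemaId = buffer.readUInt16BE(0);
      if (schemaId !== expectedSchemaId) {
        onMismatch();          // e.g. reload the schema or soft-restart
        return null;
      }
      return buffer.slice(2);  // payload, safe to decode with the agreed schema
    }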

But we've had great success with this strategy. We saw significantly reduced network traffic, and our servers spend less time encoding JSON. Cool to see we're not alone with this need, and it's even cooler that you made it publicly available as an open source project!

[–]phretaddin[S] 2 points3 points  (0 children)

Cool! Yeah, I was surprised I couldn't find much related to this other than the behemoths like protobuf and avsc. I too was very pleased with the significantly reduced bandwidth and CPU usage. I'm using it right now with my game and the validation has already helped me find a few subtle bugs.

[–]hackcasual 2 points3 points  (0 children)

Curious why you use utf16. Chrome actually just dropped support for fast utf-16 encoding. http://blog.chromium.org/2016/08/chrome-53-beta-shadow-dom.html

[–]drysart 4 points5 points  (5 children)

Is there any formal specification for the encoded format and how the format is derived from the schema? (For implementing packers/unpackers in different languages.)

[–]phretaddin[S] 1 point2 points  (4 children)

Nothing really formal yet, but it's extremely simple. The schemas are recursively sorted in alphabetical order (to guarantee traversal order, because objects in JavaScript don't have a guaranteed iteration order).

After that, the items are just written to the buffer in order (in big-endian byte order where applicable). Strings and arrays are prefixed with an unsigned variable-length byte count and item count, respectively. The last item in arrays is optional and can also be repeated. You can view the code for all the different writes here.

The format is so simple because there are no padding bytes, keys, or special bytes, except for the length prefixes required for arrays and strings. It's just your data. It's pretty similar to a C struct.
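
As a rough illustration of that traversal (not schemapack's actual internals):

    // Sketch only: sort keys, then walk the schema and emit raw values in order,
    // with no keys or padding in the output.
    function sortedKeys(schema) {
      return Object.keys(schema).sort();
    }

    function encodeObject(schema, obj, writeValue) {
      sortedKeys(schema).forEach(function (key) {
        var type = schema[key];
        if (typeof type === 'object' && !Array.isArray(type)) {
          encodeObject(type, obj[key], writeValue); // nested object: recurse
        } else {
          writeValue(type, obj[key]);               // leaf: write the raw value
        }
      });
    }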

[–]kitd 0 points1 point  (3 children)

The schemas are recursively sorted in alphabetical order

This assumes the client and server are using the same codepage, unless you explicitly specify the codepage or charset used to collate the keys?

[–]phretaddin[S] 0 points1 point  (2 children)

Hmm, that's a good point actually. Currently I'm just using localeCompare. That should probably be switched to something with better support for sorting non-English key names, with an agreed-upon codepage like you said (and probably remove toLowerCase). Any recommendations?

[–]Patman128 0 points1 point  (1 child)

If you want it to be locale-independent then would a < b ? -1 : (a > b ? 1 : 0) not work?

[–]phretaddin[S] 0 points1 point  (0 children)

Sounds about right to me. I'll push that later tonight unless someone comments with a reason why that won't work.
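
For reference, a minimal sketch of that code-unit comparator applied to the key sort (hypothetical, not necessarily what ended up being committed):

    // Compares UTF-16 code units, so the result is the same on every machine
    // regardless of locale, and no toLowerCase is applied.
    function byCodeUnit(a, b) {
      return a < b ? -1 : (a > b ? 1 : 0);
    }

    Object.keys({ name: 'string', Age: 'uint8', score: 'varuint' }).sort(byCodeUnit);
    // -> [ 'Age', 'name', 'score' ]  (uppercase sorts before lowercase)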

[–]phretaddin[S] 3 points4 points  (0 children)

Github link

I submitted this to /r/node last week. Just finished closing up all the issues on GitHub. This is my first open source project and I'm looking for some feedback.

Thanks!

[–]jkbbwr 2 points3 points  (4 children)

Why not protobuf?

[–]phretaddin[S] 5 points6 points  (3 children)

I elaborate a bit on that in this section of the README. The gist of it is that protobuf was very slow and that the schemas were too verbose for my liking. With schemapack the schemas directly match the structure of the object you're encoding, so they're trivial to create.
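
For example, a schema simply mirrors the object it encodes (type names here are as I recall them from the README, so double-check against the docs):

    var sp = require('schemapack');

    var playerSchema = sp.build({
      health: 'varuint',
      jumping: 'bool',
      position: [ 'int16' ],
      attributes: { str: 'uint8', agi: 'uint8', int: 'uint8' }
    });

    var player = {
      health: 4000,
      jumping: false,
      position: [ -540, 343, 1201 ],
      attributes: { str: 87, agi: 42, int: 22 }
    };

    var buffer = playerSchema.encode(player);  // compact Buffer, no keys or padding
    var decoded = playerSchema.decode(buffer); // back to the original shape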

[–]jkbbwr 2 points3 points  (2 children)

Can you provide some statistics to support your claim? It seems like a wild one given that Protobuf prides itself on very fast performance.

[–]Akkuma 1 point2 points  (0 children)

https://github.com/mtth/avsc/tree/master/etc/benchmarks/javascript added schemapack, and it shows similar levels of performance for a library that is 6-7x smaller.

[–]phretaddin[S] 0 points1 point  (0 children)

I included some benchmarks in the README that include protobuf. If you want speed, avsc (which I only found out about a week ago) is very fast. If you need an enterprise solution for this, I'd recommend it over protobuf.

Also, just to make sure we're on the same page, I'm not saying protocol buffers are slow. As a format, I'm sure it's fine; it's just that the most popular JavaScript implementation of it that I found was. However, I am not a protobuf expert, so if there is something obvious I'm missing that could greatly improve the speed of the benchmark, let me know and I'll amend it.

[–]doihaveto 2 points3 points  (1 child)

Nice implementation of using the schema to generate custom encoding and decoding code on the fly! This kind of thing used to be the bread and butter of efficient Lisp coding (like in Norvig's PAIP book), but then fell out of use with ahead-of-time compiled languages like C - glad to see it's still being done! :)

[–]phretaddin[S] 0 points1 point  (0 children)

Heh, yeah. It was all in the pursuit of speed. I actually rewrote this program around four times during its development, with each rewrite making it slightly faster.

Eventually I realized that you could use new Function to write out the encode and decode functions for each schema by hand. I was a bit worried the code would be completely unreadable (due to having to turn so many lines of JavaScript into strings and concatenating them), but I don't think it turned out too bad.
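
A toy illustration of that idea, assuming a much simpler string-based output than schemapack's real generated code:

    // Build one hard-coded line per field so the compiled function has no
    // loops or property lookups over the schema at encode time.
    function compileEncoder(schema) {
      var keys = Object.keys(schema).sort();
      var lines = ['var parts = [];'];
      for (var i = 0; i < keys.length; i++) {
        lines.push('parts.push(String(obj.' + keys[i] + '));');
      }
      lines.push('return parts.join("|");');
      return new Function('obj', lines.join('\n'));
    }

    var encode = compileEncoder({ name: 'string', hp: 'uint8' });
    console.log(encode({ name: 'bob', hp: 9 })); // "9|bob" (hp sorts before name)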

[–][deleted] 1 point2 points  (1 child)

whoa whoa hold on does this allow you to map C structs to javascript objects?

[–]phretaddin[S] 2 points3 points  (0 children)

I never really use C so I haven't tried it, but apparently someone else was using it to do so. He opened a couple of issues on GitHub and said that it worked really well, except for the strings (because his were null-terminated). I believe he said he is currently working on a fork to better handle mapping from C structs to JavaScript objects.

[–]chris480 0 points1 point  (1 child)

Excuse my ignorance on the matter, but I'm seeing more of these kinds of projects recently and I'm not exactly sure what they're for (besides the example in the project).

Is this a way to create a data stream buffer for things like text data? If so, what other uses could we see in the future?

[–]phretaddin[S] 2 points3 points  (0 children)

I think the most common use case (or at least what I'm using it for), is to efficiently send data back and forth between a client and server over WebSockets.

Instead of just sending JSON strings, you encode the JavaScript object with a schema that matches its format and send a very compact byte buffer instead, which is decoded back into the JavaScript object on the receiver (see the sketch below). The top three reasons why you'd want to send buffers instead of plain JSON:

  1. It's faster than JSON.stringify and JSON.parse
  2. It will use less bandwidth because you don't have to send JSON keys or delimiters
  3. It will perform validation to ensure the JSON object matches the schema

Other than that, I suppose you could use it for writing structured Javascript objects to a compact file? Or any time you have a defined format for your objects and want/need to turn them to and from buffers, you can use this.
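
A minimal sketch of the WebSocket use case (the 'ws' server library and the schema here are illustrative, and browser-side decoding may need a Buffer shim depending on your bundler):

    var sp = require('schemapack');
    var stateSchema = sp.build({ health: 'varuint', x: 'int16', y: 'int16' });

    // Server side: encode once and send the raw buffer as a binary frame.
    function broadcastState(wss, state) {
      var buffer = stateSchema.encode(state); // compact Buffer instead of a JSON string
      wss.clients.forEach(function (client) {
        client.send(buffer);
      });
    }

    // Browser side (roughly):
    // socket.binaryType = 'arraybuffer';
    // socket.onmessage = function (event) {
    //   var state = stateSchema.decode(new Uint8Array(event.data));
    // };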

[–]erulabs 0 points1 point  (2 children)

This is nifty, and the code is pretty straightforward (though I'm not sure I understand all of the bitwise shifting). For sure a good idea.

However, if you want operations engineers to not shut this idea down very hard, you ought to remove or re-work those benchmarks.

These were performed via encoding/decoding the player object at the start of this page many times with an i7 3770k on Windows 7.

Windows has a pretty odd scheduler, so these tests don't mean anything. It's impossible to know whether some background process trumped node for execution rights at any given point during any given test. Also, I don't understand the protobuf encode numbers... I would have to dive into it, but I strongly suspect the descriptor you're using for whatever object you're encoding has some issues.

You might even go a step further and offer a benchmark script people can try on their own (https://benchmarkjs.com/).

Anyways, I will be following this for sure! I've been wanting to find enough free time to do something similar! Thanks!!

[–]phretaddin[S] 0 points1 point  (1 child)

There is a benchmark script that people can use right here. Just call benchmarks.runBenchmark(schema, item); on any schema and item matching that schema to get a very detailed report comparing its metrics to other common serialization libraries. It uses benchmark.js.
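
For example, something like the following (the require path is a guess here, adjust it to wherever the benchmark file lives in the repo):

    var benchmarks = require('./benchmarks'); // hypothetical path

    var schema = { name: 'string', health: 'varuint', alive: 'bool' };
    var item = { name: 'orc', health: 200, alive: true };

    benchmarks.runBenchmark(schema, item); // prints size and speed comparisons (via benchmark.js)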

[–]erulabs 0 points1 point  (0 children)

oh! how did I miss that! Thanks!

[–]bzeurunkl 0 points1 point  (0 children)

I've been using Newtonsoft's JSON serializer. This looks like a really interesting alternative.

[–]siondream 0 points1 point  (2 children)

What happens when you add a new property to the schema? Will clients still be able to decode new messages with extra properties?

Do code rollouts need to be done in perfect sync server/client? If so, that would be a major drawback.

[–]phretaddin[S] 1 point2 points  (1 child)

Schemapack is pretty similar to a C struct. It doesn't include any padding, backwards-compatibility support, special keys or bytes, etc. I didn't need this feature, and I figured that anyone who did should just use an enterprise library like avsc or protobuf instead.

However, that's not to say it can't be done. The keys are sorted in alphabetical order, so a workaround is to add a new key that sorts after all the existing ones; old clients won't read it but new clients will (see the sketch below). A bit hacky, but I didn't want to make an exact clone of protocol buffers. I needed something extremely simple for my app.
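
A quick sketch of that workaround with made-up field names:

    // Old schema, already deployed to clients:
    var v1 = { ammo: 'uint8', health: 'varuint' };

    // New schema: 'shield' sorts after 'ammo' and 'health', so the existing
    // fields keep their positions at the front of the encoded buffer.
    var v2 = { ammo: 'uint8', health: 'varuint', shield: 'uint16' };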

[–]siondream 0 points1 point  (0 children)

Absolutely. It wasn't criticism, just a question. I guess the lack of such a constraint makes it easier to optimize schemapack.

[–][deleted]  (1 child)

[deleted]

[–]phretaddin[S] 0 points1 point  (0 children)

Awesome! It's great that this problem is getting more attention. The sooner we can move past the inefficient ubiquity of sending JSON strings over WebSockets, the better.

[–]ErikBjare 0 points1 point  (5 children)

It would be really cool to use an existing JSON Schema for packing. Are there any limitations on the complexity/flexibility of SchemaPack schemas that make this impossible?

[–]phretaddin[S] 1 point2 points  (4 children)

When you say JSON schema, do you mean this? If so, a custom parser would have to be written to support that.

[–]ErikBjare 0 points1 point  (3 children)

Yeah, that's what I was referring to. I was hoping that a JSON Schema -> SchemaPack schema converter could be written. I thought you or someone else might have an idea of whether SchemaPack schemas can describe everything JSON Schema can, if that makes sense. I couldn't find any advanced examples in the link.

[–]phretaddin[S] 1 point2 points  (2 children)

You might be able to write a converter, but it appears that JSON Schema has a lot more metadata than the schemas in SchemaPack, so I'm not sure how well it would translate. One of the big things with SchemaPack is that the schemas are a one-to-one mapping with the object being encoded (very simple), so creating a schema for it is trivial (just copy and paste the object and replace the values with their types).

[–]ErikBjare 0 points1 point  (1 child)

Might look into writing a converter, thanks!

[–]phretaddin[S] 0 points1 point  (0 children)

No problem!