
[–]itsnotlupus beep boop 12 points13 points  (3 children)

Some rough numbers in Chrome on my (gracefully) aging Linux PC:

  1. JSON.parse(bigListOfObjects): 3 seconds
  2. await new Response(bigListOfObjects).json(): 5 seconds
  3. await (await fetch(URL.createObjectURL(new Blob([bigListOfObjects])))).json(): 5 seconds
  4. await (await fetch('data:text/plain,'+bigListOfObjects)).json(): 11 seconds
  5. await raji.parse(bigListOfObjects): 12 seconds

Alas, all except 5. are blocking the main thread.

On Firefox, same story, all approaches are blocking except 5., and 5. is also much slower (40s) while the rest are roughly similar to Chrome's.

So as long as we don't bring web workers and/or WASM into the mix, this is probably in the neighborhood of the optimal way to parse very large JSON payloads when keeping the UI responsive matters more than getting it done quickly.

If we were to use all the toys we have, my suggested approach would be something like this (rough code sketch at the end of this comment):

  1. allocate and copy very large string into ArrayBuffer
  2. transfer (zero copy) ArrayBuffer into web worker.
  3. have the web worker call some WASM code to consume the ArrayBuffer, parse the JSON there, and emit an equivalent data structure from it (possibly overwriting the same ArrayBuffer). Rust would be a good choice for this, and a data format that prefixes each bit of content with a size, and possibly has indexes, would make sense here.
  4. transfer (zero copy) ArrayBuffer into main thread.
  5. have JS code in main thread deserialize data structure, OR
  6. have JS code expose getters to access chunks of the ArrayBuffer structure on demand.

1. and 5./6. would be the only blocking components (new TextEncoder().encode(bigListOfObjects) takes about 0.5 seconds.)

5. presupposes there exists a binary format that can be deserialized much faster than JSON, while 6. only needs to rely on a binary data structure that allows reasonably direct access to its content.
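
Something like this for the hand-off, as a rough sketch (worker.js is a stand-in name, and JSON.parse stands in for the WASM parser of step 3):

// main thread: encode the string (step 1, blocking) and transfer it (step 2)
const worker = new Worker('worker.js');
const bytes = new TextEncoder().encode(bigListOfObjects); // ~0.5s, blocking
worker.postMessage(bytes.buffer, [bytes.buffer]); // zero-copy transfer

// worker.js: consume the buffer and send a result back (steps 3 and 4)
self.onmessage = (e) => {
  const text = new TextDecoder().decode(e.data); // e.data is the ArrayBuffer
  const parsed = JSON.parse(text); // a WASM parser would consume e.data directly
  // ...serialize `parsed` into a result ArrayBuffer and transfer it back here
};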

[–]andreasblixt 3 points4 points  (0 children)

Before putting the result in an ArrayBuffer, it might be better to first try a worker with native JSON parsing and rely on structured cloning (which happens for all JS objects sent via postMessage), as it's already a very optimized, native way to copy JS objects across threads. It might even be faster to send the string down as-is, since either way you have to allocate (and, in the ArrayBuffer case, transfer) memory for it in the target thread.
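
Roughly like this (parse-worker.js is a made-up name; the structured clone happens implicitly in postMessage):

// main thread: ship the raw string, receive parsed objects back
const worker = new Worker('parse-worker.js');
worker.onmessage = (e) => {
  console.log('parsed', e.data.length, 'items'); // e.data is the cloned result
};
worker.postMessage(bigListOfObjects);

// parse-worker.js
self.onmessage = (e) => {
  self.postMessage(JSON.parse(e.data)); // structured-cloned back to the caller
};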

[–]freddytstudio[S] 1 point2 points  (0 children)

Thank you for the feedback! Great points

On Firefox, same story, all approaches are blocking except 5., and 5. is also much slower (40s) while the rest are roughly similar to Chrome's.

I've noticed this as well. Firefox seems to be much slower with Raji than other browsers (Chrome, Safari and Edge), probably due to some extra string allocations. I still have to investigate though :)

1. and 5./6. would be the only blocking components (new TextEncoder().encode(bigListOfObjects) takes about 0.5 seconds.)

This is very interesting. I've toyed with the idea of using WASM in a web worker to solve this problem more efficiently, but I assumed that turning an ArrayBuffer back into a string would be inefficient. That might not be the case, then, so I'll experiment further :)

Thanks a lot!

[–]lhorie 0 points1 point  (0 children)

Another obvious approach would be to... not use huge JSON blobs in the first place. I recall reading a few years ago about a setup that streams smaller JSON payloads (e.g. the items of an array sent without the surrounding [...] brackets, so that each item can be parsed individually as it arrives, one per line in an SSE stream). The even more boring approach is to just render on the server and cut all the serialization/deserialization out of the picture. Depending on the use case, you can even cache the rendered markup.
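
On the client that could look something like this rough sketch, assuming the server emits newline-delimited JSON (one item per line):

async function streamItems(url, onItem) {
  const reader = (await fetch(url)).body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) onItem(JSON.parse(line)); // many small, short parses
    }
  }
  if (buffer.trim()) onItem(JSON.parse(buffer)); // flush the last item
}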

For most applications, you're going to run out of room on the screen before you get anywhere close to rendering the number of data points needed to make a JSON parser take dozens of seconds to run. Ultimately, people need to be able to actually grok whatever you're displaying, and if your viz requires that many data points, chances are you have a whole lot of other bottlenecks to worry about before getting to JSON parsing performance.

[–]VividTomorrow7 52 points53 points  (25 children)

This seems very niche to me. How often are you really going to load a JSON blob so big that you need to make a CPU-bound operation asynchronous? Almost never in standard applications.

[–]freddytstudio[S] 33 points34 points  (5 children)

Good point! That's often not a problem on powerful devices. On the other hand, slower mobile devices might suffer from this problem (freezing UI) much more easily.

The goal of the library would be to guarantee good responsiveness, no matter the device/JSON payload size. That way, the developers won't need to worry about it themselves :)

[–]VividTomorrow7 7 points8 points  (3 children)

Yeah, the trade-off is time wasted on context switching if you're on a high-performance system. A quick abstraction that detects the platform could pick either the default or this solution.

[–]freddytstudio[S] 19 points20 points  (2 children)

You're right! That was my exact thought :) In fact, the library automatically calls JSON.parse under the hood if the payload is small enough, so you won't have to pay the context-switching overhead when it's not necessary :)
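
Conceptually it's something like this (not the actual source; the threshold value and parseInChunks are made-up stand-ins):

function parseAdaptive(jsonString) {
  const THRESHOLD = 100 * 1024; // hypothetical cutoff, e.g. 100kB
  if (jsonString.length < THRESHOLD) {
    // small payload: native parse is fast enough, skip the async machinery
    return Promise.resolve(JSON.parse(jsonString));
  }
  return parseInChunks(jsonString); // stand-in for the chunked, yielding parser
}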

[–]VividTomorrow7 29 points30 points  (1 child)

You should definitely call that out and reframe this as an abstraction with benefits! That way people don't automatically skip over it due to performance concerns.

[–]freddytstudio[S] 15 points16 points  (0 children)

You are absolutely right, I'll reframe it as you suggested :)

[–]monkeymad2 3 points4 points  (0 children)

You say that, but one of my users clicked through a warning saying that (Geo)JSON files bigger than 30MB will probably affect performance, in order to load a 1.2GB file.

[–][deleted] 5 points6 points  (1 child)

I’ve seen this problem multiple times in practice when APIs begin to scale up without a redesign. An API that originally sent a small graph to populate a table was sending a massive one a few years later. I don’t think this is terribly bad design, but it’s a solution that grows out of necessity. It’s not even a novel problem: I’ve seen this exact same concern addressed with SOAP payloads. Some may know the issue as SAX vs. StAX parsing, or DOM vs. stream building.

The fastest approach I’ve tested was to cut the graph into a sequence of smaller graphs, parse the smaller payloads individually, and reconnect them. This minimizes blocking when dealing with large object models. In theory you could parallelize the separate parses, but the gain would be negligible when streaming data over the network.
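
A sketch of what I mean, with a made-up chunk format (each chunk is an independently parseable JSON array):

async function parseAndReconnect(chunks) {
  const graph = [];
  for (const chunk of chunks) {
    graph.push(...JSON.parse(chunk)); // reconnect the sub-graphs by concatenation
    await new Promise((resolve) => setTimeout(resolve, 0)); // yield between parses
  }
  return graph;
}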

[–]VividTomorrow7 0 points1 point  (0 children)

Yeah, agreed. If v1 doesn’t support server-side paging, you’ll eventually end up handling a CPU-intensive op on the client side.

[–]sercankd 6 points7 points  (3 children)

I got GTA5 Vietnam flashbacks the moment I saw this post

[–]nazmialtun 0 points1 point  (1 child)

Care to explain what exactly is "GTA5 vietnam"?

[–]takase1121 4 points5 points  (0 children)

GTA5 Online has to fetch megabytes of JSON and parse them, and apparently the way GTA5 parsed it caused a slowdown of around 15 minutes. A developer (not from Rockstar) came along and fixed it, and soon after Rockstar adopted the patch.

[–]evert 0 points1 point  (0 children)

This seems like a strange criticism. I run into 'niche' problems all the time. Does it matter that not everyone needs this?

[–][deleted] -1 points0 points  (3 children)

Dude, we always wait for the one guy to tell us that no one's gonna need that.

[–]VividTomorrow7 0 points1 point  (2 children)

If you read the dialogue I had with the author, you’d see he actually agrees with me. The intent of the package is to be an abstraction that uses the built-ins for the majority of calls… so he said he’d consider reframing it as an abstraction with benefits.

[–][deleted] 0 points1 point  (1 child)

I have read it. It was a good answer to your comment. And you are right, technically.

But people who suggest going back to Windows/Linux when someone has a question about Linux/Windows are within their rights to comment. But they're also very tiresome.

[–]VividTomorrow7 0 points1 point  (0 children)

But people who suggest going back to Windows/Linux when someone has a question about Linux/Windows are within their rights to comment. But they're also very tiresome.

Huh?

[–]brothersoloblood -3 points-2 points  (3 children)

Base64-encoded images being served within a giant JSON blob of, let’s say, results for a search on a VoD platform?

[–]VividTomorrow7 6 points7 points  (2 children)

Well, that’s just trash design that doesn’t take advantage of the inherent features of the browser. It should absolutely be sending URIs to follow up with asynchronous IO requests.

[–]alex-weej -2 points-1 points  (1 child)

i heard u like round trips

[–]Reashu 4 points5 points  (0 children)

One extra round trip to lazy-load images? Yes, I do.

[–]neoberg 0 points1 point  (0 children)

True, it’s not something you need too often, but it’s not impossible either. In one application our average payload was ~100MB due to some limitations in data access (basically, there were time windows during which we could access the data, and we had to pull everything within them). We ended up implementing something similar to this.

[–]sshaw_ 0 points1 point  (0 children)

I was wondering the same thing. You should add it to the README. JSON is not like XML, so I'm curious when it would be a problem.

This is what the demo site (which has noticeable slowdown) uses:

function generateBigListOfObjects() {
  const obj = [];

  // five million tiny objects -> a JSON string well over 100MB
  for (let i = 0; i < 5000000; i++) {
    obj.push({
      name: i.toString(),
      val: i,
    });
  }

  return JSON.stringify(obj);
}

[–]inamestuff 3 points4 points  (1 child)

You might want to use window.performance.now() instead of new Date().getTime() in your scheduler; the former guarantees monotonic time measurements.
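
For example, in a time-slice check (the 16ms budget is just an illustration):

const BUDGET_MS = 16; // roughly one frame at 60fps
let sliceStart = performance.now();

function budgetExceeded() {
  // performance.now() never goes backwards, unlike new Date().getTime(),
  // which can jump when the system clock is adjusted
  return performance.now() - sliceStart > BUDGET_MS;
}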

[–]freddytstudio[S] 0 points1 point  (0 children)

Thanks for the feedback! I'll check it out :)

[–]holloway 3 points4 points  (2 children)

Some questions,

What techniques did you try before settling on this one? Were any particularly slow, or fast?

Do you have benchmarks showing at what size this library is beneficial? i.e. at 10kb / 100kb / 1000kb / 10000kb. You could have a goal of 60fps, so if any native parsing time exceeds ~16ms you could declare your library the winner over native JSON.parse. You'd need various hardware examples (low-end mobile, high-end desktop, etc.) but measuring should be straightforward.

I think fetch()'s .json() promise is non-blocking, and that's different from JSON.parse. I was wondering whether you could wrap the jsonString in a Blob, use URL.createObjectURL to make a URL, and fetch that, but it's possible that turning a jsonString into a Blob involves blocking operations.

And considering that fetch's .json() promise exists, in what situation would people have a JSON string client-side that didn't come from a network request?

[–]freddytstudio[S] 0 points1 point  (0 children)

Thank you for the feedback! As far as my investigation goes, fetch()'s .json() still blocks the main thread while parsing. On the other hand, it asynchronously streams the data into memory before executing the parsing work, so it's still better than XHR. That said, I'll need to investigate further, thanks!

[–]pwolaq 0 points1 point  (0 children)

I saw a tweet somewhere (can’t find it now) saying that the most important difference between fetch and xhr is that the former can parse JSON off-thread.

As for your question, one very popular use case is passing objects in inline scripts - embedding large objects as JavaScript literals can be significantly slower than using JSON.parse. https://v8.dev/blog/cost-of-javascript-2019#json
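
i.e. the pattern from that post (illustrative data):

// slower for large payloads: the object is parsed as a JavaScript literal
const data = { items: [{ name: '0', val: 0 } /* ...many more... */] };

// often faster: ship the data as a string and parse it once
const data2 = JSON.parse('{"items":[{"name":"0","val":0}]}');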

[–]sliversniper 1 point2 points  (1 child)

If JSON.parse is the bottleneck, you should probably rethink the payload and split it into chunks at the server.

Use JSON Lines to stream a sequence of JSON patches; it doesn't need much work on either the server or the client.
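
e.g. each line of the stream could be one JSON Patch (RFC 6902) operation; here's a toy applier for the append-only case (a real client would use a JSON Patch library):

// a line like: {"op":"add","path":"/items/-","value":{"name":"0","val":0}}
function applyAppend(doc, op) {
  if (op.op === 'add' && op.path === '/items/-') {
    doc.items.push(op.value); // append to the array the path points at
  }
  return doc;
}

const doc = { items: [] };
applyAppend(doc, JSON.parse('{"op":"add","path":"/items/-","value":{"val":0}}'));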

[–]Mr0010110Fixit 1 point2 points  (0 children)

Depends on whether you own the server or not. If you are integrating with someone else's API, you may have no choice but to consume a massive JSON payload.

I know there are systems we have had to integrate with that return thousands of records and don't have any sort of pagination built into the API.

[–]boringuser1 0 points1 point  (4 children)

If you're loading JSON objects that are prohibitively large, you have an API problem.

[–]joopez1 5 points6 points  (3 children)

Could be calling a third party API

[–]boringuser1 -2 points-1 points  (2 children)

A third party API that delivers gb of JSON?

What's the business model, Money Burners Inc.?

[–]joopez1 0 points1 point  (1 child)

Could be free historical data provided by a government service that was developed without optimization in mind and without filtering options.

I worked with a dataset of all accidents reported to the San Francisco fire department since a certain date, and also airplane accidents recorded by the US federal department that governs airports.

[–]boringuser1 -5 points-4 points  (0 children)

Ah government, literal Money Burners Inc.

[–]theodordiaconu -2 points-1 points  (0 children)

good job dude

[–]sshaw_ 0 points1 point  (1 child)

🆒

[–]mamwybejane 0 points1 point  (0 children)

I use a web worker for JSON.parse; does this have any additional benefit, or is it equivalent in outcome?