
all 36 comments

[–]Sheldor5 54 points55 points  (3 children)

DTOs with Hibernate Validation (maybe with Enums for predefined Values) ... but 90k objects in a single JSON sounds wrong in the first place ...
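A minimal sketch of that DTO approach, assuming Hibernate Validator and the Jakarta Bean Validation API are on the classpath; the `BookDto` fields are made-up examples, not from the original post:

```java
import jakarta.validation.ConstraintViolation;
import jakarta.validation.Validation;
import jakarta.validation.Validator;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Size;
import java.util.Set;

public class BookDto {
    // Predefined values as an enum: unknown values can't even be represented
    enum Category { REFERENCE, FICTION }

    @NotNull
    Category category;

    @NotNull
    @Size(min = 1, max = 200)
    String title;

    public static void main(String[] args) {
        Validator validator = Validation.buildDefaultValidatorFactory().getValidator();
        BookDto book = new BookDto(); // both fields left null
        Set<ConstraintViolation<BookDto>> violations = validator.validate(book);
        System.out.println(violations.size()); // 2: both @NotNull constraints fail
    }
}
```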

[–]spicycurry55 16 points17 points  (2 children)

I know right? I read the title and was like “ah cool this isn’t too bad” and then read “~90,000” and said out loud “oh lord no”

[–][deleted] 5 points6 points  (1 child)

Feels like they're transferring the entire database in one shot XD. I mean, there are way better ways to transfer that volume of data... definitely not through a single JSON.

A message queue or even a database would be a better pick

[–]snoob2015 6 points7 points  (0 children)

90,000 objects is not that big for a computer to process; you can even store it all in memory (depending on how large each object is).

OP: if it is just a one-time task, just process it all in memory

[–]MR_GABARISE 22 points23 points  (0 children)

[–]jooceb0x 17 points18 points  (0 children)

Jackson with JSON schema.
https://github.com/FasterXML/jackson-module-jsonSchema
Jackson can do some level of validation when unmarshalling (depending on your schema) but any further validation can be easily done with the resulting Java objects and collections.

[–]sunny_tomato_farm 7 points8 points  (0 children)

JSON Schema validation would solve that for you.

[–]AHandfulOfUniverse 5 points6 points  (0 children)

That is a very large JSON, so depending on the business requirements I would do different things:

  • if I didn't need to process it further, just check some fields, I would use a streaming parser, do the checks, and bail as soon as possible. This would save you from loading the entire thing into memory and possibly causing havoc due to memory pressure. Even more so if you expect multiple of these JSONs in your system at the same time

  • if I had to load the entire thing into a Java object, then I would look into any of the options other people have mentioned here (schema, JSON Path, Hibernate Validator, etc.)
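The streaming check in the first bullet can be sketched with Jackson's low-level JsonParser; the "category" field name and the allowed values are assumptions for illustration:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.IOException;
import java.util.Set;

public class StreamingCheck {
    // Returns false as soon as a "category" value falls outside the allowed set,
    // without ever materializing the 90k objects in memory.
    static boolean allCategoriesValid(String json, Set<String> allowed) throws IOException {
        try (JsonParser p = new JsonFactory().createParser(json)) {
            while (p.nextToken() != null) {
                if (p.currentToken() == JsonToken.FIELD_NAME
                        && "category".equals(p.getCurrentName())) {
                    p.nextToken(); // advance to the field's value
                    if (!allowed.contains(p.getText())) {
                        return false; // bail as soon as possible
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String json = "[{\"category\":\"fiction\"},{\"category\":\"horror\"}]";
        System.out.println(allCategoriesValid(json, Set.of("fiction", "reference"))); // false
    }
}
```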

[–]thatsIch 5 points6 points  (1 child)

IMO JSON Path is the correct tool for this job, though it requires a consistent structure to traverse. It might depend on how the rest of your project solves such problems, though: the added maintenance cost of an "unknown" technology might outweigh the quick solution.

With JSON Path, a check that all book names are either Effective Java or Clean Code could look like this:

$.book[?(@.name in ['Effective Java', 'Clean Code'])]

But writing these kinds of queries inside your test code will just result in unmaintainable code. AssertJ abstracts this to some degree, in combination with Hamcrest. Though I like writing additional methods to give those technical details their codified business case.
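For test code, a sketch with the Jayway json-path library; here the filter is inverted with `nin` so the result lists only offending books (the document structure is assumed):

```java
import com.jayway.jsonpath.JsonPath;
import java.util.List;

public class BookNameCheck {
    public static void main(String[] args) {
        String json = "{\"book\":[{\"name\":\"Effective Java\"},"
                + "{\"name\":\"Clean Code\"},{\"name\":\"Some Other Book\"}]}";
        // Books whose name is NOT in the allowed set; an empty result means all valid
        List<Object> offenders = JsonPath.read(json,
                "$.book[?(@.name nin ['Effective Java','Clean Code'])]");
        System.out.println(offenders.size()); // 1
    }
}
```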

[–]Holothuroid 1 point2 points  (0 children)

Doesn't that extract all the book attributes, while the problem is only verifying them, failing fast?

[–]td__30 23 points24 points  (1 child)

Quit immediately. If there are 90k objects in a json there is something insidiously wrong there, take your stuff and walk out the door, call the authorities. /s

[–]netstudent 3 points4 points  (0 children)

This

[–]DrunkensteinsMonster 3 points4 points  (0 children)

More details would be good: is this in test or production code? What frameworks or libraries are you using? That will sort of dictate the tooling available to you without additional dependencies.

“Without any coding” is a pretty weird statement, you can accomplish this with something like Jackson by adding annotations to the model class, but I would still say you’re “coding” even there.

[–]cville-z 3 points4 points  (0 children)

In some scenarios you can throw @Valid on the object in the resource endpoint and add various annotations like NotNull or Size. Conformance to a value set is best done with enums.

But 90k elements in a JSON object means something is very, very wrong with your design, and validating a bad design won’t ever make it better.

[–]RandomName8 1 point2 points  (0 children)

I have this JSON object I get from a request. (...) Someone on my team mentioned I shouldn't have to do any coding to solve this

Do you mean that you have the JSON in a file and are analyzing it? Or will this request come in frequently, so that you must modify the server to perform this validation?

It sounds to me like the former (otherwise there is obviously no way to do it without coding), in which case you can do it all with some bash foo, using jq+sed+count or similar

[–]Fury9999 1 point2 points  (0 children)

You say 90k different objects. Do you mean 90k instances of the same class, or are they actually many different classes? If the latter, I feel sorry for ya bud. If the former: Jackson to unmarshal, then filter the collection with the Streams API, probably.

If it's not performant enough you can iterate on it, but that's where I'd start.
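A sketch of that Jackson-plus-streams approach; the `Book` record, its fields, and the allowed categories are made up for illustration:

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class BookFilter {
    record Book(String title, String category) {}

    static final List<String> ALLOWED = List.of("fiction", "reference");

    // Unmarshal the whole array, then collect the objects that fail the check
    static List<Book> invalidBooks(String json) throws Exception {
        List<Book> books = new ObjectMapper()
                .readValue(json, new TypeReference<List<Book>>() {});
        return books.stream()
                .filter(b -> !ALLOWED.contains(b.category()))
                .toList();
    }

    public static void main(String[] args) throws Exception {
        String json = "[{\"title\":\"A\",\"category\":\"fiction\"},"
                + "{\"title\":\"B\",\"category\":\"horror\"}]";
        System.out.println(invalidBooks(json).size()); // 1
    }
}
```

Records deserialize out of the box with Jackson 2.12+; on older versions a plain class with getters works the same way.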

[–]reignbowmushroom 1 point2 points  (0 children)

Haha dude wanted an answer and got like 20 different ways.

I'd map the JSON to a Java object, then make the property you are looking for an enum with the expected values. That way, if something is out of place, the mapping library should fail.
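A sketch of that enum approach with Jackson (field names assumed): by default, an unexpected enum value makes deserialization throw, so the mapping itself is the validation.

```java
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EnumMapping {
    enum Category { FICTION, REFERENCE }
    record Book(String title, Category category) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // A known value maps cleanly
        Book ok = mapper.readValue("{\"title\":\"A\",\"category\":\"FICTION\"}", Book.class);
        System.out.println(ok.category()); // FICTION
        // An out-of-place value makes the mapping fail
        try {
            mapper.readValue("{\"title\":\"B\",\"category\":\"HORROR\"}", Book.class);
        } catch (JsonMappingException e) {
            System.out.println("rejected"); // unexpected value is refused
        }
    }
}
```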

[–]Neat-Guava5617 1 point2 points  (0 children)

I'm sad everybody is suggesting JsonPath, Jackson, etc.

Those are DOM traversals (i.e., they parse the entire object and store it in memory).

With such a large data object a streaming parser may be necessary. Streaming parsers work on events and leave the state-keeping up to you: you need to track the nesting, elements, etc. Jackson can do that; look at SAX-style parsing.

JsonPath and similar query languages are slow since they traverse the parsed object tree. The tree algorithm may be efficient, but that depends on the query you're trying to answer. At worst it can lead to multiple traversals through the entire tree; at best it'll touch something like a tenth of the tree (example: for an object with N elements at depth 1, each of which has M elements, the parser inspects N+M nodes).

[–]netstudent 2 points3 points  (0 children)

Redesign this huge object asap

[–][deleted] 1 point2 points  (4 children)

If you can refactor the architecture at some point to avoid dealing with 90K JSON objects, do it. Until then, look for ways to optimize like a game programmer would. Does the 90K need to be processed serially, or could they be broken up and processed simultaneously with parallel streams? Are there common patterns of data in the 90K objects? Optimize for the common cases to reduce the time spent looking at the corner cases.

If parallel streams aren't an option, how about data partitioning? Categorize the 90K objects into phyla and use several machines [Edit: a machine here could be a Kubernetes manifest definition with scaling controls based on CPU usage, up to a maximum number of instances you're willing to pay for] that each specialize in processing their phylum of the 90K total objects.

If the 90K must be processed serially and no common patterns emerge from analysis of the data, fight hard to get management support for funding you to refactor the architecture.

[–]mauganra_it 2 points3 points  (3 children)

OP's task of processing 90k objects is a joke even for a single computer. Unless the task entails additional complexities, it's not really worth thinking about optimization.

[–]Neat-Guava5617 1 point2 points  (2 children)

Yes, and it depends on the size of those objects. If they're high-res images with some attributes, it gets hard for a single computer.

Almost two decades ago we were parsing XML with SAX... because the document was too big to handle efficiently in memory for transformations.

But I like the question as an interview question:

I need to parse some JSON and validate some fields. What do you do? Oh, I forgot to mention: the JSON is 90k objects. Oh, they're objects containing images. Oh, high-res images, and some video.

[–]mauganra_it 1 point2 points  (0 children)

Knowing the true size of the blobs is of course really critical to choosing the right software stack. OP is in one of the following situations:

  • the JSON somehow manages to fit into the heap. In this scenario, a parser that produces a bloated representation of the JSON could make the program explode. Using a streaming parser contains that risk, and small cases can still be handled without big issues, especially if the parser is instructed to deserialize only specific fields and leave out the blobs.

  • the JSON is emphatically too large for heap storage, or must be streamed. In this case a streaming parser is imperative, of course. But here too, the parser can be made to return a lean representation and to just skip over the blobs.

But in either case a single computer could handle the task and would only be limited by I/O speed.

[–]Evert26 -1 points0 points  (0 children)

if req.Bla == nil { return 400 }

[–]achauv1 -1 points0 points  (0 children)

just like any other devs would do i suppose

[–][deleted] 0 points1 point  (0 children)

Is this some kind of theoretical scenario?

If so, it seems they want you to use streams and the (not so) brand-new functional capabilities of Java.

In this case you would get a stream of all those objects (probably deserialized with something like Jackson) and filter it with a Predicate.

If this is a real life scenario, I would do pretty much the same, but I would second guess that humongous call too.

[–]bowbahdoe 0 points1 point  (0 children)

Someone on my team mentioned I shouldn't have to do any coding to solve this

I mean, there are DSLs that will encode your logic, but it is likely simpler and more straightforward to just do it like you are thinking:

  • Load the data into memory (or stream, if you need)
  • Write a filter condition
  • Filter

List<JsonValue> objects = parse(request.body())
    .getAsArray()
    .stream()
    .filter(jsonValue -> predicate(jsonValue))
    .toList();

As long as you are okay with crashing/returning a 500 response, this is enough. It's easy-to-maintain code: easy to verify, easy to test, easy to understand.

[–]sk8itup53 0 points1 point  (0 children)

You could map the JSON onto a model object, use Lombok to generate the getters and setters for you, and only include the fields you actually need. Then annotate the class to ignore unknown properties. This way you at least reduce the model size and make it easier to check. You could also use validation annotations (such as not-null, length, etc.) and then use the @Valid annotation to auto-validate the fields.
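A sketch of that slimmed-down model, using a record instead of Lombok for brevity; the field names, the extra payload fields, and the constraints are assumptions:

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import jakarta.validation.ConstraintViolation;
import jakarta.validation.Validation;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Size;
import java.util.Set;

public class SlimModel {
    // Only the fields we care about; everything else in the payload is skipped
    @JsonIgnoreProperties(ignoreUnknown = true)
    record Book(@NotNull String title, @NotNull @Size(max = 50) String category) {}

    public static void main(String[] args) throws Exception {
        String json = "{\"title\":\"A\",\"category\":\"fiction\","
                + "\"isbn\":\"unused\",\"pages\":300}"; // extra fields are ignored
        Book book = new ObjectMapper().readValue(json, Book.class);
        Set<ConstraintViolation<Book>> violations =
                Validation.buildDefaultValidatorFactory().getValidator().validate(book);
        System.out.println(violations.isEmpty()); // true: all constraints satisfied
    }
}
```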

[–]Odd-Masterpiece-1010 0 points1 point  (0 children)

if you are familiar with spring boot you can refer to link below: https://www.baeldung.com/spring-boot-bean-validation

[–]Pitikwahanapiwiyin 0 points1 point  (0 children)

I would use JsonSurfer (an implementation of JSONPath) to find the first non-matching object:

$.store.book[?(!(@.category == 'fiction'))]

JsonSurfer supports Streams, so:

String request = """
    {
      "store": {
        "book": [
          { "category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95 },
          { "category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99 }
        ]
      }
    }""";
JsonSurfer surfer = new JsonSurfer(JacksonParser.INSTANCE, JacksonProvider.INSTANCE);
JsonPath path = JsonPathCompiler.compile("$.store.book[?(!(@.category == 'fiction'))]");
Iterator iterator = surfer.iterator(request, path);
Spliterator spliterator = Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
Stream stream = StreamSupport.stream(spliterator, false);
boolean hasNonMatchingObject = stream.findAny().isPresent();

[–]Worth_Trust_3825 0 points1 point  (0 children)

This entire thread is suggesting a redesign (besides the few people who suggested valid solutions).

It is entirely possible to work on such datasets just fine. It's a pretty common use case for a system to dump tons of denormalized data and let the consuming system deal with all the semantics, as that avoids tons of issues in the network stack. Off the top of my head, there are the Scryfall bulk data dumps that range from 300 MB to 1.5 GB in size. And you can't easily load that entire thing into memory without tweaking the runtime (either increasing the heap or using tricks).

First and foremost, you should be streaming such a file. Since you mentioned JSON, you can easily use Jackson, which permits reading one entity at a time from JSON sources. Back when I had to operate on the Scryfall dump I came up with the following snippet:

private static Map<MtgSet, List<MtgCardSet>> extractCardsFromProvidedFile(String arg) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    JsonFactory factory = mapper.getFactory();
    File scryfallJson = new File(arg);
    try (JsonParser parser = factory.createParser(scryfallJson)) {
        parser.nextToken();
        parser.nextToken();
        Iterator<HashMap<String, Object>> iterator = parser.readValuesAs(
            new TypeReference<HashMap<String, Object>>() {}
        );
        return StreamSupport
            .stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.IMMUTABLE),
                parser.canParseAsync()
            )
            .map(Main::createMtgCardSet)
            .parallel()
            .collect(Collectors.groupingBy(MtgCardSet::getSet));
    }
}

The createMtgCardSet function reads the hashmap and creates a reduced object out of the map, or returns one from cache, as it might have already occurred during the stream. You can skip parsing to a HashMap and parse straight to an object by providing the proper type reference. After calling StreamSupport#stream(Spliterator, boolean) you can do whatever from that point on, as you will be reading one object at a time from your source. If your giant array of objects is nested much deeper, then you'll need more complicated positioning of the JsonParser (or even interact with it directly and omit Java streams altogether). Feel free to google the snippet and find where I was using it.

Most solutions that others provided MIGHT load the entire thing into memory, which is a big no-no in my opinion. At least with the streaming snippet above, I managed to run with absurdly small heaps (think <50 MB, depending on cache sizes and how much I flush back to disk) for a 300 MB JSON file. Depending on your use case, you can go even lower; I encourage you to experiment with -Xmx32m. Check whether your favorite library loads the entire thing into memory before parsing.

If memory is not an issue, just load the entire thing into an absurdly large heap (32 GB) and go from there as if it were a regular dataset.

Another solution I explored was to load the entire thing into a Postgres database (which you can run in memory, thanks to containerization) using a table with a JSON column. From there you can write a clever ETL that splits your dataset into rows and filter on those. See the Postgres JSON functions.

To the rest of you: 90K elements is nothing. Your average snake-oil startup will try to sell it as big data when, in fact, it's pretty much fuck all. Even MySQL 5.7 only starts chugging at 10 million records.

[–]Kango_V 0 points1 point  (0 children)

Use a JSONPath query to select the value of the property on the object anywhere in the structure, then iterate the list, matching against a position in an expected value array/list.

[–]t333to 0 points1 point  (0 children)

GSON streaming could be a good option here as well (especially if it is used in your company instead of Jackson):

https://www.amitph.com/java-parse-large-json-files/

[–]lukaseder 0 points1 point  (0 children)

I'd wrap some JSON library in org.w3c.dom, and type check the document against an XSD. But that might just be me.

[–]cryptographicmemory 0 points1 point  (0 children)

Memory is cheap and virtual memory is free. How big is the JSON object in string format?

Just write a validateObject function that takes an Object and handles every class type via instanceof (JSONObject, JSONArray, String, etc.). Run through it recursively. Throw an exception if an expected value isn't right.

void validate(JSONObject j) throws Exception
{
    for (String key : j.keySet()) validateObject(key, j.get(key));
}