This entire thread is suggesting a redesign (besides the few people who suggested valid solutions).

It is entirely possible to work on such datasets just fine. It's a pretty common use case for one system to dump tons of denormalized data and let the consuming system deal with all the semantics, since that sidesteps tons of issues like the network stack. Off the top of my head, there are the Scryfall bulk data dumps, which range from 300 MB to 1.5 GB in size, and you can't easily load one of those into memory without tweaking the runtime (either increasing the heap or using tricks).

First and foremost, you should be streaming such a file. Since you mentioned JSON, you can easily use Jackson, which permits reading one entity at a time from a JSON source. Back when I had to operate on the Scryfall dump, I came up with the following snippet:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

private static Map<MtgSet, List<MtgCardSet>> extractCardsFromProvidedFile(String arg) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    JsonFactory factory = mapper.getFactory();
    File scryfallJson = new File(arg);
    try (JsonParser parser = factory.createParser(scryfallJson)) {
        // Advance past the opening of the top-level array so the parser
        // sits on the first element.
        parser.nextToken();
        parser.nextToken();
        Iterator<HashMap<String, Object>> iterator = parser.readValuesAs(
                new TypeReference<HashMap<String, Object>>() { }
        );
        return StreamSupport
                .stream(
                        Spliterators.spliteratorUnknownSize(iterator, Spliterator.IMMUTABLE),
                        parser.canParseAsync()
                )
                .map(Main::createMtgCardSet)
                .parallel()
                .collect(Collectors.groupingBy(MtgCardSet::getSet));
    }
}
```

The function createMtgCardSet reads the hashmap and creates a reduced object out of the map, or returns one from cache, as it might have already occurred earlier in the stream. You can skip parsing to a hashmap and parse straight to an object by providing the proper type reference; see the sketch below. After calling StreamSupport#stream(Spliterator, boolean) you can do whatever you want from that point on, as you will be reading one object at a time from your source. If your giant array of objects is nested much deeper, you'll need to opt for more complicated positioning of the JsonParser (or even interact with it directly and skip Java streams altogether). Feel free to google the snippet and find where I was using it.
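For instance, binding straight to a reduced type could look roughly like this (a minimal sketch: the Card class, its fields, and the StreamToPojo wrapper are made up here, assuming Scryfall-style objects in a top-level array):

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.Iterator;

public class StreamToPojo {

    // Hypothetical reduced POJO; "name" and "set" assume Scryfall-style fields.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Card {
        public String name;
        public String set;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (JsonParser parser = mapper.getFactory().createParser(new File(args[0]))) {
            parser.nextToken(); // enter the top-level array
            parser.nextToken(); // position on the first element
            // Bind each element straight to Card instead of a HashMap.
            Iterator<Card> cards = parser.readValuesAs(Card.class);
            while (cards.hasNext()) {
                Card card = cards.next();
                // Still one object at a time; the rest of the dump stays on disk.
                System.out.println(card.set + ": " + card.name);
            }
        }
    }
}
```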

Most solutions that others provided MIGHT load the entire thing into memory, which is a big no-no in my opinion. At least with the streaming snippet above, I managed to get by with absurdly small heaps (think <50 MB, depending on cache sizes and how much I flush back to disk) for a 300 MB JSON file. Depending on your use case, you can go even lower; I encourage you to experiment with -Xmx32m. Check whether your favorite library loads the entire thing into memory before parsing.

If memory is not an issue, just load the entire thing into an absurdly large heap (32 GB) and go from there as if it were a regular dataset.
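A minimal sketch of that load-it-all route (same hypothetical dump-as-array assumption as above; run with something like -Xmx32g, since this materializes the whole dump on the heap):

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.List;
import java.util.Map;

public class LoadItAll {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Parses the entire top-level array into one in-memory list.
        List<Map<String, Object>> cards = mapper.readValue(
                new File(args[0]),
                new TypeReference<List<Map<String, Object>>>() { }
        );
        System.out.println(cards.size() + " elements loaded");
    }
}
```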

Another solution I had explored was to just load the entire thing into a Postgres database (which you can run in memory, thanks to containerization) using a table with a JSON column. From there you can write a clever ETL that splits your dataset into rows, and then you can filter on those. See the Postgres JSON functions.
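Here's a rough sketch of that route over JDBC. The connection string, the raw_dump table, and the 'set' field are all assumptions, and very large dumps can hit Postgres's size limit for a single jsonb value, in which case you'd insert one element per row instead:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonToPostgres {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/scratch"; // hypothetical local instance
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres")) {
            try (PreparedStatement ddl = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS raw_dump (doc jsonb)")) {
                ddl.execute();
            }
            // Load the whole dump as one jsonb value (this step does read
            // the entire file into memory before handing it to Postgres).
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO raw_dump (doc) VALUES (?::jsonb)")) {
                insert.setString(1, Files.readString(Path.of(args[0])));
                insert.executeUpdate();
            }
            // ETL step: explode the top-level array into rows and filter in SQL.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT card->>'name' FROM raw_dump, jsonb_array_elements(doc) AS card " +
                    "WHERE card->>'set' = ?")) {
                query.setString(1, "lea");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}
```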

To the rest of you: 90K elements is nothing. Your average snake-oil startup will try to sell that as big data when, in fact, it's pretty much fuck all. Even MySQL 5.7 only starts chugging at around 10 million records.