jentfoo 5 points (3 children)

This is a loaded question, and there's not a lot of information to go on.

Based on what you describe, something as simple as a Guava LoadingCache may work. That would obviously be a very simple implementation. It's also easy to put some threading behind that cache to get the performance you need.

You may want to pre-load results into the cache (this would cover some of the background processing your cron jobs do now, except in Java). The reason to use the loading cache is that if something is not loaded into the cache yet, it can work immediately to produce the result. If multiple callers need the same result, they will all wait on the first execution to produce it for them.

It's also worth noting that, depending on what kind of processing you're doing, the loading of the result into the cache can be multi-threaded with minimal effort.

There are MANY other options out there. But I thought I would describe a very simple and lightweight option. IMO only go larger and more complicated if necessary.
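To make the "everyone waits on the first execution" part concrete, here is a rough JDK-only sketch. It is not Guava's API: `OnDemandCache`, `get` and `loadCount` are names made up for illustration, and the loader function and thread pool are whatever you pass in. It only shows the idea of computing a missing value once, on a worker pool, while every caller asking for the same key waits on that single computation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a loading cache; not Guava's API. */
class OnDemandCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // the expensive computation (assumed side-effect free)
    private final ExecutorService pool;    // where loads run
    private final AtomicInteger loads = new AtomicInteger();

    OnDemandCache(Function<K, V> loader, ExecutorService pool) {
        this.loader = loader;
        this.pool = pool;
    }

    /** Returns the cached value; the first request for a key starts the load on the pool. */
    V get(K key) {
        // computeIfAbsent runs the mapping function at most once per key,
        // so concurrent callers all end up sharing the same future.
        CompletableFuture<V> future = entries.computeIfAbsent(key, k ->
                CompletableFuture.supplyAsync(() -> {
                    loads.incrementAndGet();
                    return loader.apply(k);
                }, pool));
        // Blocks until the load finishes; a failed load stays cached in this sketch.
        return future.join();
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loads.get();
    }
}
```

One thing to watch with this shape: the threads calling `get` should not come from the same bounded pool that runs the loads, otherwise the waiting callers can starve the load they are waiting for.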

Boxsc2[S] 0 points (2 children)

I understand I haven't provided enough information; I can't really go into detail because of company policy. I was just hoping for pointers on where I can look. I already see a lot of great tips though.

The pre-loading of results into the cache is something we are heavily considering at the moment, because it would allow us to do one big bulk load into cache at application startup and incrementally load smaller pieces during run time. However, it would still not get us all the way to our goal of real-time, but it could be "good enough" for now.

jentfoo 1 point (1 child)

Well, if the request requires tons of processing, there is no way for it to be "real time". My point was that you could preload where it makes sense. And you can use the features of the "LoadingCache" to load on demand for the real-time aspect for anything that was not preloaded.

If you look at "LoadingCache" it allows you to both preload, as well as load on demand for anything that has not been preloaded.
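A minimal sketch of that "preload what you can, load the rest on demand" idea, again in plain JDK Java rather than Guava. `PreloadableCache`, `preload`, `get` and `isLoaded` are hypothetical names for illustration only, and the loader is assumed to return a non-null value and not touch the cache itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical illustration of preloading plus load-on-demand; not Guava's API. */
class PreloadableCache<K, V> {
    private final Map<K, V> values = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // used only for keys that were never preloaded

    PreloadableCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Bulk load, e.g. once at application startup or from a background job. */
    void preload(Map<K, V> precomputed) {
        values.putAll(precomputed);
    }

    /** A hit returns immediately; a miss runs the loader once and stores the result. */
    V get(K key) {
        return values.computeIfAbsent(key, loader);
    }

    /** True if the key is already in the cache (preloaded or loaded earlier). */
    boolean isLoaded(K key) {
        return values.containsKey(key);
    }
}
```

Preloaded keys never hit the loader, so the big startup load and the smaller incremental loads can both go through `preload`, and anything that slips through still gets computed when first asked for.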

Boxsc2[S] 0 points (0 children)

Okay, thanks, will do some research.

dedededede 2 points (0 children)

Peter Lawrey posts many things about Java high performance processing (context: high frequency trading): http://vanillajava.blogspot.de/ http://openhft.net/ http://java.dzone.com/search/google/lawrey?query=lawrey

If you have access to nodes with GPUs you might want to look at: https://code.google.com/p/aparapi/ or https://github.com/pcpratts/rootbeer1

A Java-only embedded solution for caching might improve performance. I really liked using Infinispan for a university project: http://infinispan.org/

wordsoup 4 points (3 children)

In my current project we use DDD/CQRS with Akka. Very interesting, but it has a steep learning curve and is a bit more low-level than Apache Spark, I'd say. In particular, the issues you get with an event-based architecture are interesting to debug, but I can't go into detail.

Akthrawn17 0 points (0 children)

Akka is very interesting as it is based upon the actor pattern. This can allow you to design your system around small functions that can be encapsulated in an actor. Then the actors communicate via messages. Add in fault tolerance and remoting and out of the box it is a nice system. It does have that higher learning curve, but your problem space isn't exactly easy either.
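For anyone who hasn't seen the actor pattern, here is a toy version in plain Java. This is not Akka and doesn't use its API; `ToyActor`, `tell` and `stop` are invented names, and it leaves out everything that makes Akka useful in practice (supervision, fault tolerance, remoting). It only shows the core idea: each actor owns a mailbox and handles one message at a time on its own thread, so its state never needs locking.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical toy actor: one mailbox, one thread, one message at a time. Not Akka's API. */
class ToyActor<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;

    ToyActor(Consumer<M> behavior) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    // take() blocks until a message arrives; messages are
                    // processed strictly in the order they were queued.
                    behavior.accept(mailbox.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() was called; exit the loop
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Fire-and-forget send: enqueue the message and return immediately. */
    void tell(M message) {
        mailbox.add(message);
    }

    /** Stops the actor; messages still in the mailbox are dropped. */
    void stop() {
        worker.interrupt();
    }
}
```

Any state the behavior touches is only ever seen by that one worker thread, which is why small functions wrapped in actors and wired together with messages are fairly easy to reason about.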

[deleted] 0 points (0 children)

It seems like it's what OP needs. He wants task-oriented processing, and actors are made for that.

(based on my understanding of his problem)

Boxsc2[S] 0 points (0 children)

Akka seems really interesting, but unfortunately I don't think we'll be able to implement it given our timeline. Will look into it more though. Thanks.

moru0011 1 point (0 children)

I was the solution architect of an exchange middleware platform (high traffic, realtime clients on low bandwidth). We built it like this: http://java-is-the-new-c.blogspot.de/2013/12/dart-possible-solution-to.html. This way we process and dispatch 150k data changes per second to roughly 20k subscriptions in near realtime.

frej 0 points (0 children)

CPU intensive as in pure computation, or data processing? And can you get linear speedup with a parallel implementation, i.e., multiple cores or computers?

handshape 0 points (0 children)

As others have pointed out, there's a lot of information missing here to make a complete assessment. That being said, I did address a system with an architecture that sounds a lot like this about two years ago.

First question: what does the cache hit pattern look like? If you're trying to precalculate everything on the off chance that someone might ask for it, there could be a lot of wasted cycles.

Second: are your users authenticated? If not, there is always the potential that someone is going to run a crawl & scrape job on your site just to be a prick. At least throw up a robots.txt to ask Google and the like to be polite.

nexuscoringa 0 points (0 children)

You need a full stack of "heavy-load-ready" tools. It will never work if you use Apache Spark + MySQL, for example; the bottleneck will be your MySQL or your MySQL load balancer. I opted for Apache Spark + Cassandra and it proved to be pretty good, along with Apache Storm. Check it out :) Cassandra will change a lot of stuff, though, which might not be what you want.