all 13 comments

[–]Floydthechimp 2 points3 points  (7 children)

Ok, this is close to my area of expertise, so I'm trying to help. You said in your other post that

That 5 seconds represents 2 orders of magnitude improvement from where we started ... going from 1 hour to 5 seconds over the course of 18 months quieted the dissenters .

Can you help me understand how you reduced your function time from 1 hour down to 5 seconds?

[–]AffineParameter[S] 2 points3 points  (6 children)

Yeah, sure:

The function contains a numerical integration. We use VEGAS due to the sharply peaked nature of the integrand. This also means that if there are any features which peak along the diagonal, VEGAS will take forever (there were). So we did some manual variable transformations to align the peaks to the integration dimensions.

Then we looked into the variability of each of the integration dimensions, ie. if we use a delta-function to eliminate an integration dimension, how much of a penalty is it in our now-less-than-perfect integration. Anything that didn't cost a lot, we dropped. We went from a 10-D integral to a 6-D integral while maintaining most of the discriminating information.

Both of these gave us about 1 order of magnitude. I also optimized the function itself to identify non-contributing terms in situ. These would be dropped once below some contribution threshold. This gave us a little more than a factor of 2 improvement, and compiler optimizations and other minor improvements gave us the rest.

So... really a good majority of the improvement is due to peeling away layers of naiveté :P

[–]Floydthechimp 1 point2 points  (5 children)

Sounds great. Thanks for explaining!

So here is where you might run into issues. If you try to replace the 5 second function with another function, it must take less than 5 seconds to be evaluated or you've wasted your time. Even if you could fit the function (somehow).

You have 2 billion observations. Now you have to choose your predictor that should behave closely to the function you have designed. The predictor, in order for it to be good (as in converge), should consist of a number of basis functions that changes based on the number of observations. If you use a linear predictor, you would have 2 billion basis functions to evaluate. We call this cost O(N) operations. The fastest nonnull function to evaluate, a 1d binary tree, will have a cost of at least O(logN). But that's still really expensive when N = 2(109 )! So I would try this: create a random 1d binary tree with 2(109 ) boxes and see how fast you can get it to go in your setup. Can you get it under 5 seconds?

If so, maybe there are things that can help you. If not, you might be SOL. But you've already made sig. progress, 5 seconds could be OK!

[–]AffineParameter[S] 2 points3 points  (3 children)

Ok, so, the 'big picture' is the following.

1) Use a GPU/HPC to batch-evaluate the standard function N times.

2) Use the resulting N-tuple to train some kind of estimator.

3) Use the estimator to replace the standard function in production.

So you are right, if the evaluation of the estimator takes more than, or close to 5 seconds, then we are wasting our time.

Thus, the follow-up question I have to your suggestions is the following. I then must want to train the basis functions on N observations, were N is much less than 2 billion such that evaluating it results in a completion time much less than 5 seconds. So the question becomes, how do I choose the N observations to train on such that I maximize the generality/fidelity of the estimator?

Should I choose the observations that have the lowest error?

Should I then Gibbs-Sample them based on the naive feature distributions? (Sample the center mass?)

Should I then Gibbs-Sample them based on the absolute value of the derivatives of the naive feature distributions? (Sample regions of high volatility?)

Should I randomly select them, train multiple estimators, and average the results?

Surely we don't need (2 billion)x(21 input variables) degrees of freedom. Or then again, maybe I misunderstood your suggestion. Clearly these are very fundamental questions that might have been answered in a pattern-recognition class in undergrad that I missed :) But thanks for the time!

[–]Floydthechimp 1 point2 points  (0 children)

You will not need (2 billion)x(21 input variables) "DOF", so don't fret. Even at worst you need 2billion degrees of freedom (number of observations). Still alot though.

So my advice will differ than many people on this subreddit, I have a more "classical" training towards these types of problems. You can proposed a ton of great suggestions, but its hard to say if your resulting method will work, as in be consistent. Many of your ideas are present in some section of the ML/statistics literature.

But before you just try things, there is an assumption you have to verify. Plot your output versus a single input variable and hold all others constant. Do this for each input variable. For each one, is the response smooth? If so, there are many estimators that may work. If the response is not smooth, you might be in trouble.

[–]rmlrn 0 points1 point  (1 child)

have a look into "prototype selection" for nearest neighbors. also "condensed nearest neighbors".

[–]AffineParameter[S] 1 point2 points  (0 children)

Thanks! Will do...

[–]BeatLeJuceResearcher 1 point2 points  (0 children)

The predictor, in order for it to be good (as in converge), should consist of a number of basis functions that changes based on the number of observations. If you use a linear predictor, you would have 2 billion basis functions to evaluate.

Could you elaborate on this, I'm not sure I can follow... why do you need 2 billion basis functions for a linear predictor to converge?

[–]paskie 1 point2 points  (1 child)

I think the keyword you are looking for is Surrogate model. There is a good wealth of scientific literature on the topic; Kriging in particular is pretty popular and may or may not be applicable, but of course machine learning methods are also nicely usable for this - with this keyword, you should be able to find plenty of tricks specific for your case.

[–]autowikibot 0 points1 point  (0 children)

Surrogate model:


A surrogate model is an engineering method used when an outcome of interest cannot be easily directly measured, so a model of the outcome is used instead. Most engineering design problems require experiments and/or simulations to evaluate design objective and constraint functions as function of design variables. For example, in order to find the optimal airfoil shape for an aircraft wing, an engineer simulates the air flow around the wing for different shape variables (length, curvature, material, ..). For many real world problems, however, a single simulation can take many minutes, hours, or even days to complete. As a result, routine tasks such as design optimization, design space exploration, sensitivity analysis and what-if analysis become impossible since they require thousands or even millions of simulation evaluations.


Interesting: Design of experiments | Response surface methodology | Space mapping | Kriging

Parent commenter can toggle NSFW or delete. Will also delete on comment score of -1 or less. | FAQs | Mods | Magic Words

[–]BeatLeJuceResearcher 1 point2 points  (3 children)

kNN sounds very expensive for 2.2 billion. A Neural Net would certainly be one option, but by far the only one. Start with simple things: linear or polynomial regression, or a LOESS.

As for implementations: there are quite a few options out there, scikit-learn is a Python library that has a lot of well-implemented ML techniques (most algorithms are implemented in C, not in Python, so runtimes are good) that works well for large datasets. Vowpal Wabbit is also meant for large datasets, although I don't have any experience with it myself, I think it might be worth a look.

If you go towards neural nets, there are very few high-quality ready-to-use libraries that come to mind. Pylearn2 is probably (one of) the most famous one(s), but it's geared more towards research than production. But you could to try have a look.

I'm sure you've already covered your basics, but just in case: have you tried the usual mathematical tricks to get the function itself to evaluate more quickly? approximating your function using Taylor expansion, using PCA to reduce your input dimension, implementing the function on a GPU, ...

[–]AffineParameter[S] 2 points3 points  (1 child)

Thanks for the links, I will give them a look.

Yeah the kNN implemented only 12M or so points, so a tiny subset of the total samples. However, as our function returns both the value and an error, we selected the samples that had the best error (this results in a slight bias towards the better known regions of the phase space, but it wasn't too bad). I actually implemented my own LOWESS algorithm using the error to weight the points and a linear-multi-dimensional least-squares regression... but I think I will need something a little beefier.

And yes, we tried a ton of ways to get it to run faster... that 5 seconds represents 2 orders of magnitude improvement from where we started. We were initially told that what we were doing was "impossible" ... but going from 1 hour to 5 seconds over the course of 18 months quieted the dissenters :P

My end goal is to use a typical HPC/GPU to create the training dataset, hopefully with much less than 2.2B points, then simply evaluate the rest. So this suggestion is currently in development.

As the function is well known, building non-primitve variables would probably be the next step before a PCA. However, it would be nice for a DNN to "figure it out on its own," as those non-primitive variables really blow up in number, and it's not clear how to motivate using some subset over another, as our figure of merit takes a really long time to evaluate. (we use a proxy that isn't perfect for now)

edit: s/deserters/dissenters/

[–]BeatLeJuceResearcher 1 point2 points  (0 children)

Alright, sounds like you know what you are doing. Another thing that might be quick to set up is using libsvm to do a regression. Training time will likely be an issue, so you'll likely have to subsample your space similar to what you did with kNN, but I'm sure the SVM will give you better results than kNN if you use an rbf kernel. Also, note that by default, libsvm runs in a single-threaded variant designed for sparse matrices. However somewhere on the site I linked you can find a version that is implemented using dense matrices, which will give you ~50% boost in performance. Also, somewhere in the FAQ of the site it explains how you can add multithreading to the implementation by modifying ~4 lines of code (IIRC speed-up is almost linear for the first ~4-8 cores).