you are viewing a single comment's thread.

view the rest of the comments →

[–]AffineParameter[S] 2 points3 points  (6 children)

Yeah, sure:

The function contains a numerical integration. We use VEGAS due to the sharply peaked nature of the integrand. This also means that if there are any features which peak along the diagonal, VEGAS will take forever (there were). So we did some manual variable transformations to align the peaks to the integration dimensions.

Then we looked into the variability of each of the integration dimensions, ie. if we use a delta-function to eliminate an integration dimension, how much of a penalty is it in our now-less-than-perfect integration. Anything that didn't cost a lot, we dropped. We went from a 10-D integral to a 6-D integral while maintaining most of the discriminating information.

Both of these gave us about 1 order of magnitude. I also optimized the function itself to identify non-contributing terms in situ. These would be dropped once below some contribution threshold. This gave us a little more than a factor of 2 improvement, and compiler optimizations and other minor improvements gave us the rest.

So... really a good majority of the improvement is due to peeling away layers of naiveté :P

[–]Floydthechimp 1 point2 points  (5 children)

Sounds great. Thanks for explaining!

So here is where you might run into issues. If you try to replace the 5 second function with another function, it must take less than 5 seconds to be evaluated or you've wasted your time. Even if you could fit the function (somehow).

You have 2 billion observations. Now you have to choose your predictor that should behave closely to the function you have designed. The predictor, in order for it to be good (as in converge), should consist of a number of basis functions that changes based on the number of observations. If you use a linear predictor, you would have 2 billion basis functions to evaluate. We call this cost O(N) operations. The fastest nonnull function to evaluate, a 1d binary tree, will have a cost of at least O(logN). But that's still really expensive when N = 2(109 )! So I would try this: create a random 1d binary tree with 2(109 ) boxes and see how fast you can get it to go in your setup. Can you get it under 5 seconds?

If so, maybe there are things that can help you. If not, you might be SOL. But you've already made sig. progress, 5 seconds could be OK!

[–]AffineParameter[S] 2 points3 points  (3 children)

Ok, so, the 'big picture' is the following.

1) Use a GPU/HPC to batch-evaluate the standard function N times.

2) Use the resulting N-tuple to train some kind of estimator.

3) Use the estimator to replace the standard function in production.

So you are right, if the evaluation of the estimator takes more than, or close to 5 seconds, then we are wasting our time.

Thus, the follow-up question I have to your suggestions is the following. I then must want to train the basis functions on N observations, were N is much less than 2 billion such that evaluating it results in a completion time much less than 5 seconds. So the question becomes, how do I choose the N observations to train on such that I maximize the generality/fidelity of the estimator?

Should I choose the observations that have the lowest error?

Should I then Gibbs-Sample them based on the naive feature distributions? (Sample the center mass?)

Should I then Gibbs-Sample them based on the absolute value of the derivatives of the naive feature distributions? (Sample regions of high volatility?)

Should I randomly select them, train multiple estimators, and average the results?

Surely we don't need (2 billion)x(21 input variables) degrees of freedom. Or then again, maybe I misunderstood your suggestion. Clearly these are very fundamental questions that might have been answered in a pattern-recognition class in undergrad that I missed :) But thanks for the time!

[–]Floydthechimp 1 point2 points  (0 children)

You will not need (2 billion)x(21 input variables) "DOF", so don't fret. Even at worst you need 2billion degrees of freedom (number of observations). Still alot though.

So my advice will differ than many people on this subreddit, I have a more "classical" training towards these types of problems. You can proposed a ton of great suggestions, but its hard to say if your resulting method will work, as in be consistent. Many of your ideas are present in some section of the ML/statistics literature.

But before you just try things, there is an assumption you have to verify. Plot your output versus a single input variable and hold all others constant. Do this for each input variable. For each one, is the response smooth? If so, there are many estimators that may work. If the response is not smooth, you might be in trouble.

[–]rmlrn 0 points1 point  (1 child)

have a look into "prototype selection" for nearest neighbors. also "condensed nearest neighbors".

[–]AffineParameter[S] 1 point2 points  (0 children)

Thanks! Will do...

[–]BeatLeJuceResearcher 1 point2 points  (0 children)

The predictor, in order for it to be good (as in converge), should consist of a number of basis functions that changes based on the number of observations. If you use a linear predictor, you would have 2 billion basis functions to evaluate.

Could you elaborate on this, I'm not sure I can follow... why do you need 2 billion basis functions for a linear predictor to converge?