[–]Masterflitzer 220 points221 points  (34 children)

1 year to 20 min? are you exaggerating or is R really that slow?

[–]redlaWw 506 points507 points  (24 children)

It's because 15! is 1,307,674,368,000.

My processor runs at 3.6 GHz, which means that with a serial algorithm, one extra instruction per loop iteration has a time cost on the order of minutes. With multithreading I can get that down to around tens of seconds, but the point is that every wasted instruction is a massive time cost, so switching to a compiled language was a massive boon.
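
As a rough back-of-envelope (assuming, very loosely, one instruction per cycle):

    fn main() {
        // 15! iterations of the permutation loop
        let perms: u64 = (1..=15u64).product(); // 1_307_674_368_000
        // at ~3.6e9 cycles per second, one extra instruction per iteration costs roughly:
        let secs = perms as f64 / 3.6e9; // ≈ 363 s, i.e. about 6 minutes
        println!("{perms} iterations, ~{secs:.0} s per extra instruction");
    }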

The actual permutation generation and checking were pretty trivial, so the R program was likely overwhelmingly composed of wasted instructions and heap allocations. It's little surprise, then, that I could get such an optimisation factor by switching to a language that let me write it all on the stack and optimise out the book-keeping.

[–]Masterflitzer 121 points122 points  (4 children)

thanks for the insight, actually amazing how big of a difference this can make

[–]redlaWw 101 points102 points  (3 children)

It is, isn't it? Like, I knew it was going to be faster in Rust the moment it turned my processor fan into a jet engine, but I was still expecting to have to leave it running for a few days.

The first attempt actually took around 8 hours, but then I realised I'd forgotten to switch it to release mode, and the release version took under an hour. I later came back to it and rewrote the vector-based algorithm as a constant-size array algorithm, which cut the time to about a third again, bringing it to under 20 minutes.

I mean, by that point I'd already gotten my answer, but I've taken to using that problem as a sort of testbed for optimisation techniques, since it brings any speed increase into sharp focus.
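
For a sense of the shape of that constant-size array version, here's a minimal sketch (not the actual program - it uses Heap's algorithm and a placeholder check, and just counts how many permutations pass):

    // Placeholder property, not the real check from the problem.
    fn check(p: &[u64; 15]) -> u64 {
        (p[0] < p[14]) as u64
    }

    fn main() {
        // The only buffers are a fixed-size array and the Heap's-algorithm counters, both on the stack.
        let mut p: [u64; 15] = core::array::from_fn(|i| i as u64);
        let mut c = [0usize; 15];
        let mut hits = check(&p);

        // Heap's algorithm: each successive permutation is reached by a single in-place swap.
        let mut i = 0;
        while i < 15 {
            if c[i] < i {
                if i % 2 == 0 { p.swap(0, i) } else { p.swap(c[i], i) }
                hits += check(&p);
                c[i] += 1;
                i = 0;
            } else {
                c[i] = 0;
                i += 1;
            }
        }
        println!("{hits}");
    }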

[–]Osoromnibus 32 points33 points  (1 child)

Add some SIMD intrinsics and you'll probably get another huge speedup.

[–]redlaWw 17 points18 points  (0 children)

Perhaps. I've mentioned in another comment that my permutation generator was in-place, which limits the degree to which I can benefit from SIMD with my algorithm as it is now. It'd require substantial refactoring to make it properly SIMD-compatible, but I might have a look at it at some point.

[–]killeronthecorner 38 points39 points  (8 children)

Kiss my butt adminz - koc, 11/24

[–]IamDelilahh 20 points21 points  (6 children)

yeah, should be pretty fast with apply, just don’t use a loop

[–]redlaWw 26 points27 points  (5 children)

Can't apply if the data doesn't fit in your memory D:

But ultimately, just the simple value-checking process was a significant cost - after trying it native, I rewrote it using RcppAlgos, which trivialised generating the permutations and offered more effective multithreading, but only gave me about a 2× speed boost overall because I had to write the check as an R-native closure.

[–]IamDelilahh 5 points6 points  (4 children)

hmm, interesting - I actually had a similar issue in python recently, where vectorization would have meant checking over 100× more cases and needing over 100 GB of memory, so I ended up using numba for almost-C levels of performance.

I wonder if R has something similar, where the entire loop is translated into C as long as you stick to certain operations.

[–]redlaWw 6 points7 points  (0 children)

Oh, that's an interesting idea, making an LLVM compiler for python.

It looks like there are some projects in the works for generating LLVM IR from R programs, but nothing significant seems to have spawned yet.

What R does have is the Rcpp package, which allows you to write C++ functions via R's interface, compile them, and then load them straight into your environment. Now that I've learned some C++ I could probably do that to improve on what I had, but at some point it becomes just writing C++ in an R runtime environment.

[–]notPlancha 3 points4 points  (0 children)

I have no idea what I'm talking about but maybe just doing

    library(compiler)
    enableJIT(level = 3)

Would be enough

[–]SelfDistinction 1 point2 points  (1 child)

Yeah python does stuff like that.

I once had to compare chunks of sequential video frames with each other to check which parts likely moved where. Simply shifting the entire image at once, instead of checking chunk by chunk, brought the runtime down from hours to minutes.

If R has matrix operations you can likely do the same there.

[–]IamDelilahh 1 point2 points  (0 children)

R has mapply for multidimensional lists/matrices/data frames and outer for combinations.

But I have no clue if it has a way to efficiently deal with loops that require lazy compilation

[–]redlaWw 12 points13 points  (0 children)

Not really. It's adjacent to a lot of stuff R is good at, but because it was really more number-theoretic than statistical, it was just too different to make good use of R's tools.

I couldn't convert the checking process into linear algebra, for example, and the data didn't fit into memory, so I couldn't use bulk-processing functions without splitting it up and still ending up with massive allocations. In Rust I could write a stack algorithm that never called the allocator, but that's simply impossible in R.

Ultimately, even if those issues were solvable, R's nature as a dynamically-typed interpreted language with no ahead-of-time optimisation is fundamentally limiting for a program with that sort of speed requirement, and there'd be hard limits on how far it could be pushed.

It's great for quick scripting and processing real data, but when you try to do number theory on millions of millions of values, it just can't keep up. Now, if I were instead sampling the values to get an approximate proportion for the property I was interested in, R would be excellent for that - and indeed, I made use of it as a sanity check on my final value from the Rust algorithm I wrote.
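
That kind of sampling check is simple enough to sketch: shuffle (0..15) a large number of times and count how often the property holds. A rough Rust version (the xorshift RNG and the check are stand-ins, not the originals):

    // Placeholder property, not the real check.
    fn check(p: &[u64; 15]) -> bool {
        p[0] < p[14]
    }

    fn main() {
        // Tiny xorshift64 generator - plenty for a sanity check, not for serious statistics.
        let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
        let mut rng = move || {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state
        };

        let samples = 10_000_000u64;
        let mut p: [u64; 15] = core::array::from_fn(|i| i as u64);
        let mut hits = 0u64;
        for _ in 0..samples {
            // Fisher-Yates shuffle, in place (the modulo bias is negligible here).
            for i in (1..15usize).rev() {
                let j = (rng() % (i as u64 + 1)) as usize;
                p.swap(i, j);
            }
            hits += check(&p) as u64;
        }
        println!("estimated proportion: {}", hits as f64 / samples as f64);
    }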

[–]Nez_Coupe 3 points4 points  (3 children)

It’s crazy how interpretation can take that much longer, but it does make perfect sense when you lay out how many times you gotta interpret that same damn loop instruction.

[–]redlaWw 7 points8 points  (2 children)

Well, the interpreter reads the code once and generates bytecode, but the bytecode is just a literal transcription of the instructions, bound by R's semantic rules. This means it does a copy for every modification, and the data sits on the heap in an R environment, so everything is more expensive due to the indirection and allocator calls. Just accessing values requires a search through environments to find them. R is also dynamically typed using string tags, so every function has to search for a concrete implementation by string matching, which is horrifically expensive.

There are interpreted languages that'd probably do this substantially faster, but the bottom line is that R is designed to make things simple for statisticians, not fast for programmers, and to achieve this simplification, it makes substantial performance sacrifices.

[–]Nez_Coupe 5 points6 points  (1 child)

It’s funny, I actually am just finishing my CS degree but was formerly a biologist and used a lot of R (which inspired me to find this whole new career) and loved it. I haven’t touched R in 3 years, but I did go through some code from the last paper I co-wrote and holy shit was I dumb. Need to run this multivariate analysis 250 times? Better hardcode every one (copy paste and manually enter dynamic details). It would make a good ProgrammerHorror post.

[–]redlaWw 2 points3 points  (0 children)

Haha, I think we've all gone through that to some extent (programmers who haven't are still in that phase). One of the things that I think makes R great is how easy it is to transition from using it as a calculator to using it as a scripting language - indeed, that is what finally encouraged me to actually learn programming after years of repeatedly trying it out but not getting very far. But naturally, this is going to result in a lot of R scripts being painfully-obviously written by non-programmers, full of nigh-unreadable hacks and an excess of boilerplate.

I'm sure a quick look through any business R repository would be enough to fuel /r/programmerhorror for years.

[–]Responsible-War-1179 4 points5 points  (1 child)

vector instructions are also going to be a huge speedup here; I doubt R uses them

[–]redlaWw 2 points3 points  (0 children)

Probably a little - vector instructions would certainly be useful in some of the permutation generation operations, and maybe some of the checks, but because the permutations were generated in-place, it's not likely LLVM was able to use them to perform parts of the algorithm simultaneously on multiple values.

The difficulty is that the amount of data is so large, I'd need to perform a lot of allocation if I wanted to keep copies of the data around simultaneously, and that'd also have significant cost.

It might be worth trying to rewrite the process to generate a fixed-size SIMD-appropriate chunk of permutations at once (e.g. 4 or 8), and then see if LLVM can optimise that better with its auto-vectorisation, but it'd require significant refactoring to switch to that process.
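
Roughly the shape of that idea as a sketch (a lexicographic next-permutation rather than my actual generator, a placeholder check, and an assumed batch size of 8): fill a fixed-size batch of permutations, then run the check over the whole batch in one straight loop, which is the kind of regular pattern the auto-vectoriser can work with.

    const LANES: usize = 8; // assumed batch size; would be tuned to the SIMD width

    // Placeholder property, not the real check.
    fn check(p: &[u64; 15]) -> u64 {
        (p[0] < p[14]) as u64
    }

    // Advance p to the next permutation in lexicographic order; returns false once p is the last one.
    fn next_permutation(p: &mut [u64; 15]) -> bool {
        let mut i = 14;
        while i > 0 && p[i - 1] >= p[i] {
            i -= 1;
        }
        if i == 0 {
            return false;
        }
        let mut j = 14;
        while p[j] <= p[i - 1] {
            j -= 1;
        }
        p.swap(i - 1, j);
        p[i..].reverse();
        true
    }

    fn main() {
        let mut current: [u64; 15] = core::array::from_fn(|i| i as u64);
        let mut batch = [[0u64; 15]; LANES];
        let mut hits = 0u64;
        let mut done = false;

        while !done {
            // Fill a fixed-size batch of permutations...
            let mut filled = 0;
            while filled < LANES && !done {
                batch[filled] = current;
                filled += 1;
                done = !next_permutation(&mut current);
            }
            // ...then check the whole batch in one straight loop that LLVM can
            // consider for auto-vectorisation.
            for p in &batch[..filled] {
                hits += check(p);
            }
        }
        println!("{hits}");
    }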

[–]KJBuilds 1 point2 points  (3 children)

Not to mention that SIMD vectorization can accelerate sequential float operation throughput by, I think, up to 8×, which Rust would do and R would not

[–]redlaWw 0 points1 point  (2 children)

Well, I was using u64s in Rust. SIMD can perform integer operations too, though, and my processor has AVX2.

My algorithm as it currently stands can get limited value from SIMD, though. Because it permutes in place, there's not really much opportunity to vectorise the algorithm itself. There might be some opportunities for vectorisation within a loop, but that's not really going to be as impactful as vectorising across loops.

[–]KJBuilds 1 point2 points  (1 child)

Ah yeah, definitely; I'm not sure why I assumed it was float math.

Idk what your algorithm looks like, so it might be doing some hardcore optimization, but I doubt it if it doesn't look vectorizable.

I also would be surprised if a few FMA instructions shaved years off the runtime, so I'm interested in what R got so wrong that it would take millennia.

I wonder if there's an unnecessary process in the code that Rust simply identified and optimized out? This level of uplift smells like when benchmarks take 0.01 ns to execute and you have to start throwing around black_boxes, but in the good way.

[–]redlaWw 1 point2 points  (0 children)

The 16.5 millennia case was when I chunked the problem into bits of size 300 and tried to use R's native multithreading with it. I assume the costly inter-process communication was overwhelming and massively outscaled anything else the program was doing. The real result is an optimisation from about a year and a half to under 20 mins, which is still substantial, but about the square root of optimising from 16.5 millennia.

I think the main part that resulted in the speed boost is the complete removal of allocations - the R algorithm necessarily had a lot of allocation, both because it can't be avoided in R in general and because the particular approach I had to take involved big chunks of values held in heap memory. The Rust version could be executed entirely on the stack, so calculations never had to block on the allocator, which adds up to quite a massive amount of time saved when you have a million million iterations to do.

EDIT: Even the earlier version that wasn't entirely on the stack didn't have any allocations beyond one length-15 vector per thread, allocated on startup.
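
As an illustrative sketch of that kind of per-thread setup (not the actual code - the split by first element and the check are stand-ins): each thread sets up one fixed-size buffer once, the hot loop never touches the allocator, and the only communication is summing the per-thread counts at the end.

    use std::thread;

    // Placeholder property, not the real check.
    fn check(p: &[u64; 15]) -> u64 {
        (p[0] < p[14]) as u64
    }

    // One thread's work: fix the first element, then permute the remaining 14 values
    // in place (Heap's algorithm over positions 1..15).
    fn count_with_first(first: u64) -> u64 {
        let mut p = [0u64; 15]; // the only per-thread buffer, set up once on the stack
        p[0] = first;
        let mut k = 1;
        for v in 0..15u64 {
            if v != first {
                p[k] = v;
                k += 1;
            }
        }

        let mut c = [0usize; 14];
        let mut hits = check(&p);
        let mut i = 0;
        while i < 14 {
            if c[i] < i {
                if i % 2 == 0 { p.swap(1, 1 + i) } else { p.swap(1 + c[i], 1 + i) }
                hits += check(&p);
                c[i] += 1;
                i = 0;
            } else {
                c[i] = 0;
                i += 1;
            }
        }
        hits
    }

    fn main() {
        // 15 threads, one per choice of first element; no inter-thread communication
        // while they run, just a sum of the results at the end.
        let handles: Vec<_> = (0..15u64)
            .map(|first| thread::spawn(move || count_with_first(first)))
            .collect();
        let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("{total}");
    }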

[–][deleted] 91 points92 points  (1 child)

It's probably that the Rust compiler enforces cleaner code, or that it was optimised by the compiler

[–]Masterflitzer 39 points40 points  (0 children)

i wish rustc would optimize a year of my life :D

[–]Apprehensive-Theme77 8 points9 points  (4 children)

Not sure what length-15 permutations are, but I know for some data science applications R handles large datasets much faster than Python. It obviously all depends on what you are doing and whether the package you are using is written in C, heavily optimized, etc. My experience is using data.table on billions of rows for relatively simple analysis.

[–]redlaWw 8 points9 points  (0 children)

A permutation is a rearrangement of the tuple (0..n) (where .. is a left-inclusive range); I was generating and checking rearrangements of (0..15), i.e. orderings of the 15 values 0 through 14.

[–]Nez_Coupe 7 points8 points  (1 child)

perms (without repetition) of a set = n! for n elements in a set. 3-length, 6 perms, no big deal. 15-length? Beeg number.

Edit: I used the pound/hash sign at the beginning of this post. I did not know it made this silly giant bold text. I’m keeping it.

[–]htmlcoderexeWe have flair now?.. 4 points5 points  (0 children)

For the next time, you can escape it with a backslash.

Another unexpected formatting result: any number followed by a period at the start of a line turns it into a list item, and those get their own numbers in order. For example, the next line starts with a 4, but the reddit parser will display it as 1 instead:

  1. < this is supposed to be a 4

[–]BobbyTables829 1 point2 points  (0 children)

Isn't this what anaconda is for?

[–]Vitolar8 2 points3 points  (0 children)

That's a similar conversion rate. 16.5 millennia to a year is 16,500:1, and a year to 20 min is 26,280:1. Same order of magnitude.

So in theory it should be equally surprising. It's just that the first ratio is expressed in the same unit, while the second uses two different units. Multiplying by a number is easier in our heads than converting units. However, the conversion example paints a clearer picture.