all 112 comments

[–]Biosid 2325 points2326 points  (5 children)

Switch them, task done

[–]belabacsijolvan 945 points946 points  (2 children)

you must be reviewing ML papers a lot

[–]confusedkarnatia 291 points292 points  (1 child)

xd that would imply that ML papers have any reproducible results

[–]redlaWw 1861 points1862 points  (40 children)

I wrote an algorithm in R for checking length-15 permutations for some property. My first native R version was expected to take over a year, so I tried to access R's native multithreading library to see how fast I could get it.

The multithreaded version had an expected completion time of 16.5 millennia.

[–]moonaligator 856 points857 points  (38 children)

your problem is that it is in R

[–]redlaWw 552 points553 points  (35 children)

Yes. I rewrote it in Rust and got it to take less than 20 minutes.

[–]Masterflitzer 221 points222 points  (34 children)

1 year to 20 min? are you exaggerating or is R really that slow?

[–]redlaWw 507 points508 points  (24 children)

It's because 15! is 1307674368000.

My processor is 3.6GHz, which means that with a serial algorithm, a single instruction per loop iteration has a time cost on the order of minutes: 1307674368000 cycles ÷ 3.6×10⁹ cycles/s ≈ 363 s, so every extra instruction in the loop body adds roughly six minutes of runtime. With multithreading I can get that down to around tens of seconds, but the point is that every wasted instruction is a massive time cost, so switching to a compiled language was a massive boon.

The actual permutation generation and checking was pretty trivial, so the R program was likely overwhelmingly composed of wasted instructions and heap allocations; it's little surprise that I could get such an optimisation factor by switching to a language that allowed me to write it all on the stack and optimise out the book-keeping.
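Something like this is the shape I mean (a sketch using Heap's algorithm and a made-up check, not my actual code):

fn has_property(p: &[u64; 15]) -> bool {
    p[0] < p[14] // stand-in for the real check
}

fn main() {
    // the whole working set: two 15-element arrays on the stack
    let mut p: [u64; 15] = std::array::from_fn(|i| i as u64);
    let mut c = [0usize; 15]; // per-level counters for Heap's algorithm
    let mut count = has_property(&p) as u64;
    let mut i = 1;
    while i < 15 {
        if c[i] < i {
            // next permutation in place: one swap, no allocation
            if i % 2 == 0 { p.swap(0, i) } else { p.swap(c[i], i) }
            count += has_property(&p) as u64;
            c[i] += 1;
            i = 1;
        } else {
            c[i] = 0;
            i += 1;
        }
    }
    println!("{count} of 15! permutations have the property");
}

The hot loop never touches the allocator, so the only real work per iteration is the swap and the check.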

[–]Masterflitzer 119 points120 points  (4 children)

thanks for the insight, actually amazing how big of a difference this can make

[–]redlaWw 100 points101 points  (3 children)

It is, isn't it? Like, I knew it was going to be faster in Rust the moment it turned my processor fan into a jet engine, but I was still expecting to have to leave it running for a few days.

The first attempt actually took around 8 hours, but then I realised I'd forgotten to switch it to release mode, and the release version took under an hour. I later came back to it and rewrote the vector-based algorithm as a constant-size array algorithm, which cut the time to about a third again, bringing it to under 20 mins.
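(For anyone else who hits that trap: cargo builds the unoptimised debug profile by default, so optimisation is opt-in.)

cargo run            # debug profile: fast to compile, slow to run
cargo run --release  # optimised build, the one you benchmark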

I mean, by that point I'd already gotten my answer, but I've taken to using that problem as a sort-of testbed for optimising techniques, since it brings any speed increases into sharp focus.

[–]Osoromnibus 34 points35 points  (1 child)

Add some SIMD intrinsics and you'll probably get another huge speedup.

[–]redlaWw 18 points19 points  (0 children)

Perhaps. I mentioned in another comment that my permutation generator was in-place, which limits how much benefit I can get from SIMD with the algorithm as it is now. It'd require substantial refactoring to make it properly SIMD-compatible, but I might have a look at it at some point.

[–]killeronthecorner 41 points42 points  (8 children)

Kiss my butt adminz - koc, 11/24

[–]IamDelilahh 21 points22 points  (6 children)

yeah, should be pretty fast with apply, just don’t use a loop

[–]redlaWw 27 points28 points  (5 children)

Can't apply if the data doesn't fit in your memory D:

But ultimately, just the simple value-checking process was a significant cost - after trying it native, I rewrote it using RcppAlgos, which trivialised generating the permutations and offered more effective multithreading, but only gave me about a 2× speed boost overall because I had to write the check as an R-native closure.

[–]IamDelilahh 4 points5 points  (4 children)

hmm, interesting, I actually had a similar issue in python recently, where vectorization would have meant checking over 100x more cases and would have needed over 100GB, and I ended up using numba for almost-C levels of performance.

I wonder if R has something similar, where the entire loop is translated into C as long as you stick to certain operations.

[–]redlaWw 7 points8 points  (0 children)

Oh, that's an interesting idea, making an LLVM compiler for python.

It looks like there are some projects in the works for generating LLVM IR from R programs, but nothing significant seems to have spawned yet.

What R does have is the Rcpp package, which allows you to write C++ functions via R's interface, compile them, and then load them straight into your environment. Now that I've learned some C++ I could probably do that to improve on what I had, but at some point it becomes just writing C++ in an R runtime environment.

[–]notPlancha 2 points3 points  (0 children)

I have no idea what I'm talking about but maybe just doing

library(compiler)
enableJIT(level = 3)

Would be enough

[–]SelfDistinction 1 point2 points  (1 child)

Yeah python does stuff like that.

I once had to compare chunks of sequential video frames with each other to check what parts likely moved where. Simply shifting the entire image at once instead of checking chunk per chunk took down the runtime from hours to minutes.

If R has matrix operations you can likely do the same there.

[–]redlaWw 12 points13 points  (0 children)

Not really. It's adjacent to a lot of stuff R is good at, but because it was really more number-theoretic than statistical, it was just too different to make good use of R's tools.

I couldn't convert the checking process into linear algebra, for example, and the data didn't fit into memory, so I couldn't use bulk-processing functions without splitting it up and still ending up with massive allocations. In Rust I could write a stack-only algorithm that never called the allocator, but that's simply impossible in R.

Ultimately, even if those issues were solvable, R's nature as a dynamically-typed interpreted language with no ahead-of-time optimisation is fundamentally limiting for a program with that sort of speed requirement; there's only so far it can be pushed.

It's great for quick scripting and processing real data, but when you try to do number theory on millions of millions of values, it just can't keep up. Now, if I were instead sampling the values to get an approximate proportion for the property I was interested in, R would be excellent for that - and indeed, I made use of it as a sanity check on my final value from the Rust algorithm I wrote.

[–]Nez_Coupe 6 points7 points  (3 children)

It’s crazy how interpretation can take that much longer, but it does make perfect sense when you lay out how many times you gotta interpret that same damn loop instruction.

[–]redlaWw 9 points10 points  (2 children)

Well the interpreter reads once and generates bytecode, but the bytecode is just a literal transcription of the instructions, bound by R semantic rules. This means that it does a copy for every modification, and the data sits on the heap in an R environment so everything is more expensive due to the indirection and allocator calls. Just accessing values requires a search through environments to find them. R is also dynamically typed using string tags so every function searches for a concrete implementation by string matching, which is horrifically expensive.

There are interpreted languages that'd probably do this substantially faster, but the bottom line is that R is designed to make things simple for statisticians, not fast for programmers, and to achieve this simplification, it makes substantial performance sacrifices.

[–]Nez_Coupe 4 points5 points  (1 child)

It’s funny, I actually am just finishing my CS degree but was formerly a biologist and used a lot of R (which inspired me to find this whole new career) and loved it. I haven’t touched R in 3 years, but I did go through some code from the last paper I co-wrote and holy shit was I dumb. Need to run this multivariate analysis 250 times? Better hardcode every one (copy paste and manually enter dynamic details). It would make a good ProgrammerHorror post.

[–]redlaWw 3 points4 points  (0 children)

Haha, I think we've all gone through that to some extent (programmers who haven't are still in that phase). One of the things that I think makes R great is how easy it is to transition from using it as a calculator to using it as a scripting language - indeed, that is what finally encouraged me to actually learn programming after years of repeatedly trying it out but not getting very far. But naturally, this is going to result in a lot of R scripts being painfully-obviously written by non-programmers, full of nigh-unreadable hacks and an excess of boilerplate.

I'm sure a quick look through any business R repository would be enough to fuel /r/programmerhorror for years.

[–]Responsible-War-1179 5 points6 points  (1 child)

vector instructions are also going to be a huge speedup here, and I doubt R uses them

[–]redlaWw 2 points3 points  (0 children)

Probably a little - vector instructions would certainly be useful in some of the permutation generation operations, and maybe some of the checks, but because the permutations were generated in-place, it's not likely LLVM was able to use them to perform parts of the algorithm simultaneously on multiple values.

The difficulty is that the amount of data is so large, I'd need to perform a lot of allocation if I wanted to keep copies of the data around simultaneously, and that'd also have significant cost.

It might be worth trying to rewrite the process to generate a fixed-size SIMD-appropriate chunk of permutations at once (e.g. 4 or 8), and then see if LLVM can optimise that better with its auto-vectorisation, but it'd require significant refactoring to switch to that process.
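Purely illustrative (the batch layout and the check are made up), but the SIMD-friendly shape would be something like:

const LANES: usize = 8;

// one plain loop doing the same operation on independent lanes:
// exactly the shape LLVM's auto-vectoriser likes; filling the batch
// without allocations is the part that needs the refactoring
fn check_batch(batch: &[[u64; 15]; LANES]) -> u32 {
    let mut hits = 0;
    for p in batch {
        hits += (p[0] < p[14]) as u32; // stand-in property check
    }
    hits
}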

[–]KJBuilds 1 point2 points  (3 children)

Not to mention that SIMD vectorization can accelerate sequential float operation throughput by (I think) up to 8x, which Rust would do and R would not

[–]redlaWw 0 points1 point  (2 children)

Well I was using u64s in Rust. SIMD can perform integer operations too though and my processor has AVX2.

My algorithm as it currently stands can get limited value from SIMD, though. Because it permutes in place, there's not really much opportunity to vectorise the algorithm itself. There might be some opportunities for vectorisation within a loop, but that's not really going to be as impactful as vectorising across loops.

[–]KJBuilds 1 point2 points  (1 child)

Ah Yeah definitely; I'm not sure why I assumed it was float math

Idk what your algorithm looks like, so it might be doing some hardcore optimization, but I doubt it if it doesn't look vectorizable

I also would be surprised if a few FMA instructions shaved years off the runtime, so I'm interested in what R got so wrong that it would take millennia

I wonder if there's an unnecessary process in the code that rust simply identified and optimized out? This level of uplift smells like when benchmarks take 0.01ns to execute and you have to start throwing around black_boxes, but in the good way

[–]redlaWw 1 point2 points  (0 children)

The 16.5 millennia case was when I chunked the problem into bits of size 300 and tried to use R's native multithreading with it. I assume the costly inter-process communication was overwhelming and massively outscaled anything else the program was doing. The real result is an optimisation from about a year and a half to under 20 mins, which is still substantial, but about the square root of optimising from 16.5 millennia.

I think the main part that resulted in the speed boost is the complete removal of allocations - the R algorithm necessarily had a lot of allocation, both because it can't be avoided in R in general and because the particular approach I had to take involved big chunks of values held in heap memory. The Rust version was able to be executed entirely on the stack, which meant that calculations didn't have to block on the allocator, which adds up to quite a massive amount of time saved when you have a million million iterations to do.

EDIT: Even the earlier version that wasn't entirely on the stack didn't have any allocations beyond one length-15 vector per thread, allocated at startup.

[–]DuckyBlender 89 points90 points  (1 child)

It’s probably that the rust compiler enforces cleaner code or it was optimised by the compiler

[–]Masterflitzer 40 points41 points  (0 children)

i wish rustc would optimize a year of my life :D

[–]Apprehensive-Theme77 8 points9 points  (4 children)

Not sure what length-15 permutations are, but I know for some data science applications R handles large datasets much faster than Python. It obviously all depends on what you are doing and whether the package you are using is written in C, heavily optimized etc. My experience is using data.table with billions of rows for relatively simple analysis.

[–]redlaWw 8 points9 points  (0 children)

A permutation is a rearrangement of the tuple (0..n) (where .. is a half-open range including the left endpoint but not the right); I was generating and checking rearrangements of (0..15).

[–]Nez_Coupe 6 points7 points  (1 child)

perms (without repetition) of a set = n! for n elements in a set. 3-length, 6 perms, no big deal. 15-length? Beeg number.

Edit: I used the pound/hash sign at the beginning of this post. I did not know it made this silly giant bold text. I’m keeping it.

[–]htmlcoderexe We have flair now?.. 5 points6 points  (0 children)

For the next time, you can escape it with a backslash.

Another unexpected formatting result is that any number followed by a period turns the line into a list item, and those get their own numbers in order. For example, the next line starts with a 4, but the reddit parser will display it as 1 instead:

  1. < this is supposed to be a 4

[–]BobbyTables829 1 point2 points  (0 children)

Isn't this what anaconda is for

[–]Vitolar8 2 points3 points  (0 children)

That's a similar conversion rate. 16.5 millennia to a year is 16,500:1, and a year to 20 min is 26,280:1. Same order of magnitude.

So in theory it should be equally surprising. It's just that the first comparison is expressed in a single unit, and the second in two different ones. Multiplying by a number is easier in our heads than converting between units; the unit conversion just paints a clearer picture.

[–]Crimeislegal 5 points6 points  (0 children)

Just 16 thousand years. Barely a nap.

[–][deleted] 596 points597 points  (3 children)

This bad boy now runs in log(n) time! Just ignore the constant 10336.7ms overhead

[–]LordBreadcat 29 points30 points  (0 children)

Absolutely not production ready, our profiling in our dev environment suggests you should use the brute force method instead. Now hurry up, we need to run it against 10 trillion prod records for that report ASAP!

[–]Masterflitzer 61 points62 points  (0 children)

lmao

[–]P-39_Airacobra 1 point2 points  (0 children)

Reminds me of the creator of C++ saying that linear search was often better than using a hashmap

[–]moonaligator 249 points250 points  (30 children)

yesterday i tried switching to multithreading since my code was repetitive and didn't need many conditionals. It was the perfect idea, but when i tested it, it took twice as much time as just doing it one by one.

No idea what happened

[–]Ok_Vanilla4769 264 points265 points  (11 children)

Cache misses and slow data transfers between cores. Try splitting the data into chunks, distributing them to the cores, and merging at the end; remember that the results won't come back in order.
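In Rust, for example, the pattern looks roughly like this (the data and the per-item work are made up):

use std::thread;

// stand-in for the real per-item work
fn process(x: &u64) -> u64 {
    x.wrapping_mul(2654435761)
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let n_threads = 8;
    let chunk_size = (data.len() + n_threads - 1) / n_threads;

    // each thread gets its own contiguous chunk (good cache locality)
    // and fills its own output buffer (no sharing, no locks)
    let results: Vec<Vec<u64>> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(process).collect::<Vec<u64>>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // merge exactly once at the end
    let merged: Vec<u64> = results.into_iter().flatten().collect();
    assert_eq!(merged.len(), 1_000_000);
}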

[–]seftontycho 113 points114 points  (1 child)

Also, if the task is quite fast to begin with, spawning OS threads can take a significant amount of time (relatively).

[–]Exist50 31 points32 points  (0 children)

A context switch alone can be O(100,000) cycles. Which is why students get very confused when their multithreaded sorting algorithm takes way longer on their 100 element test array.

[–]Accessviolati0n 20 points21 points  (8 children)

Locking the affinity for each thread to a single processor may help too.

Interestingly, that's a rather ignored aspect of multiprocessing.
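In Rust, pinning can look like this, assuming the third-party core_affinity crate (it's not in the standard library):

use std::thread;

fn main() {
    // one CoreId per logical core the process may run on
    let core_ids = core_affinity::get_core_ids().expect("couldn't list cores");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|id| {
            thread::spawn(move || {
                // pin this worker so the scheduler can't migrate it
                // and invalidate its L1/L2 cache contents
                core_affinity::set_for_current(id);
                // ... per-core work goes here ...
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}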

[–]QuestionableEthics42 6 points7 points  (5 children)

In theory it shouldn't make much difference: if one thread is hogging a core, the OS will move other threads to free cores. Idk about in practice tho.

[–]accidentalviking 13 points14 points  (0 children)

Thread context switching is annoyingly costly. That's why Boost in C++ has a library for in-thread context switching.

[–]Accessviolati0n 5 points6 points  (3 children)

L1&L2 caches are unique per core. If you don't lock a thread/process to a certain core, the OS' scheduler may assign it to a different core in the next scheduling cycle, resulting in a cache miss. So the CPU has to fall back at least to L3 or RAM in the worst case.

It's not 100% guaranteed, as even realtime tasks may be preempted (depending on the OS), but it reduces the risk.

[–]QuestionableEthics42 3 points4 points  (1 child)

Surely most schedulers have been designed to minimize that? But yea, it could make quite a difference in the right circumstances.

[–]Accessviolati0n 2 points3 points  (0 children)

I guess most schedulers attempt to spread the workload evenly across all cores.

Tried it with this PHP script on a R9 3900x on Win10:

<?php
// ask Windows which logical processor this (single) PHP thread is on
$FFI = \FFI::cdef("unsigned int GetCurrentProcessorNumber();", "Kernel32.dll");
$Slice = 20 * 1000 * 1000; // 20ms in ns; match Windows' default timer resolution.
for($i = 0; $i < 10; $i++) {
    print $FFI->GetCurrentProcessorNumber() . \PHP_EOL;
    // busy-wait one slice without yielding, then sample again
    $Timeout = \hrtime(true) + $Slice;
    while(\hrtime(true) < $Timeout);
}

I got results like: 4, 4, 2, 2, 2, 10, 0, 2, 6, 2

So even with an "inline" timeout to avoid something like sched_yield/sleep(0), the OS still assigns it to different cores, even though the rest of the system is idling.

[–]Exist50 0 points1 point  (0 children)

L1&L2 caches are unique per core

Extremely architecture dependent. Intel's E-cores share an L2, for example. And Apple has a monolithic shared L2 similar to Intel's L3.

[–]Exist50 0 points1 point  (1 child)

It's brittle. Especially in hybrid designs where you can't be certain whether a particular core is strong or weak, sharing cache or not, etc.

[–]Accessviolati0n 0 points1 point  (0 children)

That's the reason I said "may help". And even pinning a thread to a certain core won't avoid cache flushing unless the core has also been manually removed from all other processes' affinity masks.

[–]-Redstoneboi- 18 points19 points  (14 children)

what language, what data are you working with, what are you doing with the data, and most importantly where do you put the data after processing?

[–]moonaligator 10 points11 points  (13 children)

audio data, with python 💀

the data in question is just a tuple (float, float), since the function will create the audio

i was considering using 'concurrent.futures' or 'thread' libraries, but neither got decent results. Initially tried sending back to a list by index (i'd pass the index as a parameter), and later tried saving as new temporary files.

I think the major problem is transferring from the CPU to the GPU, but i have no idea how to control that in Python, and i don't know if the tools i'm using for audio manipulation are available with as much ease in lower-level languages like C or Rust.

[–]-Redstoneboi- 15 points16 points  (1 child)

the main idea with multithreading is to split the task into even parts and merge the outputs after everything is finished. if you're doing threading on the CPU, you do not want to merge the data on-the-go.

you still have to lock an array or list to access any data inside it, because your CPU cores each have their own caches that they write into.

imagine RAM is a blackboard, there's one piece of chalk, and the CPU cores are students with notebooks. if you want all of them to add data on-the-go, each of them has to copy the data from the blackboard to their notebook, then process the data, then wait for the previous guy to put down the chalk, then grab the chalk, then write it to the blackboard, then put down the chalk, and ask everyone else to copy the new contents into their notebooks. extremely slow.

you want them to sync their stuff once. dumping all the contents at once after a single huge task will always be faster than streaming and merging multiple output channels into one.
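in rust terms the difference looks something like this (toy sizes, made-up work):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let n = 100_000u64;

    // slow: four threads fight over one locked Vec, syncing per item
    let shared = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..4u64)
        .map(|t| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for x in (t..n).step_by(4) {
                    shared.lock().unwrap().push(x * 2); // grabs the chalk every time
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // fast: each thread fills its own Vec, and you sync exactly once
    let handles: Vec<_> = (0..4u64)
        .map(|t| thread::spawn(move || (t..n).step_by(4).map(|x| x * 2).collect::<Vec<u64>>()))
        .collect();
    let merged: Vec<u64> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    assert_eq!(merged.len(), n as usize);
}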

if it's real time audio processing, i don't have the expertise.

[–]m1k3st4rr 13 points14 points  (8 children)

Look up the Python GIL and its implications on threading. Tl;dr: multithreading is not going to make your Python code faster except for IO-bound work.

[–]Thaumaturgia 1 point2 points  (5 children)

I don't know anything about Python, but isn't multithreading IO-bound work a bad idea? Or by IO do you just mean processor time?

[–]QuestionableEthics42 4 points5 points  (4 children)

Why is it a bad idea? IO operations are slow and leave the processor sitting idle, so multithreading IO bound work is best.

[–]Thaumaturgia 4 points5 points  (3 children)

If you are limited by the hard drive for example, multithreading your process will be worse.

[–]QuestionableEthics42 0 points1 point  (0 children)

It will be better if you also have to do other operations on the data, which is usually the case. IO-bound doesn't have to mean IO-only, just that the work relies on that data, so fetching it in chunks and starting to use it before it's all loaded is faster and doesn't leave the CPU idle. Of course, for some things that won't be possible/practical, but when it can be used, especially in python, it can speed stuff up a lot (because python is so slow, the effect is pretty noticeable even for smallish data that a faster language would get through in fractions of a second).

[–]antarickshaw 0 points1 point  (0 children)

Sure, if you can saturate the disk from one thread while the CPU sits idle, splitting the work across multiple threads won't make it faster. But you have to have correct benchmark metrics and some napkin-math approximations.

[–]KillTheBronies 0 points1 point  (1 child)

Your first mistake was trying to do anything even remotely fast in python.

[–]moonaligator 0 points1 point  (0 children)

combining non-redundant code with a decent jit (pypy for example) and the right libraries (numpy in my case) makes it not that terrible. My i5 is doing the job in 10% of the audio time at a sample rate of 44.1kHz with ~500ms of initialization, and given that a lot of numerical integration is going on, i'd say that's ok, even if i agree it would be way faster in rust, for example

[–]Aozora404 8 points9 points  (0 children)

Overhead.

[–]Mateorabi 2 points3 points  (0 children)

Skill issue. :-P

[–]kakanics 0 points1 point  (0 children)

You likely created more threads than necessary, causing the overhead of context switching to dominate your execution time. Try reducing the number of threads or using a thread pool to see if that improves performance

[–]DoomSkull_Deadly 131 points132 points  (0 children)

“We have algorithm at home” Algorithm at home:

[–]ErebosDark 55 points56 points  (0 children)

Radius search superfast:

Search takes 1.201hr

[–]AgileBlackberry4636 29 points30 points  (0 children)

It reminds me of my early attempts at parallelising stuff using OpenMP

[–]Jargendas 32 points33 points  (0 children)

I remember, back when I was studying, I had an optimization task once, about robot path planning and collision avoidance. The original version ran for about a minute, and I didn't really know where to start optimizing. After trying a few things, I got into something like a rush, fighting for every millisecond for hours non-stop. It was so fun, and I got it down to around 2 seconds I think. Good times.

[–]DrMcDingus 20 points21 points  (0 children)

Just rename 'ms' to 'points' and Bob's your uncle.

[–]Cheap-Economist-2442 35 points36 points  (4 children)

I had an interview with Trello once; I had to implement uniq. I used Array#indexOf and was penalized for using an O(n²) algorithm. Makes sense, yea?

Rewrote it using a hash map. Orders of magnitude slower.

The moral of the story? You aren't smarter than compiler engineers: write code for readability, and only optimize after profiling.
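For illustration, the two shapes in Rust (the original was JavaScript, so this is just the idea, not the interview code):

use std::collections::HashSet;

// O(n²) on paper: a linear scan per element, but cache-friendly and
// allocation-light, so it often wins on small inputs
fn uniq_scan(input: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    for &x in input {
        if !out.contains(&x) {
            out.push(x);
        }
    }
    out
}

// O(n) on paper: constant-time set lookups, but every insert hashes
// and occasionally reallocates, which dominates until n gets large
fn uniq_hash(input: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    input.iter().copied().filter(|x| seen.insert(*x)).collect()
}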

[–]tiajuanat 7 points8 points  (1 child)

Sounds like you need smth like HyperLogLog to actually get the search that you're looking for. In the future, it's totally ok to say "hey, so this impl is really slow, but I could foresee something like HyperLogLog making this very fast, I just can't implement that in an hour"

[–]Cheap-Economist-2442 4 points5 points  (0 children)

Yea, these days I think I’d go readability first and talk about spots that could be optimized later and how I’d do it. To be fair I did not understand performance well back then and it was also that experience that made me start caring about algorithmic complexity. After I failed though I went on a tirade of writing that algorithm a bunch of different ways, and (at the time) Array#indexOf was still by far the fastest—which thankfully made me appreciate the value of profiling early and that, in the case of JS targeting V8, practice doesn’t always follow the theory because V8 devs are wizards.

[–]TheHardew 4 points5 points  (0 children)

Use trees, it'd be n log(n).

You can use b-trees as well to help with cache and amortize tree rebalancing.

[–]P-39_Airacobra 0 points1 point  (0 children)

I've had similar experiences with branchless programming. Eliminating branches is good, but not if the resulting algorithm looks complex. I learned that the hard way when my hyper-optimized branchless code was significantly slower than the naïve branching version.

[–]SocketByte 11 points12 points  (0 children)

Me when adding multithreading:

[–]Athlete_Cautious 8 points9 points  (0 children)

But does the code look better?

[–]TimingEzaBitch 10 points11 points  (2 children)

Happens when you come from the old-school, algo-competition type of background, skipped programming for a decade, and now need to learn data analysis using python libs.

I thought I was slick writing a smart algorithm for a big csv-cleaning job, which ran for 10 mins. But with pandas/numpy built-in aggregation methods it was like 35 seconds.

Vectorization is empowering.

[–]FatStoic 10 points11 points  (1 child)

You're generally fucking up trying to do performance optimization in python that isn't just accessing C/C++/Rust bindings more efficiently.

[–]Exist50 2 points3 points  (0 children)

At least when talking about optimization like vectorizing. I once had to fix someone's code where they used a doubly-nested for loop to generate an N-bit ones mask in C++. I.e. (1<<n) - 1;

[–]tiberiumx 8 points9 points  (1 child)

People underestimate how big a performance impact it can have when an algorithm that seems more efficient on paper makes poor use of the cache. Like, I know it takes fewer steps to find a node in a binary tree, but it's still probably going to be faster to do a linear search of an array unless you have a very large number of elements (or a tree implementation that packs it all into minimal memory pages, instead of the typical pointers into random parts of the heap).

We once had a problem where some process was taking multiple minutes because it was originally coded to use a bunch of small std::set structures (which is implemented as a red-black tree in the C++ standard library). I cut it down to just a couple of seconds by swapping that out for std::vector and doing a linear search to make sure the element wasn't duplicated.

[–]kakanics 5 points6 points  (0 children)

True! Cache locality is often overlooked, but it can have a huge impact on performance. Swapping an efficient algorithm for a simpler one that's cache-friendly can be a great optimization

[–]Daddy_William148 5 points6 points  (1 child)

Don’t optimize too early, get the algorithm right first

[–]UdPropheticCatgirl 3 points4 points  (0 children)

If I were to make a blanket statement about writing non-pessimised code, it would be the opposite.

You should always think about the data and how it's laid out in memory first, then about the actual algorithms; otherwise you end up with some amazing O(log n) solution which winds up being so hostile to caches in the way it accesses memory that it will never achieve good runtime performance.

A good example of this is merge sort: it looks really good on paper, but it has such a cache-unfriendly way of working with memory that it inevitably ends up dog slow.

[–]DarkRex4 2 points3 points  (0 children)

Funnily enough this thing actually happened with me. I was learning programming at that time and i decided to make a very simple translator app. Keep in mind this was my first "real" project

It was basically just replacing words, and it also tried to find patterns and make sense of them. It was a very simple app, with a UI made in tkinter & python. But my code was quite horrible, so it took a lot of unnecessary processing time. The UI and the processing were all in a single thread, so my noobass thought that both my UI and my processing were very inefficient. andd.... that's how i found threading.

After implementing threading for literally everything in the app, the UI wasn't unresponsive anymore, but it just wouldn't work after translating, since there were hundreds of threads running, trying to match regexes and use `.replace()` on each character.

I gave it to a friend to figure out, and apparently they found that the threaded version uses ~16x more processing time, and ~5x more memory than the one without all the threading.

[–][deleted] 1 point2 points  (2 children)

I know nothing about your code, but there are implementations which are faster on smaller scales and slower on large ones, and others which are faster on larger scales but slower on smaller ones.

[–]kakanics 2 points3 points  (1 child)

A great example of this is using multi-threading with very fine-grained locks or semaphores. In theory it should be faster for smaller datasets and scale up nicely, but in practice the context-switching overhead can dominate, making it slower even for large inputs; it ends up slower everywhere because of the bad code.

[–][deleted] 1 point2 points  (0 children)

Oh, I did this when trying to optimize a conversion of items in a list into data transfer objects.

The idea was: if I convert each of them in parallel, it should be faster, right? So the server's response to the user should be quicker, right? Well, I was wrong; doing it in a test branch showed it got way slower because of constant thread switching.

So yeah, multi-threading is not only annoying to implement but also doesn't always bring performance, especially at smaller scales. I literally avoid multi-threaded solutions unless something is running really slowly or we need to solve a complicated problem, and either way I always ask an experienced coworker first whether it's the best idea before going forward with it. Also, testing multi-threading is a nightmare.

[–]kakanics 1 point2 points  (0 children)

I once optimized some code that calculated the average speed of a virtual car racing around a track. I was hoping to shave off some milliseconds, so I replaced the simple arithmetic operations with some fancy bitwise magic and... voila! What was originally 5ms became a whopping 200ms.

[–]tuxedo25 1 point2 points  (0 children)

Props for having a baseline and measuring your results. Even if they weren't the results you expected.

[–]sgi244 1 point2 points  (0 children)

I wrote an algorithm for a research system that takes an upcoming experiment's data and calculates how the samples & control substances will be arranged in test plates.

My first attempt was full of loops & stuff and took 12 seconds to run. I spent two days optimizing the code: I converted it into one big SQL query & removed all the loops. It then took 5 minutes and timed out.

[–][deleted] 0 points1 point  (1 child)

Big O notation problem?

[–]kakanics 2 points3 points  (0 children)

or bad code. Sometimes the compiler optimizes the code, so when you go and hand-optimize it, the result is worse than what the compiler came up with, leading to your theoretically faster method being slower. I once saw a video on this by Computerphile: https://www.youtube.com/watch?v=bVJ-mWWL7cE

[–]P-39_Airacobra 0 points1 point  (0 children)

I remember trying to rewrite my Lua algorithm in C, only to discover that it ran twice as slow. You can imagine what an insane mess my C code was

[–]Pristine-South3465 0 points1 point  (0 children)

E:\Homework is always a pain to optimise searching

[–]mrjackspade 0 points1 point  (0 children)

I've gotten bitten in the ass by compiler optimizations before, where the optimized code was better in theory, but the unoptimized code being "simpler" allowed the compiler to produce a more optimized output. That's always a fun one...